Open pietermarsman opened 4 years ago
Hm, actually @pietermarsman this isn't quite what I meant. Just to expand on my comment and hopefully make it clearer:
all_texts should control whether text is extracted from figures (at least according to the docs).
Expected Behaviour (at least by my understanding):
all_texts=True should extract text from figures, performing layout analysis.
all_texts=False should ignore text in figures entirely, and so such text should not be in the output at all.
Actual Behaviour:
:heavy_check_mark: When all_texts=True, the text is extracted and layout analysis is performed.
❌ When all_texts=False, layout analysis is not performed (which is fine), but the text still appears in the output (which is wrong).
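To make the expected behaviour concrete, here is a minimal sketch using a toy page model. The `Text` and `Figure` classes and the `extract` function are hypothetical names made up for illustration; this is not the library's code, only a model of what the flag should do.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Text:
    """A run of text placed directly on the page (hypothetical type)."""
    content: str


@dataclass
class Figure:
    """A figure that may itself contain text (hypothetical type)."""
    children: List[Text] = field(default_factory=list)


def extract(page: List[Union[Text, Figure]], all_texts: bool) -> str:
    """Return page text; descend into figures only when all_texts is True."""
    parts = []
    for item in page:
        if isinstance(item, Text):
            parts.append(item.content)
        elif isinstance(item, Figure) and all_texts:
            parts.extend(child.content for child in item.children)
        # A Figure with all_texts=False is skipped entirely (expected behaviour).
    return "\n".join(parts)


page = [Text("Body paragraph"), Figure([Text("Axis label")])]
print(extract(page, all_texts=True))   # body text and figure text
print(extract(page, all_texts=False))  # body text only
```

The bug described here corresponds to the figure text still being emitted in the second call.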
I think the actual issue should be "Figure text should not be in the output when all_texts=False".
Note: I've seen some PDFs where ALL the text is inside a figure according to the structure. In this case, the output would appear blank unless you pass all_texts=True. This is probably unavoidable, but we'd have to watch for this case if people submit tickets about there being no output.
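For anyone who hits the blank-output case, one possible workaround on the caller's side is to retry with all_texts=True when the first pass returns nothing. This is only a sketch: `extract_with_fallback` and `fake_extract` are hypothetical names, and the `extract` callable stands in for whatever extraction call you actually use.

```python
from typing import Callable


def extract_with_fallback(extract: Callable[[bool], str]) -> str:
    """Try all_texts=False first; retry with all_texts=True if output is blank."""
    text = extract(False)
    if text.strip():
        return text
    # Nothing outside figures: the PDF may keep all its text inside a figure.
    return extract(True)


def fake_extract(all_texts: bool) -> str:
    # Hypothetical stand-in for a document whose text is all inside a figure.
    return "Text inside figure" if all_texts else ""


print(extract_with_fallback(fake_extract))
```

Whether a silent retry like this is desirable is a judgment call; logging a warning when the fallback is taken may be a better fit for some callers.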
Ah, oops :facepalm: I thought I was being proactive :wink: I've changed the title and the description to your explanation.
No worries, I should have just opened an issue :)
Bug report
all_texts should control whether text is extracted from figures (at least according to the docs).
Expected Behaviour (at least by my understanding):
all_texts=True should extract text from figures, performing layout analysis.
all_texts=False should ignore text in figures entirely, and so such text should not be in the output at all.
Actual Behaviour:
:heavy_check_mark: When all_texts=True, the text is extracted and layout analysis is performed.
:no_entry: When all_texts=False, layout analysis is not performed (which is fine), but the text still appears in the output (which is wrong).
I think the actual issue should be "Figure text should not be in the output when all_texts=False".
Note: I've seen some PDFs where ALL the text is inside a figure according to the structure. In this case, the output would appear blank unless you pass all_texts=True. This is probably unavoidable, but we'd have to watch for this case if people submit tickets about there being no output.