pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Figure text should not be in the output when all_texts=False #455

Open pietermarsman opened 4 years ago

pietermarsman commented 4 years ago

Bug report

all_texts should control whether text is extracted from figures (at least according to the docs).

Expected Behaviour (at least by my understanding):

Actual Behaviour:

:heavy_check_mark: When all_texts=True the text is extracted and layout analysis is performed.

:no_entry: When all_texts=False, layout analysis is not performed (which is fine), but the text still appears in the output (which is wrong).

I think the actual issue should be "Figure text should not be in the output when all_texts=False".

Note: I've seen some PDFs where ALL the text is inside a figure according to the structure. In this case, the output would appear blank unless you pass all_texts=True. This is probably unavoidable, but we'd have to watch for this case if people submit tickets about there being no output.
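For triaging "no output" reports like the case in the note above, a minimal sketch (not part of pdfminer.six) could compare the two settings. It assumes `extract_text` from `pdfminer.high_level` and `LAParams(all_texts=...)` from `pdfminer.layout`; the helper names `extract` and `only_figure_text` are hypothetical. The check is only meaningful once all_texts=False actually omits figure text, which is what this issue asks for.

```python
from typing import Callable


def extract(path: str, all_texts: bool) -> str:
    """Extract text from a PDF with the given all_texts setting."""
    # Imported lazily so only_figure_text can be exercised with a stand-in
    # extractor when pdfminer.six is not installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(path, laparams=LAParams(all_texts=all_texts))


def only_figure_text(
    path: str,
    extractor: Callable[[str, bool], str] = extract,
) -> bool:
    """Return True if the PDF yields text only when all_texts=True.

    Hypothetical helper: assumes all_texts=False omits figure text,
    i.e. the behaviour requested in this issue.
    """
    without_figures = extractor(path, False).strip()
    with_figures = extractor(path, True).strip()
    return not without_figures and bool(with_figures)
```

If `only_figure_text("some.pdf")` returns True, the blank output is likely explained by all of the text living inside figures, and passing all_texts=True should be the suggested workaround.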

jstockwin commented 4 years ago

Hm, actually @pietermarsman this isn't quite what I meant. Just to expand on my comment and hopefully make it clearer:

all_texts should control whether text is extracted from figures (at least according to the docs).

Expected Behaviour (at least by my understanding):

Actual Behaviour:

:heavy_check_mark: When all_texts=True the text is extracted and layout analysis is performed.

❌ When all_texts=False, layout analysis is not performed (which is fine), but the text still appears in the output (which is wrong).

I think the actual issue should be "Figure text should not be in the output when all_texts=False".

Note: I've seen some PDFs where ALL the text is inside a figure according to the structure. In this case, the output would appear blank unless you pass all_texts=True. This is probably unavoidable, but we'd have to watch for this case if people submit tickets about there being no output.

pietermarsman commented 4 years ago

Ah, oops :facepalm: I thought I was being proactive :wink: I've changed the title and the description to your explanation.

jstockwin commented 4 years ago

No worries, I should have just opened an issue :)