ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0
14.11k stars 1.02k forks source link

Refactor code that tramples previous variable assignment (was: possible bug) #997

Open kshpytsya opened 2 years ago

kshpytsya commented 2 years ago

I may be missing something but it seems that the value of self._has_text set in this section of code:

https://github.com/ocrmypdf/OCRmyPDF/blob/5c6030960945fe299291fa134cff35c86a644b9f/src/ocrmypdf/pdfinfo/info.py#L779-L788

is always overwritten here:

https://github.com/ocrmypdf/OCRmyPDF/blob/5c6030960945fe299291fa134cff35c86a644b9f/src/ocrmypdf/pdfinfo/info.py#L804-L822

jbarlow83 commented 2 years ago

Kind of? This ambiguity is not intended, but it looks to me like it's going to end up doing the right thing anyway.

If the first if is executed, then we're going to get the same result in the second if for self._has_text too, unless there are bugs in either implementation. The former uses pdfminer.six to do a much slower analysis that figures out where all the text is on the page; the latter checks for presence of vector, text and images. So the result of that boolean is going to be equal in both cases.

If you have something that demonstrates a likely issue here please let me know.