jesparza / peepdf

Powerful Python tool to analyze PDF documents
http://peepdf.eternal-todo.com
GNU General Public License v3.0

Why not use re2 to replace re? #62

Open Ryuchen opened 7 years ago

Ryuchen commented 7 years ago

https://github.com/facebook/pyre2

I have a file for which PDF parsing takes far too long; I had to interrupt it:

Traceback (most recent call last):
  File "/home/soft/HawkEye/utils/../lib/hawkeye/core/plugins.py", line 230, in process
    data = current.run()
  File "/home/soft/HawkEye/utils/../modules/processing/static.py", line 1860, in run
    static = PDF(self.file_path).run()
  File "/home/soft/HawkEye/utils/../modules/processing/static.py", line 1080, in run
    results = self._parse(self.file_path)
  File "/home/soft/HawkEye/utils/../modules/processing/static.py", line 882, in _parse
    ret, self.pdf = PDF_parser.parse(filepath, forceMode=True, looseMode=True, manualAnalysis=True)
  File "/usr/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7035, in parse
    rawIndirectObjects = self.getIndirectObjects(bodyContent, looseMode)
  File "/usr/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7792, in getIndirectObjects
    matchingObjectsAux = regExp.findall(content)
KeyboardInterrupt

It looks like this may be a problem with the re module, so why not use re2 to replace re?

After I replaced it, parsing ran very fast!

jesparza commented 7 years ago

Hi @Ryuchen!

Thanks for the suggestion. Do you know if there is a pip installation of re2? Just looking at the repo I don't see one there. Also, according to the repo documentation there is no findall function and no flags, right? Could you share the changes you made to use re2 with peepdf, and the test timings?

Thanks a lot!
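
For anyone who wants to collect such timings, here is a minimal sketch using the standard timeit module. The helper name `time_findall`, the sample data and the pattern are all hypothetical and only for illustration; they are not the regular expression peepdf actually uses in getIndirectObjects. The fallback import mirrors the one proposed in this thread and assumes re2 exposes a re-like `compile`/`findall`.

```python
import re
import timeit

try:
    import re2  # optional; assumed to offer a re-compatible compile/findall
except ImportError:
    re2 = None


def time_findall(regex_module, pattern, data, repeat=3):
    """Return the best wall-clock time in seconds for one findall pass."""
    compiled = regex_module.compile(pattern)
    runs = timeit.repeat(lambda: compiled.findall(data), number=1, repeat=repeat)
    return min(runs)


# Hypothetical stand-in for PDF body content and an object-header pattern.
sample = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n" * 1000
pattern = br"\d+ \d+ obj"

timings = {"re": time_findall(re, pattern, sample)}
if re2 is not None:
    try:
        timings["re2"] = time_findall(re2, pattern, sample)
    except Exception:
        # The installed re2 binding may not accept this pattern or input type.
        pass

for name, seconds in sorted(timings.items()):
    print("%s: %.6f s" % (name, seconds))
```

Running this against the real problem file instead of `sample` would give a more meaningful comparison.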

Ryuchen commented 7 years ago

https://github.com/axiak/pyre2

Just use the repo above, together with this import:

try:
    import re2 as re
except ImportError:
    import re

jesparza commented 7 years ago

Mmm, it looks good, I will take a look, thanks!!

Titotix commented 7 years ago

This might be an issue for peepdf. Quoting the re2 documentation:

Note: The re2 module treats byte strings as UTF-8. This is fully backwards compatible with 7-bit ascii. However, bytes containing values larger than 0x7f are going to be treated very differently in re2 than in re. The RE library quietly ignores invalid utf8 in input strings, and throws an exception on invalid utf8 in patterns. For example:

>>> re.findall(r'.', '\x80\x81\x82')
['\x80', '\x81', '\x82']
>>> re2.findall(r'.', '\x80\x81\x82')
[]

If you require the use of regular expressions over an arbitrary stream of bytes, then this library might not be for you.
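
Since PDF streams routinely contain bytes above 0x7f, one cautious option is to use re2 only when it handles such bytes the same way re does. The following is a minimal sketch of that idea, not something peepdf implements: `pick_regex_module` is a hypothetical helper, and it assumes the installed re2 binding exposes a re-like `findall`. Any import failure, exception or mismatch falls back to the standard library.

```python
import re as std_re


def pick_regex_module():
    """Return re2 only if it matches high bytes like the stdlib re does."""
    try:
        import re2  # optional dependency; re-like API is an assumption
    except ImportError:
        return std_re

    probe = b"\x80\x81\x82"  # same bytes as the documentation example
    try:
        if re2.findall(b".", probe) == std_re.findall(b".", probe):
            return re2
    except Exception:
        # The binding rejected the pattern or input; treat as incompatible.
        pass
    return std_re


regex = pick_regex_module()
matches = regex.findall(b".", b"\x80\x81\x82")
print(len(matches))
```

With this guard, a binding that drops non-UTF-8 bytes (as in the quoted example) is never selected, at the cost of losing the speedup on such installations.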