Why not use re2 to replace re?

Ryuchen commented 7 years ago

I have a file for which PDF parsing runs far too long.

Traceback (most recent call last):
  File "/home/soft/HawkEye/utils/../lib/hawkeye/core/plugins.py", line 230, in process
    data = current.run()
  File "/home/soft/HawkEye/utils/../modules/processing/static.py", line 1860, in run
    static = PDF(self.file_path).run()
  File "/home/soft/HawkEye/utils/../modules/processing/static.py", line 1080, in run
    results = self._parse(self.file_path)
  File "/home/soft/HawkEye/utils/../modules/processing/static.py", line 882, in _parse
    ret, self.pdf = PDF_parser.parse(filepath, forceMode=True, looseMode=True, manualAnalysis=True)
  File "/usr/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7035, in parse
    rawIndirectObjects = self.getIndirectObjects(bodyContent, looseMode)
  File "/usr/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7792, in getIndirectObjects
    matchingObjectsAux = regExp.findall(content)
KeyboardInterrupt

I think it may be a problem with re, so why not use re2 to replace re?

After I replaced it, parsing ran very fast!

jesparza commented 7 years ago

Hi @Ryuchen!

Thanks for the suggestion. Do you know if there is a pip installation of re2? Just looking at the repo I don't see it there. Also, taking a look at the repo documentation, it says there is no findall function nor flags, right? Could you share the changes you made to use re2 with peepdf, and the test timings?

Thanks a lot!
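
For anyone wanting to produce comparable timings, here is a minimal sketch of one way to measure findall under both modules. The pattern and the input are hypothetical stand-ins, not the regular expression or the PDF from this issue, and re2 is assumed to be an optional install exposing compile and findall as re does.

```python
import re
import timeit

try:
    import re2  # optional; assumed to offer the same compile/findall API as re
except ImportError:
    re2 = None

# Hypothetical pattern and input, standing in for peepdf's real ones.
PATTERN = r"(\d+)\s+(\d+)\s+obj"
CONTENT = "1 0 obj\n<< /Type /Catalog >>\nendobj\n" * 2000


def time_findall(module, repeat=3):
    """Return the best wall-clock time in seconds for one findall pass."""
    compiled = module.compile(PATTERN)
    runs = timeit.repeat(lambda: compiled.findall(CONTENT), number=1, repeat=repeat)
    return min(runs)


timings = {"re": time_findall(re)}
if re2 is not None:
    timings["re2"] = time_findall(re2)

for name, seconds in sorted(timings.items()):
    print("%s: %.4f s" % (name, seconds))
```

Numbers from synthetic input like this say little about a pathological file, so the slow PDF itself would be the better test case.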

Ryuchen commented 7 years ago

https://github.com/axiak/pyre2

Just use the repo above, with this code:

try:
    import re2 as re
except ImportError:
    import re

jesparza commented 7 years ago

Mmm, it looks good, I will take a look, thanks!!

Titotix commented 7 years ago

It might be an issue for peepdf. Quoted from the re2 documentation:

Note: The re2 module treats byte strings as UTF-8. This is fully backwards compatible with 7-bit ascii. However, bytes containing values larger than 0x7f are going to be treated very differently in re2 than in re. The RE library quietly ignores invalid utf8 in input strings, and throws an exception on invalid utf8 in patterns. For example:

>>> re.findall(r'.', '\x80\x81\x82')
['\x80', '\x81', '\x82']
>>> re2.findall(r'.', '\x80\x81\x82')
[]

If you require the use of regular expressions over an arbitrary stream of bytes, then this library might not be for you.
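
One possible way to guard against this, sketched under assumptions: probe re2 with bytes above 0x7f at import time and keep the standard re module whenever the results differ. The helper name pick_regex_module is hypothetical and is not part of peepdf or re2.

```python
import re


def pick_regex_module():
    """Return re2 if it is importable and agrees with re on raw bytes, else re."""
    try:
        import re2  # optional dependency; assumed to expose findall like re
    except ImportError:
        return re

    probe = b"\x80\x81\x82"  # bytes above 0x7f that are not valid UTF-8
    try:
        same = re2.findall(b".", probe) == re.findall(b".", probe)
    except Exception:
        # Any failure on the probe is treated as "not byte-safe".
        same = False
    return re2 if same else re


regex = pick_regex_module()
print(regex.__name__)
```

With a check like this, a build of re2 that drops non-UTF-8 bytes would never be selected, at the cost of losing the speedup on those installations.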

jesparza / peepdf

Why not use re2 to replace re? #62