johnlinp / pdf-to-markdown

Convert PDF files into markdown files
BSD 3-Clause "New" or "Revised" License

Support Python 3.x #19

Open johnlinp opened 5 years ago

johnlinp commented 5 years ago

Python 2 is going to be deprecated; let's support Python 3.x.

johnlinp commented 5 years ago

Some issues were pointed out in https://github.com/johnlinp/pdf-to-markdown/issues/17#issuecomment-509132956

nidhi-wgl commented 4 years ago

I converted the existing code base to Python 3 using 2to3, installed the dist, and tried running it. It gives an error:

Traceback (most recent call last):
  File "/usr/local/bin/pdf2md", line 4, in <module>
    __import__('pkg_resources').run_script('pdf-to-markdown==0.1.0', 'pdf2md')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1469, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/EGG-INFO/scripts/pdf2md", line 32, in <module>
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/EGG-INFO/scripts/pdf2md", line 27, in main
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/pdf2md/writer.py", line 27, in write
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/pdf2md/writer.py", line 50, in _write_simple
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/pdf2md/pile.py", line 74, in gen_markdown
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/pdf2md/pile.py", line 266, in _gen_paragraph_markdown
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/pdf2md/syntax.py", line 47, in pattern
  File "/usr/lib/python3.7/re.py", line 183, in search
    return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object

I thought it might be something with re.match or re.search, but I guess the content is arriving as bytes rather than as a string. It looks like an encoding/decoding issue, and it happens even when parsing English-only text. I also see this error:

TypeError: can only concatenate str (not "bytes") to str
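For anyone trying to reproduce this outside the package, here is a minimal sketch of how both errors can arise in Python 3. The `text` value is a hypothetical stand-in for whatever pdf2md encodes upstream; it is not taken from the project's code.

```python
import re

# Hypothetical stand-in: a string that was encoded somewhere upstream,
# so it is a bytes object in Python 3 (it was a plain str in Python 2).
text = "# Heading".encode("utf8")


def search_error():
    """Apply a str pattern to bytes and return the resulting error message."""
    try:
        re.search(r"^#", text)
    except TypeError as exc:
        return str(exc)
    return None


def concat_error():
    """Concatenate str and bytes and return the resulting error message."""
    try:
        "prefix " + text
    except TypeError as exc:
        return str(exc)
    return None


print(search_error())
print(concat_error())
```

Both calls fail because Python 3 never converts implicitly between `str` and `bytes`.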

I just wanted to report the error, nothing else. I might try to work on it when I have some time.

nidhi-wgl commented 4 years ago

I am not sure if this is the correct way to do it, but .decode(encoding="utf-8") fixes it, and the extension works perfectly with all files, including the example file in the repo.
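A rough sketch of the decode approach, assuming the value reaching the regex may be either `bytes` or `str`. The helper name `to_text` and the sample input are hypothetical and do not come from the repository.

```python
import re


def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical input: content that was encoded earlier in the pipeline.
raw = "1. Introduction".encode("utf-8")

# After decoding, a str pattern can be applied without a TypeError.
match = re.search(r"^\d+\.", to_text(raw))
print(match.group(0))
```

The `isinstance` check lets the same call site accept values that are already `str`, so existing code does not have to be removed.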

johnlinp commented 4 years ago

Hi @nidhi-wgl,

According to @nella17's PR (#22), we can see that simply removing the .encode('utf8') part should work. Please see https://github.com/johnlinp/pdf-to-markdown/pull/22/commits/6791abf93da7c2aa79ab3e7cd4ae87957bcae271.
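As a sketch of why dropping the encode call can be enough: in Python 3, `str` is already Unicode, so text can stay as `str` all the way through and the encoding can be left to the output stream. The functions below are hypothetical illustrations and are not the code changed in the linked commit.

```python
import io


def gen_markdown(lines):
    """Hypothetical generator: build Markdown as str, with no .encode() call."""
    return "\n".join("# " + line for line in lines) + "\n"


def write_markdown(stream, lines):
    """Write the generated text to a text-mode stream, which handles encoding."""
    stream.write(gen_markdown(lines))


# In-memory text stream standing in for a real output file.
buffer = io.StringIO()
write_markdown(buffer, ["Title", "Résumé"])
print(buffer.getvalue())
```

When writing to a real file, opening it with `open(path, "w", encoding="utf-8")` would let the file object do the encoding at the I/O boundary.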

Thanks @nella17!

nidhi-wgl commented 4 years ago

Yeah, that is also one way around it. I didn't want to remove .encode or any existing code, so I was proposing to add the decode line for anyone who wants to run the code in Python 3.