kcroker / dpsprep

Python DJVU to PDF converter which preserves OCR text and bookmark metadata (e.g. TOC)

ValueError: invalid literal for int() with base 10 #23

Open mortang2410 opened 12 months ago

mortang2410 commented 12 months ago

I was trying to convert this DjVu file to PDF:

Gerald_B._Folland-Real_Analysis__Modern_Techniques_and_Their_Applications,_2nd_Ed.djvu.zip

I got the error below. Do you know what's going wrong? I'm on macOS using Python 3.11.6, pip 23.3.1, Poetry 1.7.1, and dpsprep 2.2.2.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/__main__.py", line 3, in <module>
    dpsprep()
  File "/Users/wilder/Library/Python/3.11/lib/python/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wilder/Library/Python/3.11/lib/python/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/wilder/Library/Python/3.11/lib/python/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wilder/Library/Python/3.11/lib/python/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/dpsprep.py", line 168, in dpsprep
    outline = OutlineTransformVisitor().visit(document.outline.sexpr)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/sexpr.py", line 34, in visit
    return self.visit_list(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/sexpr.py", line 12, in visit_list
    return method(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/outline.py", line 45, in visit_list_bookmarks
    self.visit(child, parent=outline)
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/sexpr.py", line 34, in visit
    return self.visit_list(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/sexpr.py", line 15, in visit_list
    return self.visit_plain_list(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/outline.py", line 13, in visit_plain_list
    page_number = int(page.value[1:]) - 1
                  ^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'f007.djvu'

v-- commented 12 months ago

@mortang2410 The error says that we cannot determine the page number from the page's title: the bookmark refers to 'f007.djvu' rather than to a page number, so the int() conversion in outline.py fails. I found no way to determine a page's title using python-djvulibre. It is certainly possible in principle, but as far as I understand, it is something that only actively maintained libdjvu bindings would support (or maybe not even those).

Anyway, to avoid raising an exception in this case, 2.2.3 logs a warning instead.
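To illustrate the "warn instead of raise" behaviour described above, here is a minimal sketch. It is not the actual dpsprep code: the function name `parse_bookmark_target`, the logger name, and the assumption that numeric bookmark targets look like `#12` are all hypothetical, inferred only from the `int(page.value[1:]) - 1` line in the traceback.

```python
import logging
from typing import Optional

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("outline_sketch")


def parse_bookmark_target(target: str) -> Optional[int]:
    """Return a zero-based page index for a target such as '#12'.

    Returns None (after logging a warning) when the target is not a
    plain page number, e.g. a page title like '#f007.djvu'.
    """
    # Assumption: targets carry a leading '#', as the traceback suggests.
    reference = target[1:] if target.startswith("#") else target
    try:
        # DjVu page numbers are one-based; convert to a zero-based index.
        return int(reference) - 1
    except ValueError:
        logger.warning(
            "Could not determine page number from bookmark target %r; "
            "skipping this bookmark",
            target,
        )
        return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(parse_bookmark_target("#12"))
    print(parse_bookmark_target("#f007.djvu"))
```

With this approach the conversion can continue, at the cost of dropping every bookmark whose target cannot be parsed.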

PS: This is not the only problem encountered while converting that particular DjVu file. It is a popular book; perhaps there are better DjVu/PDF files on the internet?

mortang2410 commented 12 months ago

With the new commit, I was able to convert the DjVu file successfully, but the bookmarks/outline are gone. Yeah, it's not a huge problem, as I can find other replacements. I just wanted to report this strange error message.

v-- commented 12 months ago

I tried to explain in my previous comment that the DjVu file uses "page titles", a relatively obscure feature that the libdjvu bindings we use do not support. I know of no way to generate the outline without some work on our dependencies.
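For anyone considering the work, the missing piece is a mapping from page titles to page numbers. The sketch below shows how bookmark targets could be resolved if such a mapping were available. The `title_to_index` dictionary and the `resolve_target` function are hypothetical. As noted above, no known way exists to obtain this mapping through the current bindings, and the example entry is made up for illustration.

```python
from typing import Mapping, Optional


def resolve_target(target: str, title_to_index: Mapping[str, int]) -> Optional[int]:
    """Resolve a bookmark target to a zero-based page index.

    Numeric targets ('#7') are converted directly; anything else is
    looked up as a page title. Returns None if the title is unknown.
    """
    # Assumption: targets carry a leading '#', as the traceback suggests.
    reference = target[1:] if target.startswith("#") else target
    if reference.isdecimal():
        # One-based page number to zero-based index.
        return int(reference) - 1
    return title_to_index.get(reference)


if __name__ == "__main__":
    # Made-up mapping; the real index of 'f007.djvu' is not known here.
    titles = {"f007.djvu": 6}
    print(resolve_target("#7", titles))
    print(resolve_target("#f007.djvu", titles))
    print(resolve_target("#unknown.djvu", titles))
```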

This issue can remain open. Somebody has to volunteer to work on DjVuLibre and its bindings in order to fix it, however.