raduangelescu / gutenbergpy

Gutenberg cache and query library
MIT License

Problem processing book id 14419 #19

Open s-moon opened 5 months ago

s-moon commented 5 months ago

I am using a Mac (MacBook Pro, 2023) with Python 3.11. I installed the package with pip in a virtual environment.

When trying to populate the cache:

from gutenbergpy.gutenbergcache import GutenbergCache
# for sqlite
GutenbergCache.create()

I see this error:

Deleting old files
 Downloading rdf-files.tar.bz2 : [####################]took 7.320369
 Extracting  rdf-files.tar.bz2 : [###################]took 50.328341
 Processing progress: 14419 / 73217 : [###                ]Traceback (most recent call last):
  File "/Users/sm/github/gutenbergpy/gtpy/cache.py", line 3, in <module>
    GutenbergCache.create()
  File "/Users/sm/github/gutenbergpy/gtpy/lib/python3.11/site-packages/gutenbergpy/gutenbergcache.py", line 62, in create
    result = parser.do()
             ^^^^^^^^^^^
  File "/Users/sm/github/gutenbergpy/gtpy/lib/python3.11/site-packages/gutenbergpy/parse/rdfparser.py", line 44, in do
    doc = etree.parse(file_path,etree.ETCompatXMLParser())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/etree.pyx", line 3569, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1952, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1978, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1200, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 633, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 743, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 670, in lxml.etree._raiseParseError
OSError: Error reading file 'cache/epub/test/pgtest.rdf': failed to load external entity "cache/epub/test/pgtest.rdf"

I can ignore that book by editing parse/rdfparser.py with:

Line 42:
if idx == 14419:
    continue

With that book skipped, the rest of the books are processed without errors.
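A less brittle variant of this workaround would be to skip any file that cannot be read, rather than hardcoding one index. The following is only a sketch, not gutenbergpy's code: `parse_skipping_unreadable` is a hypothetical helper, and it uses the standard library's `xml.etree.ElementTree` so that it runs on its own. The traceback above shows lxml raising `OSError` for the same condition, so the same `except` clause should apply if `lxml.etree.parse` is passed in as `parse`.

```python
import xml.etree.ElementTree as ET


def parse_skipping_unreadable(paths, parse=ET.parse):
    """Parse each path; collect unreadable files instead of aborting.

    Hypothetical helper for illustration, not part of gutenbergpy.
    """
    parsed, skipped = [], []
    for path in paths:
        try:
            parsed.append((path, parse(path)))
        except OSError as exc:
            # A missing file raises FileNotFoundError (a subclass of
            # OSError) with the standard library parser.
            skipped.append((path, str(exc)))
    return parsed, skipped
```

The `skipped` list can then be printed at the end of the run, so a missing file is reported without stopping the rest of the cache from being built.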

I suspect there's a bad link in the cached RDF that the parser isn't happy about, but I've looked at the file and nothing jumps out. I did find this Stack Overflow post, however:

https://stackoverflow.com/a/10457801

which I suspect may be relevant.

Any ideas?

Thanks, Stephen

adrian-chen commented 4 months ago

If you look at the file in the test directory, its name doesn't match the name of the containing folder: the file in the cache is cache/epub/test/pg11.rdf.

The parser assumes that the filename always matches its containing folder, so it looks for cache/epub/test/pgtest.rdf, which doesn't exist, and processing fails.
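One way to avoid relying on that naming assumption is to look at what is actually inside each folder. This is a minimal sketch under stated assumptions, not the library's implementation: `find_rdf_files` is a hypothetical helper, and the `cache/epub/<folder>/pg<folder>.rdf` layout is inferred from the paths in this thread.

```python
from pathlib import Path


def find_rdf_files(cache_root):
    """Yield (folder_name, rdf_path) for each folder under cache_root.

    Hypothetical helper: prefers pg<folder>.rdf, and falls back to the
    only *.rdf file in the folder when the names do not match.
    """
    for folder in sorted(Path(cache_root).iterdir()):
        if not folder.is_dir():
            continue
        expected = folder / f"pg{folder.name}.rdf"
        if expected.is_file():
            yield folder.name, expected
            continue
        candidates = sorted(folder.glob("*.rdf"))
        if len(candidates) == 1:
            yield folder.name, candidates[0]
        # Zero or several candidates is ambiguous, so the folder is skipped.
```

With this approach the test folder would resolve to pg11.rdf instead of failing on pgtest.rdf. Whether that file should be indexed at all, or the test folder ignored entirely, is a separate decision for the maintainer.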