openzim / gutenberg

Scraper for downloading the entire ebooks repository of Project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Gutenberg with python2 dies with a new error #54

Closed by kelson42 6 years ago

kelson42 commented 6 years ago
$ ./gutenberg2zim
CHECKING for dependencies on the system
PREPARING rdf-files cache from http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
    Downloading http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2 into rdf-files.tar.bz2
curl --fail --insecure --location --silent --show-error -C - --url http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2 --output rdf-files.tar.bz2
    Extracting rdf-files.tar.bz2 into rdf-files
tar -C rdf-files --strip-components 2 -x -f rdf-files.tar.bz2
Setting up the database
Created table for license
Loading fixtures for license
[fixtures] Created <License: 'Public domain in the USA.'>
[fixtures] Created <License: 'None'>
[fixtures] Created <License: 'Copyrighted. Read the copyright notice inside this book for details.'>
Created table for format
Loading fixtures for format
Created table for author
Loading fixtures for author
[fixtures] Created <Author: 'Various'>
[fixtures] Created <Author: 'Anonymous'>
Created table for book
Loading fixtures for book
Created table for bookformat
Loading fixtures for bookformat
PARSING rdf-files in rdf-files
    Looping throught RDF files in rdf-files
    Parsing file rdf-files/1/pg1.rdf
    Parsing file rdf-files/1766/pg1766.rdf
    Parsing file rdf-files/2648/pg2648.rdf
    Parsing file rdf-files/4412/pg4412.rdf
    Parsing file rdf-files/5294/pg5294.rdf
    WARN: Unusable book without any information 1766
    Parsing file rdf-files/1767/pg1767.rdf
    Parsing file rdf-files/6176/pg6176.rdf
    Parsing file rdf-files/7940/pg7940.rdf
    Parsing file rdf-files/883/pg883.rdf
    Parsing file rdf-files/3530/pg3530.rdf
    Parsing file rdf-files/8822/pg8822.rdf
    WARN: Unusable book without any information 1767
    Parsing file rdf-files/9704/pg9704.rdf
    Parsing file rdf-files/1768/pg1768.rdf
    Parsing file rdf-files/10586/pg10586.rdf
    Parsing file rdf-files/11468/pg11468.rdf
    Parsing file rdf-files/12350/pg12350.rdf
    Parsing file rdf-files/13232/pg13232.rdf
    Parsing file rdf-files/7058/pg7058.rdf
    Parsing file rdf-files/12351/pg12351.rdf
    Parsing file rdf-files/2/pg2.rdf
    Parsing file rdf-files/9705/pg9705.rdf
    Parsing file rdf-files/7941/pg7941.rdf
    Parsing file rdf-files/8823/pg8823.rdf
    Parsing file rdf-files/8824/pg8824.rdf
    Parsing file rdf-files/3531/pg3531.rdf
    Parsing file rdf-files/3532/pg3532.rdf
    Parsing file rdf-files/11469/pg11469.rdf
    Parsing file rdf-files/10587/pg10587.rdf
    Parsing file rdf-files/3533/pg3533.rdf
    Parsing file rdf-files/6177/pg6177.rdf
    Parsing file rdf-files/4413/pg4413.rdf
    Parsing file rdf-files/11470/pg11470.rdf
    Parsing file rdf-files/6178/pg6178.rdf
    Parsing file rdf-files/12352/pg12352.rdf
    Parsing file rdf-files/6179/pg6179.rdf
    Parsing file rdf-files/6180/pg6180.rdf
    Parsing file rdf-files/11471/pg11471.rdf
    Parsing file rdf-files/7942/pg7942.rdf
    Parsing file rdf-files/12353/pg12353.rdf
Traceback (most recent call last):
  File "/media/data/gutenberg/lib/python3.5/site-packages/peewee.py", line 3161, in get
    Parsing file rdf-files/14114/pg14114.rdf
    return next(clone.execute())
  File "/media/data/gutenberg/lib/python3.5/site-packages/peewee.py", line 2326, in next
    obj = self.iterate()
  File "/media/data/gutenberg/lib/python3.5/site-packages/peewee.py", line 2308, in iterate
    raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/data/gutenberg/gutenbergtozim/rdf.py", line 194, in save_rdf_in_database
    author_record = Author.get(gut_id=parser.author_id)
  File "/media/data/gutenberg/lib/python3.5/site-packages/peewee.py", line 4900, in get
    return sq.get()
  File "/media/data/gutenberg/lib/python3.5/site-packages/peewee.py", line 3165, in get
    % self.sql())
gutenbergtozim.database.AuthorDoesNotExist: Instance matching query does not exist:
SQL: SELECT "t1"."gut_id", "t1"."last_name", "t1"."first_names", "t1"."birth_year", "t1"."death_year" FROM "author" AS t1 WHERE ("t1"."gut_id" = ?)
PARAMS: ['37']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/data/gutenberg/lib/python3.5/site-packages/playhouse/apsw_ext.py", line 107, in execute_sql
    self._execute_sql(cursor, sql, params)
  File "/media/data/gutenberg/lib/python3.5/site-packages/playhouse/apsw_ext.py", line 100, in _execute_sql
    cursor.execute(sql, params or ())
  File "src/cursor.c", line 236, in resetcursor
apsw.ConstraintError: ConstraintError: UNIQUE constraint failed: author.gut_id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./gutenberg2zim", line 195, in <module>
    main(docopt(help, version=0.1))
  File "./gutenberg2zim", line 153, in main
    concurrency=CONCURRENCY, force=FORCE)
  File "/media/data/gutenberg/gutenbergtozim/rdf.py", line 82, in parse_and_fill
    Pool(concurrency).map(ppf, fpaths)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/media/data/gutenberg/gutenbergtozim/rdf.py", line 81, in <lambda>
    ppf = lambda x: parse_and_process_file(x, force)
  File "/media/data/gutenberg/gutenbergtozim/rdf.py", line 105, in parse_and_process_file
    save_rdf_in_database(parser)
  File "/media/data/gutenberg/gutenbergtozim/rdf.py", line 201, in save_rdf_in_database
    death_year=parser.death_year)
  File "/media/data/gutenberg/lib/python3.5/site-packages/peewee.py", line 4889, in create
    inst.save(force_insert=True)
  File "/media/data/gutenberg/lib/python3.5/site-packages/peewee.py", line 5082, in save
    pk_from_cursor = self.insert(**field_dict).execute()
  File "/media/data/gutenberg/lib/python3.5/site-packages/peewee.py", line 3506, in execute
    cursor = self._execute()
  File "/media/data/gutenberg/lib/python3.5/site-packages/peewee.py", line 2892, in _execute
    return self.database.execute_sql(sql, params, self.require_commit)
  File "/media/data/gutenberg/lib/python3.5/site-packages/playhouse/apsw_ext.py", line 107, in execute_sql
    self._execute_sql(cursor, sql, params)
  File "/media/data/gutenberg/lib/python3.5/site-packages/peewee.py", line 3578, in __exit__
    reraise(new_type, new_type(*exc_args), traceback)
  File "/media/data/gutenberg/lib/python3.5/site-packages/peewee.py", line 135, in reraise
    raise value.with_traceback(tb)
  File "/media/data/gutenberg/lib/python3.5/site-packages/playhouse/apsw_ext.py", line 107, in execute_sql
    self._execute_sql(cursor, sql, params)
  File "/media/data/gutenberg/lib/python3.5/site-packages/playhouse/apsw_ext.py", line 100, in _execute_sql
    cursor.execute(sql, params or ())
  File "src/cursor.c", line 236, in resetcursor
peewee.IntegrityError: ConstraintError: UNIQUE constraint failed: author.gut_id

dattaz commented 6 years ago

I just saw that I'm assigned to this... I think you still have data from a previous run; maybe try rm gutenberg.db.

(I can't reproduce your bug.)
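
For reference, the traceback shows a lookup for gut_id '37' failing with AuthorDoesNotExist, followed by a create that hits "UNIQUE constraint failed: author.gut_id". This would fit either leftover rows or two pool workers inserting the same author at nearly the same time, though neither is confirmed in this thread. Below is a minimal sketch of an insert that tolerates an existing row. It uses Python's standard sqlite3 module rather than the project's peewee/apsw code. The helper name get_or_create_author, the reduced table schema, and the sample author values are assumptions made for illustration.

```python
import sqlite3


def get_or_create_author(conn, gut_id, last_name, first_names=None):
    """Return the author row for gut_id, inserting it first if it is missing.

    Hypothetical helper: it is not part of gutenbergtozim.
    """
    # INSERT OR IGNORE does nothing when a row with this gut_id already
    # exists, so a leftover row or a row written by another worker does not
    # raise "UNIQUE constraint failed: author.gut_id".
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO author (gut_id, last_name, first_names) "
            "VALUES (?, ?, ?)",
            (gut_id, last_name, first_names),
        )
    # Read the row back, whoever inserted it.
    return conn.execute(
        "SELECT gut_id, last_name, first_names FROM author WHERE gut_id = ?",
        (gut_id,),
    ).fetchone()


# Assumed, simplified schema: only a few of the columns seen in the logged SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE author ("
    "gut_id TEXT PRIMARY KEY, last_name TEXT, first_names TEXT)"
)

# Placeholder author values; calling twice with the same gut_id is harmless.
first = get_or_create_author(conn, "37", "Doe", "Jane")
second = get_or_create_author(conn, "37", "Doe", "Jane")
row_count = conn.execute("SELECT COUNT(*) FROM author").fetchone()[0]
```

The second call leaves the table unchanged and returns the existing row. Whether the real code should ignore the duplicate or update the existing author is a design choice this sketch does not settle.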