openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0
126 stars 37 forks source link

Database locked #169

Closed rgaudin closed 1 year ago

rgaudin commented 1 year ago

@benoit74, latest run encountered a new DB lock situation. Could you take a look?

HTTP request sent, awaiting response... 200 OK
Length: 8757 (8.6K) [image/jpeg]
Saving to: 'dl-cache/68889/unoptimized/68889_cover_image.jpg'

     0K ........                                              100% 87.6M=0s

2023-02-03 04:16:35 (87.6 MB/s) - 'dl-cache/68889/unoptimized/68889_cover_image.jpg' saved [8757/8757]

--2023-02-03 04:16:35--  http://dante.pglaf.org/cache/epub/68890/pg68890.epub
Resolving dante.pglaf.org (dante.pglaf.org)... 65.50.255.18, 2604:3200:0:3:21e:67ff:fe8b:3834
Connecting to dante.pglaf.org (dante.pglaf.org)|65.50.255.18|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 453533 (443K)
Saving to: 'dl-cache/68890/unoptimized/68890.epub'

     0K .......... .......... .......... .......... .......... 11%  152K 3s
    50K .......... .......... .......... .......... .......... 22%  304K 2s
   100K .......... .......... .......... .......... .......... 33% 69.9M 1s
   150K .......... .......... .......... .......... .......... 45%  496K 1s
   200K .......... .......... .......... .......... .......... 56%  782K 1s
   250K .......... .......... .......... .......... .......... 67%  105M 0s
   300K .......... .......... .......... .......... .......... 79%  502K 0s
   350K .......... .......... .......... .......... .......... 90% 63.5M 0s
   400K .......... .......... .......... .......... ..        100%  678K=0.8s

2023-02-03 04:16:37 (538 KB/s) - 'dl-cache/68890/unoptimized/68890.epub' saved [453533/453533]

--2023-02-03 04:16:41--  http://dante.pglaf.org/cache/epub/68890/pg68890.cover.medium.jpg
Resolving dante.pglaf.org (dante.pglaf.org)... 65.50.255.18, 2604:3200:0:3:21e:67ff:fe8b:3834
Connecting to dante.pglaf.org (dante.pglaf.org)|65.50.255.18|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20936 (20K) [image/jpeg]
Saving to: 'dl-cache/68890/unoptimized/68890_cover_image.jpg'

     0K .......... ..........                                 100%  128K=0.2s

2023-02-03 04:16:41 (128 KB/s) - 'dl-cache/68890/unoptimized/68890_cover_image.jpg' saved [20936/20936]

Traceback (most recent call last):
  File "/usr/local/bin/gutenberg2zim", line 4, in <module>
    __import__('pkg_resources').run_script('gutenberg2zim==1.1.9', 'gutenberg2zim')
  File "/usr/local/lib/python3.11/site-packages/pkg_resources/__init__.py", line 672, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python3.11/site-packages/pkg_resources/__init__.py", line 1472, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.11/site-packages/gutenberg2zim-1.1.9-py3.11.egg/EGG-INFO/scripts/gutenberg2zim", line 236, in <module>
    main(docopt(help, version=VERSION))
  File "/usr/local/lib/python3.11/site-packages/gutenberg2zim-1.1.9-py3.11.egg/EGG-INFO/scripts/gutenberg2zim", line 186, in main
    download_all_books(
  File "/usr/local/lib/python3.11/site-packages/gutenberg2zim-1.1.9-py3.11.egg/gutenbergtozim/download.py", line 394, in download_all_books
    Pool(concurrency).map(dlb, available_books)
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/gutenberg2zim-1.1.9-py3.11.egg/gutenbergtozim/download.py", line 390, in dlb
    return download_book(
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/gutenberg2zim-1.1.9-py3.11.egg/gutenbergtozim/download.py", line 304, in download_book
    book.save()
  File "/usr/local/lib/python3.11/site-packages/peewee.py", line 6740, in save
    rows = self.update(**field_dict).where(self._pk_expr()).execute()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/peewee.py", line 1962, in inner
    return method(self, database, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/peewee.py", line 2033, in execute
    return self._execute(database)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/peewee.py", line 2551, in _execute
    cursor = database.execute(self)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/peewee.py", line 3250, in execute
    return self.execute_sql(sql, params, commit=commit)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/playhouse/apsw_ext.py", line 133, in execute_sql
    cursor.execute(sql, params or ())
  File "src/cursor.c", line 235, in resetcursor
apsw.BusyError: BusyError: database is locked
benoit74 commented 1 year ago

It seems a bit weird because everything was working fine until around the book ID 68890, meaning that lots of stuff has already been done successfully.

Looking at SQLITE documentation, it looks like since we are in WAL mode, SQLITE_BUSY error should never happen. Except that they might happen sometimes. See https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

Since we are not managing this condition at all today + it is documented that it might happen under some rare condition (which we meet from my point of view due to the multiprocessing which is opening / closing DB connection for each book) + it happens very rarely (only once every 50000 books, more or less), I would suggest to first try to fix this issue with a simple constant backoff (i.e. if an apsw.BusyError happens, just retry, we might be in a corner case).

Does it makes any sense from your experience?