openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Exporting crash, "peewee.OperationalError: too many SQL variables" #8

Closed by kelson42 9 years ago

kelson42 commented 10 years ago

command: ./dump-gutenberg.py -l fr,es -f pdf,epub

log:

GET /4/6/3/1/46314/46314-h.htm HTTP/1.1" 404 252
EXPORTING ebooks to satic folder (and JSON)
        Filtered book collection size: 2808
        Filtered book collection, PDF: 45
        Filtered book collection, ePUB: 2778
        Filtered book collection, HTML: 2786
                Dumping full_by_popularity.js
                Dumping full_by_title.js
                Dumping lang_fr_by_popularity.js
                Dumping lang_fr_by_title.js
                Dumping authors_lang_fr.js
                Dumping lang_es_by_popularity.js
                Dumping lang_es_by_title.js
                Dumping authors_lang_es.js
Traceback (most recent call last):
  File "./dump-gutenberg.py", line 150, in <module>
    main(docopt(help, version=0.1))
  File "./dump-gutenberg.py", line 137, in main
    only_books=BOOKS)
  File "/media/data/tmp/gutenberg/gutenberg/export.py", line 96, in export_all_books
    formats=formats)
  File "/media/data/tmp/gutenberg/gutenberg/export.py", line 576, in export_to_json_helpers
    for author in authors:
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2139, in __iter__
    return iter(self.execute())
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2132, in execute
    self._qr = ResultWrapper(model_class, self._execute(), query_meta)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 1838, in _execute
    return self.database.execute_sql(sql, params, self.require_commit)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2414, in execute_sql
    self.commit()
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2283, in __exit__
    reraise(new_type, new_type(*exc_value.args), traceback)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2406, in execute_sql
    cursor.execute(sql, params or ())
peewee.OperationalError: too many SQL variables

Seb35 commented 10 years ago

I had this problem on my computer too for large selections like the one you have here, but it worked on Renaud’s MacBook, which is probably more powerful. Anyway, this is a performance bug that should be resolved.
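For context, "too many SQL variables" is the message SQLite returns when a single statement binds more parameters than the library allows (999 by default in older builds, higher in newer ones), so it tends to appear only with large selections, and different SQLite builds may behave differently. A minimal sketch of one common workaround, splitting a large IN (...) list into batches, is shown below. It uses the standard sqlite3 module rather than peewee; the helper select_in_chunks, the author table layout, and the batch size of 500 are assumptions for illustration, not the project's code.

```python
import sqlite3


def select_in_chunks(conn, sql_template, values, chunk_size=500):
    """Run a `... IN (...)` query in batches and return all rows.

    sql_template must contain one `{placeholders}` marker.
    chunk_size=500 is an assumption, chosen to stay below SQLite's
    historical default of 999 bound variables per statement.
    """
    rows = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        # One "?" per value in this batch, e.g. "?,?,?"
        placeholders = ",".join("?" * len(chunk))
        sql = sql_template.format(placeholders=placeholders)
        rows.extend(conn.execute(sql, chunk).fetchall())
    return rows


# Hypothetical in-memory table, used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.executemany(
    "INSERT INTO author VALUES (?, ?)",
    [(i, "author-%d" % i) for i in range(50000)],
)

# 50000 ids in a single IN (...) list would exceed the default limit
# of many SQLite builds; batching keeps each statement small.
wanted_ids = list(range(50000))
found = select_in_chunks(
    conn,
    "SELECT id, last_name FROM author WHERE id IN ({placeholders})",
    wanted_ids,
)
print(len(found))
```

Batching trades one large statement for several small ones; the batch size only needs to stay under the limit of the SQLite build in use.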

kelson42 commented 9 years ago

The EN export on the server is stuck on this bug. Could someone please fix it urgently?

kelson42 commented 9 years ago

There is a problem/regression!

Assuming we make a ZIM file with only --languages=fr, then authors.js and authors_lang_fr.js are identical and both contain all authors. Both should contain only French-language authors (as before).
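One possible way to keep the authors list restricted to the selected languages is to join authors to books and filter on the book language. The sketch below uses the standard sqlite3 module rather than the project's peewee models; the simplified schema, the column names author_id and last_name, the helper authors_for_languages, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema and sample data (not the real tables).
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, last_name TEXT);
CREATE TABLE book (
    id INTEGER PRIMARY KEY,
    title TEXT,
    author_id INTEGER,
    language TEXT
);
INSERT INTO author VALUES (1, 'Hugo'), (2, 'Cervantes'), (3, 'Dickens');
INSERT INTO book VALUES
    (10, 'Book A', 1, 'fr'),
    (11, 'Book B', 2, 'es'),
    (12, 'Book C', 3, 'en');
""")


def authors_for_languages(conn, languages):
    """Return (id, last_name) for authors with a book in `languages`."""
    placeholders = ",".join("?" * len(languages))
    sql = (
        "SELECT DISTINCT author.id, author.last_name "
        "FROM author JOIN book ON book.author_id = author.id "
        "WHERE book.language IN (%s) "
        "ORDER BY author.last_name" % placeholders
    )
    return conn.execute(sql, list(languages)).fetchall()


french_authors = authors_for_languages(conn, ["fr"])
print(french_authors)
```

Because only the language codes are bound as parameters, the number of SQL variables stays small no matter how many books match.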

kelson42 commented 9 years ago

In addition, when doing a full export (all Gutenberg books), we still have the problem:

$ rm -rf static/ ; ./dump-gutenberg.py --keep-db --export --zim
'list' object has no attribute 'split'
EXPORTING ebooks to static folder (and JSON)
        Filtered book collection size: 45468
        Filtered book collection, PDF: 960
        Filtered book collection, ePUB: 45331
        Filtered book collection, HTML: 45299
                Dumping full_by_popularity.js
                Dumping full_by_title.js
                Dumping lang_en_by_popularity.js
                Dumping lang_en_by_title.js
                Dumping authors_lang_en.js
Traceback (most recent call last):
  File "./dump-gutenberg.py", line 150, in <module>
    main(docopt(help, version=0.1))
  File "./dump-gutenberg.py", line 137, in main
    only_books=BOOKS)
  File "/media/data/gutenberg/gutenberg/export.py", line 107, in export_all_books
    formats=formats)
  File "/media/data/gutenberg/gutenberg/export.py", line 609, in export_to_json_helpers
    Author.first_names.asc())],
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2139, in __iter__
    return iter(self.execute())
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2132, in execute
    self._qr = ResultWrapper(model_class, self._execute(), query_meta)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 1838, in _execute
    return self.database.execute_sql(sql, params, self.require_commit)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2414, in execute_sql
    self.commit()
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2283, in __exit__
    reraise(new_type, new_type(*exc_value.args), traceback)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2406, in execute_sql
    cursor.execute(sql, params or ())
peewee.OperationalError: too many SQL variables

rashiq commented 9 years ago

Hi @kelson42

Hmm this is weird. I can't reproduce the bug with ./dump-gutenberg.py --keep-db --export --zim. Everything seems to work fine. I do get a crash with ./dump-gutenberg.py -l fr,es -f pdf,epub though.

I'm investigating.

kelson42 commented 9 years ago

You should at least be able to reproduce the problem with the authors list. The other one is a performance problem, so I guess it is related to your hardware.

rashiq commented 9 years ago

Just tried to run ./dump-gutenberg.py -l te -f pdf,epub and it worked. I didn't get the error :/ (I used -l te instead of -l fr,es because there are far more books in French and Spanish than in Telugu, for example, and I just wanted to test it quickly. In theory both should work the same.)

kelson42 commented 9 years ago

$ rm -rf static/ ; ./dump-gutenberg.py -l te --export ; ls -la static/authors*.js
'list' object has no attribute 'split'
EXPORTING ebooks to static folder (and JSON)
        Filtered book collection size: 5
        Filtered book collection, PDF: 0
        Filtered book collection, ePUB: 5
        Filtered book collection, HTML: 5
                Dumping full_by_popularity.js
                Dumping full_by_title.js
                Dumping lang_te_by_popularity.js
                Dumping lang_te_by_title.js
                Dumping authors_lang_te.js
                Dumping auth_39781_by_popularity.js
                Dumping auth_39781_by_title.js
                Dumping authors.js
                Dumping languages.js
                Dumping main_languages.js
Exporting Book #39004.
Exporting to static/శుభలేఖ.39004.html
Copying format file to శుభలేఖ.39004.epub
Creating ePUB at /tmp/tmpEm5Vxp.epub
Exporting to static/శుభలేఖ_cover.39004.html
Exporting Book #39561.
Exporting to static/అగ్నిగుండం.39561.html
Copying format file to అగ్నిగుండం.39561.epub
Creating ePUB at /tmp/tmpaUemp0.epub
Exporting to static/అగ్నిగుండం_cover.39561.html
Exporting Book #40687.
Exporting to static/కొల్లాయి గట్టితే నేమి?.40687.html
Copying format file to కొల్లాయి గట్టితే నేమి?.40687.epub
Creating ePUB at /tmp/tmp3IGrp9.epub
Exporting to static/కొల్లాయి గట్టితే నేమి?_cover.40687.html
Exporting Book #41845.
Exporting to static/ఓనమాలు.41845.html
Copying format file to ఓనమాలు.41845.epub
Creating ePUB at /tmp/tmpbbOEs2.epub
Exporting to static/ఓనమాలు_cover.41845.html
Exporting Book #43220.
Exporting to static/కత్తుల వంతెన.43220.html
Copying format file to కత్తుల వంతెన.43220.epub
Creating ePUB at /tmp/tmpHxlRgS.epub
Exporting to static/కత్తుల వంతెన_cover.43220.html
[]
-rw-rw-r-- 1 kelson kelson 524023 Sep 26 00:51 static/authors.js
-rw-rw-r-- 1 kelson kelson 524023 Sep 26 00:51 static/authors_lang_te.js

Do you seriously believe there are 500 KB of authors in Telugu? ;) Or do you get a different result?

rashiq commented 9 years ago

Yeah, I chose Telugu because there are so few books in that language (so I don't have to download for an hour to reproduce the error):

$ du -h static/authors.js 
516K    static/authors.js

There are only 5 books in Telugu:

sqlite> SELECT * FROM book WHERE language = 'te';
39004|శుభలేఖ||39781|PD|te|31
39561|అగ్నిగుండం||39781|PD|te|40
40687|కొల్లాయి గట్టితే నేమి?||39781|PD|te|19
41845|ఓనమాలు||39781|PD|te|20
43220|కత్తుల వంతెన||39781|PD|te|16

So the generated file has to be correct.
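As a cross-check, the number of distinct authors behind those five rows can be counted directly. The sketch below reloads the rows shown above into an in-memory database with the standard sqlite3 module. Apart from the table name book and the column language, which appear in the query above, every column name is a guess; treating the fourth column as the author id is an assumption, although it matches the auth_39781_by_popularity.js file named in the export log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names other than `language` are assumptions for illustration.
conn.execute(
    "CREATE TABLE book ("
    "id INTEGER PRIMARY KEY, title TEXT, subtitle TEXT, "
    "author_id INTEGER, license TEXT, language TEXT, downloads INTEGER)"
)
rows = [
    (39004, "శుభలేఖ", "", 39781, "PD", "te", 31),
    (39561, "అగ్నిగుండం", "", 39781, "PD", "te", 40),
    (40687, "కొల్లాయి గట్టితే నేమి?", "", 39781, "PD", "te", 19),
    (41845, "ఓనమాలు", "", 39781, "PD", "te", 20),
    (43220, "కత్తుల వంతెన", "", 39781, "PD", "te", 16),
]
conn.executemany("INSERT INTO book VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Count the distinct author ids among the Telugu books.
author_count = conn.execute(
    "SELECT COUNT(DISTINCT author_id) FROM book WHERE language = ?",
    ("te",),
).fetchone()[0]
print(author_count)  # 1
```

Under that assumption the five books share a single author id, which gives a rough expectation for how many entries a Telugu-only authors file would hold.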

I'll try to run the English and Spanish book download overnight to reproduce the exact error :)