openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Scraper 2.0.0 fails to scrape #172

Closed kelson42 closed 1 year ago

kelson42 commented 1 year ago

https://farm.openzim.org/pipeline/f1dfacbc21b1214035793f36/debug

benoit74 commented 1 year ago

Looks like a release issue: this bug has been fixed in https://github.com/openzim/gutenberg/pull/168

The code shown in the stderr output is no longer the code on the main branch.

rgaudin commented 1 year ago

indeed

kelson42 commented 1 year ago

Unusual. @rgaudin, how did we manage this?

rgaudin commented 1 year ago

I missed the last commit, which, unlike the others, has not been merged into main. I'll rewrite the tag.

rgaudin commented 1 year ago

That's my theory; I'm on my phone and will check as soon as I arrive.

rgaudin commented 1 year ago

Wow, it's even weirder:

  File "/usr/local/lib/python3.11/site-packages/gutenberg2zim-2.0.0-py3.11.egg/gutenbergtozim/download.py", line 197, in <listcomp>
    (b.format.mime, b.format.images, b.format.pattern)
     ^^^^^^^^
AttributeError: 'BookFormat' object has no attribute 'format'

See how line 197 of download.py supposedly contains b.format.mime?

here's main

https://github.com/openzim/gutenberg/blob/524326b3792a82b683a97558214eb7c20e459136/gutenbergtozim/download.py#L196-L200

and here's the tag

https://github.com/openzim/gutenberg/blob/34fd1e7178d2e5a63c991054c481c10e2473f087/gutenbergtozim/download.py#L197

The problem is that both the tag and main point to what looks like the same commit, but with different IDs… GitHub has the explanation: “This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.” That is weird, because I made that commit on the main branch.

I'm not sure exactly what I did wrong; retagging now.
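
For next time, a rough pre-release check along these lines could catch a tag whose commit is not reachable from main. This is only a sketch, not something that exists in the repo: the helper name `tag_is_on_branch` and the injectable `run` parameter are made up for illustration, and the tag name in the usage note is a placeholder. It relies on `git merge-base --is-ancestor`, which exits with 0 when the first commit is an ancestor of the second and 1 when it is not.

```python
import subprocess


def tag_is_on_branch(tag, branch="main", run=subprocess.run):
    """Return True if the commit behind `tag` is reachable from `branch`.

    Hypothetical helper: `run` is injectable only so the function can be
    exercised without a real repository.
    """
    # `tag^{commit}` peels an annotated tag down to the commit it points to.
    result = run(
        ["git", "merge-base", "--is-ancestor", f"{tag}^{{commit}}", branch],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return True  # the tagged commit is an ancestor of the branch tip
    if result.returncode == 1:
        return False  # the tagged commit is not on the branch
    # Any other exit code means git itself failed (unknown ref, not a repo, ...).
    raise RuntimeError(result.stderr.strip() or "git merge-base failed")
```

Run from inside a clone, `tag_is_on_branch("<tag>", "origin/main")` returning False would correspond to the "does not belong to any branch" situation above, assuming the remote refs have been fetched first.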

rgaudin commented 1 year ago

Fixed