openzim / gutenberg

Scraper for downloading the entire ebook repository of Project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

All RSYNC paths are inserted as URLs in the DB #220

Closed: benoit74 closed this issue 3 months ago

benoit74 commented 3 months ago

When we pass the --book CLI argument, only a few books are processed.

However, in the parse phase, the scraper logic still inserts every path found in the RSYNC result as a potential URL. This wastes an enormous amount of time, typically when debugging.
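
One way to picture a fix is to filter the RSYNC listing by the requested book IDs before anything is inserted. The following is a minimal sketch, not the scraper's actual code: the names `book_prefixes`, `rsync_path` and `keep_for_books` are hypothetical, and the directory layout (every digit of the ID except the last becomes a directory level, plus a `cache/epub/<id>/` tree) is an assumption extrapolated from the two layouts visible for book 84.

```python
from typing import Iterable, List, Optional, Tuple


def book_prefixes(book_id: int) -> Tuple[str, str]:
    """Return the two path prefixes assumed to hold files for one book."""
    digits = str(book_id)
    # Assumed layout: "8/84/" for book 84; single-digit IDs are assumed to live under "0/".
    parents = "/".join(digits[:-1]) or "0"
    return (f"{parents}/{digits}/", f"cache/epub/{digits}/")


def rsync_path(line: str) -> Optional[str]:
    """Extract the path column from one line of an rsync listing."""
    # Columns: permissions, size, date, time, path.
    parts = line.split(None, 4)
    if len(parts) < 5:
        return None  # blank or malformed line
    return parts[4].rstrip()


def keep_for_books(lines: Iterable[str], book_ids: Iterable[int]) -> List[str]:
    """Keep only the paths that sit under one of the requested books."""
    prefixes = tuple(p for b in book_ids for p in book_prefixes(b))
    kept = []
    for line in lines:
        path = rsync_path(line)
        if path is not None and path.startswith(prefixes):
            kept.append(path)
    return kept
```

Called as `keep_for_books(listing_lines, [84])`, this would drop every path belonging to other books, so only the entries relevant to the --book selection would reach the DB.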

benoit74 commented 3 months ago

Second problem:

All RSYNC paths are inserted, even ones we know in advance we won't use.

For example, for book 84, rsync returns this:

drwxrwxr-x              5 2023/12/18 05:16:53 8/84
-rw-rw-r--        448,642 2022/12/02 20:35:38 8/84/84-0.txt
drwxrwxr-x              3 2022/12/02 21:21:20 8/84/84-h
-rw-rw-r--        466,919 2022/12/02 20:34:04 8/84/84-h/84-h.htm
drwxrwxr-x             25 2022/12/02 21:21:20 8/84/old
-rw-rw-r--        463,170 2014/08/05 09:00:00 8/84/old/20080617-84-h.htm
-rw-rw-r--        171,862 2014/08/05 09:00:00 8/84/old/20080617-84-h.zip
-rw-rw-r--        448,717 2014/08/05 09:00:00 8/84/old/20080617-84.txt
-rw-rw-r--        169,458 2014/08/05 09:00:00 8/84/old/20080617-84.zip
-rw-rw-r--        448,717 2014/08/05 12:41:54 8/84/old/84.txt
-rw-rw-r--        439,077 1993/11/02 09:00:00 8/84/old/frank10.txt
-rw-rw-r--        183,377 1993/11/02 09:00:00 8/84/old/frank10.zip
-rw-rw-r--        441,008 1993/10/26 08:00:00 8/84/old/frank10a.txt
-rw-rw-r--        182,340 1993/10/26 08:00:00 8/84/old/frank10a.zip
-rw-rw-r--        439,073 1996/05/15 09:00:00 8/84/old/frank11.txt
-rw-rw-r--        183,379 1996/05/15 09:00:00 8/84/old/frank11.zip
-rw-rw-r--        441,010 1996/05/15 09:00:00 8/84/old/frank11a.txt
-rw-rw-r--        182,335 1996/05/15 09:00:00 8/84/old/frank11a.zip
-rw-rw-r--        439,072 1996/05/15 09:00:00 8/84/old/frank12.txt
-rw-rw-r--        183,376 1996/05/15 09:00:00 8/84/old/frank12.zip
-rw-rw-r--        441,010 1995/09/26 08:00:00 8/84/old/frank12a.txt
-rw-rw-r--        182,328 1995/09/26 08:00:00 8/84/old/frank12a.zip
-rw-rw-r--        439,072 1996/05/15 09:00:00 8/84/old/frank13.txt
-rw-rw-r--        183,377 1996/05/15 09:00:00 8/84/old/frank13.zip
-rw-rw-r--        443,422 2004/05/16 09:00:00 8/84/old/frank14.txt
-rw-rw-r--        168,131 2004/05/16 09:00:00 8/84/old/frank14.zip
-rw-rw-r--        448,858 2005/05/30 09:00:00 8/84/old/frank15.txt
-rw-rw-r--        169,574 2005/05/30 09:00:00 8/84/old/frank15.zip
drwxrwsr-x             21 2023/11/01 08:51:08 cache/epub/84
-rw-r--r--         60,358 2024/01/01 09:53:43 cache/epub/84/84-cover.png
-rw-r--r--        232,906 2024/01/01 09:53:37 cache/epub/84/pg84-h.zip
-rw-r--r--        270,064 2024/01/01 09:53:44 cache/epub/84/pg84-images-3.epub
-rw-r--r--        476,481 2024/01/01 09:53:48 cache/epub/84/pg84-images-kf8.mobi
-rw-rw-r--        271,123 2024/01/01 09:53:39 cache/epub/84/pg84-images.epub
-rw-r--r--        466,461 2024/01/01 09:53:37 cache/epub/84/pg84-images.html
-rw-r--r--        466,466 2023/10/01 09:51:48 cache/epub/84/pg84-images.html.utf8
-rw-r--r--        448,727 2024/01/01 09:53:43 cache/epub/84/pg84-images.mobi
-rw-rw-r--      1,907,380 2024/01/01 09:53:48 cache/epub/84/pg84.converter.log
-rw-r--r--         17,454 2024/01/01 09:53:38 cache/epub/84/pg84.cover.medium.jpg
-rw-r--r--          4,015 2024/01/01 09:53:38 cache/epub/84/pg84.cover.small.jpg
-rw-rw-r--        271,123 2024/01/01 09:53:38 cache/epub/84/pg84.epub
-rw-r--r--        448,898 2022/09/01 10:03:10 cache/epub/84/pg84.mobi
-rw-rw-r--            424 2013/04/05 20:33:41 cache/epub/84/pg84.qrcode.desktop.png
-rw-rw-r--            431 2013/04/05 20:33:42 cache/epub/84/pg84.qrcode.mobile.png
-rw-rw-r--            303 2024/01/01 09:53:48 cache/epub/84/pg84.qrcode.png
-rw-rw-r--         17,989 2024/01/01 09:53:48 cache/epub/84/pg84.rdf
-rw-r--r--        448,965 2024/01/01 09:53:37 cache/epub/84/pg84.txt
-rw-r--r--        448,965 2023/10/01 09:51:47 cache/epub/84/pg84.txt.utf8

This means that 46 URLs are currently inserted for book 84, while only 21 might be of interest.
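
To illustrate the second problem, here is a minimal sketch of a filter that skips directory entries and anything under an `old/` directory. The function name `is_candidate` and the rule itself are assumptions made for illustration, not the scraper's implementation. Applied to the listing above, this rule keeps the files outside `old/` (2 under `8/84` and 19 under `cache/epub/84`), which appears to match the 21 mentioned.

```python
def is_candidate(line: str) -> bool:
    """Decide whether one rsync listing line is worth inserting as a URL."""
    # Columns: permissions, size, date, time, path.
    parts = line.split(None, 4)
    if len(parts) < 5:
        return False  # blank or malformed line
    perms, path = parts[0], parts[4].rstrip()
    if perms.startswith("d"):
        return False  # directory entry, not a downloadable file
    # Assumption: archived versions under an "old" directory are never used.
    return "old" not in path.split("/")


sample = [
    "drwxrwxr-x              5 2023/12/18 05:16:53 8/84",
    "-rw-rw-r--        448,642 2022/12/02 20:35:38 8/84/84-0.txt",
    "-rw-rw-r--        439,077 1993/11/02 09:00:00 8/84/old/frank10.txt",
    "-rw-rw-r--        271,123 2024/01/01 09:53:38 cache/epub/84/pg84.epub",
]
kept = [line.split(None, 4)[4] for line in sample if is_candidate(line)]
```

Here `kept` contains only `8/84/84-0.txt` and `cache/epub/84/pg84.epub`. A stricter filter could also restrict by file extension, but which formats are actually needed is not stated here.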