Closed benoit74 closed 3 months ago
Second problem:
All RSYNC paths are inserted, even ones we know in advance we won't use:
- `old` subdirs (there are many)

For example, for book 84, rsync gives this:
drwxrwxr-x 5 2023/12/18 05:16:53 8/84
-rw-rw-r-- 448,642 2022/12/02 20:35:38 8/84/84-0.txt
drwxrwxr-x 3 2022/12/02 21:21:20 8/84/84-h
-rw-rw-r-- 466,919 2022/12/02 20:34:04 8/84/84-h/84-h.htm
drwxrwxr-x 25 2022/12/02 21:21:20 8/84/old
-rw-rw-r-- 463,170 2014/08/05 09:00:00 8/84/old/20080617-84-h.htm
-rw-rw-r-- 171,862 2014/08/05 09:00:00 8/84/old/20080617-84-h.zip
-rw-rw-r-- 448,717 2014/08/05 09:00:00 8/84/old/20080617-84.txt
-rw-rw-r-- 169,458 2014/08/05 09:00:00 8/84/old/20080617-84.zip
-rw-rw-r-- 448,717 2014/08/05 12:41:54 8/84/old/84.txt
-rw-rw-r-- 439,077 1993/11/02 09:00:00 8/84/old/frank10.txt
-rw-rw-r-- 183,377 1993/11/02 09:00:00 8/84/old/frank10.zip
-rw-rw-r-- 441,008 1993/10/26 08:00:00 8/84/old/frank10a.txt
-rw-rw-r-- 182,340 1993/10/26 08:00:00 8/84/old/frank10a.zip
-rw-rw-r-- 439,073 1996/05/15 09:00:00 8/84/old/frank11.txt
-rw-rw-r-- 183,379 1996/05/15 09:00:00 8/84/old/frank11.zip
-rw-rw-r-- 441,010 1996/05/15 09:00:00 8/84/old/frank11a.txt
-rw-rw-r-- 182,335 1996/05/15 09:00:00 8/84/old/frank11a.zip
-rw-rw-r-- 439,072 1996/05/15 09:00:00 8/84/old/frank12.txt
-rw-rw-r-- 183,376 1996/05/15 09:00:00 8/84/old/frank12.zip
-rw-rw-r-- 441,010 1995/09/26 08:00:00 8/84/old/frank12a.txt
-rw-rw-r-- 182,328 1995/09/26 08:00:00 8/84/old/frank12a.zip
-rw-rw-r-- 439,072 1996/05/15 09:00:00 8/84/old/frank13.txt
-rw-rw-r-- 183,377 1996/05/15 09:00:00 8/84/old/frank13.zip
-rw-rw-r-- 443,422 2004/05/16 09:00:00 8/84/old/frank14.txt
-rw-rw-r-- 168,131 2004/05/16 09:00:00 8/84/old/frank14.zip
-rw-rw-r-- 448,858 2005/05/30 09:00:00 8/84/old/frank15.txt
-rw-rw-r-- 169,574 2005/05/30 09:00:00 8/84/old/frank15.zip
drwxrwsr-x 21 2023/11/01 08:51:08 cache/epub/84
-rw-r--r-- 60,358 2024/01/01 09:53:43 cache/epub/84/84-cover.png
-rw-r--r-- 232,906 2024/01/01 09:53:37 cache/epub/84/pg84-h.zip
-rw-r--r-- 270,064 2024/01/01 09:53:44 cache/epub/84/pg84-images-3.epub
-rw-r--r-- 476,481 2024/01/01 09:53:48 cache/epub/84/pg84-images-kf8.mobi
-rw-rw-r-- 271,123 2024/01/01 09:53:39 cache/epub/84/pg84-images.epub
-rw-r--r-- 466,461 2024/01/01 09:53:37 cache/epub/84/pg84-images.html
-rw-r--r-- 466,466 2023/10/01 09:51:48 cache/epub/84/pg84-images.html.utf8
-rw-r--r-- 448,727 2024/01/01 09:53:43 cache/epub/84/pg84-images.mobi
-rw-rw-r-- 1,907,380 2024/01/01 09:53:48 cache/epub/84/pg84.converter.log
-rw-r--r-- 17,454 2024/01/01 09:53:38 cache/epub/84/pg84.cover.medium.jpg
-rw-r--r-- 4,015 2024/01/01 09:53:38 cache/epub/84/pg84.cover.small.jpg
-rw-rw-r-- 271,123 2024/01/01 09:53:38 cache/epub/84/pg84.epub
-rw-r--r-- 448,898 2022/09/01 10:03:10 cache/epub/84/pg84.mobi
-rw-rw-r-- 424 2013/04/05 20:33:41 cache/epub/84/pg84.qrcode.desktop.png
-rw-rw-r-- 431 2013/04/05 20:33:42 cache/epub/84/pg84.qrcode.mobile.png
-rw-rw-r-- 303 2024/01/01 09:53:48 cache/epub/84/pg84.qrcode.png
-rw-rw-r-- 17,989 2024/01/01 09:53:48 cache/epub/84/pg84.rdf
-rw-r--r-- 448,965 2024/01/01 09:53:37 cache/epub/84/pg84.txt
-rw-r--r-- 448,965 2023/10/01 09:51:47 cache/epub/84/pg84.txt.utf8
This means that 46 URLs are currently inserted for book 84, while only 21 might be of interest.
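To illustrate the kind of filtering this suggests, here is a minimal sketch that drops directory entries and anything under an `old` subdir while reading the rsync listing. The function name `is_candidate` is hypothetical and not taken from the scraper; it also assumes each listing line has the five whitespace-separated fields shown above (permissions, size, date, time, path).

```python
from pathlib import PurePosixPath


def is_candidate(rsync_line: str) -> bool:
    """Return True if a listing line names a file we might want a URL for.

    Hypothetical helper: assumes the line format
    "<perms> <size> <date> <time> <path>" shown in the listing above.
    """
    # Split on the first four runs of whitespace only, so that a path
    # containing spaces stays intact in the last field.
    fields = rsync_line.split(maxsplit=4)
    if len(fields) != 5:
        return False  # not a listing line (blank line, summary, ...)
    perms, _size, _date, _time, path = fields
    if perms.startswith("d"):
        return False  # directory entry, not a downloadable file
    # Skip anything that lives under an "old" directory.
    return "old" not in PurePosixPath(path).parts
```

Applied to the book 84 listing above, this would keep the 2 files under `8/84` that are outside `old` plus the 19 files under `cache/epub/84`, which seems to match the 21 paths of interest.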
When we pass the `--book` CLI argument, only a few books will be processed. However, in the `parse` phase the scraper logic still inserts every path found in the RSYNC result as a potential URL. This leads to an enormous waste of time, typically when debugging.
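One possible way to avoid this is to derive the book ID from each rsync path and skip paths whose ID was not requested. The sketch below is only an illustration: `book_id_from_path` and `keep_path` are hypothetical names, and the directory layout is an assumption inferred from the listing above (`cache/epub/<id>/...`, and a main tree where the ID directory is preceded by one directory per leading digit, such as `8/84`, or by `0` for single-digit IDs).

```python
from pathlib import PurePosixPath
from typing import Optional, Set


def book_id_from_path(path: str) -> Optional[int]:
    """Guess the book ID an rsync path belongs to (assumed layout)."""
    parts = PurePosixPath(path).parts
    # Generated files: cache/epub/<id>/...
    if parts[:2] == ("cache", "epub"):
        if len(parts) > 2 and parts[2].isdigit():
            return int(parts[2])
        return None
    # Main tree: digit directories, then the ID directory,
    # e.g. 8/84 for book 84 (assumed: 0/<id> for single-digit IDs).
    seen = []
    for part in parts:
        if not part.isdigit():
            return None
        expected = list(part[:-1]) or ["0"]
        if seen == expected:
            return int(part)
        seen.append(part)
    return None


def keep_path(path: str, wanted_ids: Set[int]) -> bool:
    """Keep everything when no book was requested, else only matching IDs."""
    return not wanted_ids or book_id_from_path(path) in wanted_ids
```

With `wanted_ids` built from the `--book` values, the `parse` phase could call `keep_path` before inserting a URL, so that a debugging run on a handful of books only touches the rows it needs.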