openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0
46 stars 5 forks source link

Scraper should check filename is valid #232

Closed benoit74 closed 7 months ago

benoit74 commented 7 months ago

When filename is passed, we do not check it is valid before starting the crawler. It would save time.

See https://farm.youzim.it/pipeline/b9b7b43d-01e0-4ae6-851f-13b4042d1d3d/debug

Command:

zimit --zim-file=Turok Sanctum 04/13/2024_115fb1f0.zim --lang=English --url=https://turoksanctum.com/ --name=turoksanctum.com_115fb1f0 --userAgentSuffix=Youzim.it+ --sizeLimit=4294967296 --timeLimit=7200 --output=/output --statsFilename=/output/task_progress.json --adminEmail=contact+zimfarm@kiwix.org --keep --publisher=openZIM

Which is turn starts:

warc2zim --zim-file=Turok Sanctum 04/13/2024_115fb1f0.zim --name=turoksanctum.com_115fb1f0 --publisher=openZIM --output /output --url https://turoksanctum.com/

Which replies with return code 100 (i.e. ok) while it could detect that --zim-file parameter is an unsupported file name for local filesystem.

Final error was:

Processing WARC files in /output/.tmpx0ezwt48/collections/crawl-20240414022153206/archive
16 WARC files found
Calling warc2zim with these args: ['--zim-file=Turok Sanctum 04/13/2024_115fb1f0.zim', '--name=turoksanctum.com_115fb1f0', '--publisher=openZIM', '--output', '/output', '--url', 'https://turoksanctum.com/', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmpx0ezwt48/collections/crawl-20240414022153206/archive']
[DEBUG] Confirming output is writable using /output/tmplf6kxht_
[DEBUG] Title: Turok Sanctum – Turok Mods and Maps
[DEBUG] Language: en-US
[DEBUG] Favicon: https://i0.wp.com/turoksanctum.com/wp-content/uploads/2017/04/cropped-1_1428292389.png?fit=32%2C32&ssl=1
[DEBUG] Found WARC record for favicon: https://i0.wp.com/turoksanctum.com/wp-content/uploads/2017/04/cropped-1_1428292389.png?fit=32%2C32&ssl=1
Traceback (most recent call last):
  File "/usr/bin/zimit", line 566, in <module>
    zimit()
  File "/usr/bin/zimit", line 464, in zimit
    return warc2zim(warc2zim_args)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 113, in main
    return converter.run()
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 265, in run
    self.creator = Creator(
  File "libzim/libzim.pyx", line 263, in libzim._Creator.__cinit__
OSError: Unable to write ZIM file at /output/Turok Sanctum 04/13/2024_115fb1f0.zim
dan-niles commented 7 months ago

@benoit74 I would like to work on this.

One way we can check for invalid filenames is by using a regular expression pattern (like the one found here) to match invalid characters and then return an error before the scraping process starts.

Or we can use a library like pathvalidate to validate filenames. It has useful functions that can validate a filename and return the reason for the error (see validate_filename) or just check the filename and return a boolean value (see is_valid_filename).

What do you think would be the more suitable method to solve this issue?

benoit74 commented 7 months ago

@dan-niles thank you!

What about simply doing a touch on the filename with https://docs.python.org/3/library/pathlib.html#pathlib.Path.touch and immediately removing the created file? This would avoid relying on regexp (always a bit fragile) or adding a library for a simple need.

It is not like we are validating 10s of filename per seconds where it would be an issue.

WDYT about this idea?

Please note that we are already about to check that output folder is really usable with #106, you should probably just enhance this code with a check on ZIM filename.

dan-niles commented 7 months ago

Sure, that sounds like a good idea, much simpler than using regex or a separate library. I'll add a try-except block and use touch and unlink to create the file and delete it immediately. If a exception is thrown, I'll log and exit.