Closed benoit74 closed 7 months ago
@benoit74 I would like to work on this.
One way we can check for invalid filenames is by using a regular expression pattern (like the one found here) to match invalid characters and then return an error before the scraping process starts.
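For illustration, a minimal sketch of the regex approach. The pattern below is an assumption (the characters Windows filesystems reject, plus control characters), not the pattern linked above, and `find_invalid_chars` is a hypothetical helper name:

```python
import re

# Assumed pattern: characters rejected by Windows filesystems, plus ASCII
# control characters. Other filesystems have different rules.
INVALID_FILENAME_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def find_invalid_chars(filename: str) -> list[str]:
    """Return the sorted, de-duplicated invalid characters found in filename."""
    return sorted(set(INVALID_FILENAME_CHARS.findall(filename)))
```

An empty list would mean the name passed this particular check; it would not guarantee that the target filesystem accepts it.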
Or we can use a library like pathvalidate to validate filenames. It has useful functions that can validate a filename and return the reason for the error (see validate_filename) or just check the filename and return a boolean value (see is_valid_filename).
What do you think would be the more suitable method to solve this issue?
@dan-niles thank you!
What about simply doing a touch on the filename with https://docs.python.org/3/library/pathlib.html#pathlib.Path.touch and immediately removing the created file? This would avoid relying on a regexp (always a bit fragile) or adding a library for a simple need.
It is not as if we were validating tens of filenames per second, where the extra file creation would be an issue.
WDYT about this idea?
Please note that we are already about to check that the output folder is really usable with #106; you should probably just enhance this code with a check on the ZIM filename.
Sure, that sounds like a good idea, much simpler than using regex or a separate library.
I'll add a try-except block and use touch and unlink to create the file and delete it immediately. If an exception is thrown, I'll log it and exit.
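A rough sketch of the touch-and-unlink check described above, not the code that was eventually merged. The function name `check_zim_filename` and the choice to leave an already existing file untouched are assumptions made for illustration:

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def check_zim_filename(output_dir: Path, zim_file: str) -> bool:
    """Return True if a file named zim_file can be created in output_dir."""
    candidate = output_dir / zim_file
    try:
        # exist_ok=False so the probe never touches or deletes a real file.
        candidate.touch(exist_ok=False)
    except FileExistsError:
        # A file with this name already exists, so the name itself is usable.
        return True
    except (OSError, ValueError) as exc:
        # OSError: name rejected by the filesystem or parent folder missing.
        # ValueError: embedded null byte in the name.
        logger.error("Invalid ZIM filename %r: %s", zim_file, exc)
        return False
    # Remove the probe file right away.
    candidate.unlink()
    return True
```

The caller would run this before starting the crawler and exit with a non-zero status when it returns False.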
When a filename is passed, we do not check that it is valid before starting the crawler. Checking it up front would save time.
See https://farm.youzim.it/pipeline/b9b7b43d-01e0-4ae6-851f-13b4042d1d3d/debug
Command:
Which in turn starts:
Which replies with return code 100 (i.e. ok), while it could detect that the --zim-file parameter is an unsupported file name for the local filesystem.
Final error was: