Closed peci1 closed 1 year ago
Hi @peci1, and thanks for your interest in this library. Unfortunately, I'm not convinced that this change is an ideal solution here, since it is a Windows-specific change that would have unexpected side-effects for other platforms. Can you clarify what you mean by "Windows has a problem"? Perhaps that will allow us to find a different fix.
Windows does not allow colon in filenames and filepaths (there are more, see https://stackoverflow.com/a/31976060/1076564). On linux, colon is allowed, but has to be escaped when used in shell, which also causes inconvenience.
Thanks for clarifying that, @peci1. Makes sense. In this case, I'm leaning toward this solution:
Provide a .download_to(...)
method parameter, replace_colon
, which defaults to True
on Windows but False
on all other platforms. Users can set this parameter to any bool
or str
.
If replace_colon==True
, then colons are replaced by _
characters. If isinstance(replace_colon, str)
, then colons are replace by whatever that string is. If replace_colon==False
, then colons are kept as-is.
Provide similar options for the command-line tool, as --replace-colon
.
What do you think?
I should also note: Aside from the port number separate, colons might also appear in the filename or file path. So I think this replacement should be performed on the final path destination.
I'm not sure it makes sense to allow False on Windows. The generated filename will be invalid. And what about the other forbidden characters?
I'm not sure it makes sense to allow False on Windows.
If you were particularly concerned about False
on Windows (which someone would have to intentionally/explicitly set, which seems unlikely), then you could add a ValueError
exception in those scenarios.
And what about the other forbidden characters?
It's a good question. Do you have a list of those? If they are impossible to anticipate, or too variable by platform, I suppose another approach would be to let the user define a regular expression, whitelist, or blacklist of characters that they wanted to treat differently.
If you were particularly concerned about
False
on Windows (which someone would have to intentionally/explicitly set, which seems unlikely), then you could add aValueError
exception in those scenarios.
Sounds good.
Do you have a list of those?
It's in the SO answer I linked. On Windows, it is 9 printable ASCII chars + all non-printable ASCII chars. On Linux, the only forbidden character in file/directory name is forward slash. So that's not a problem. On Windows, it is more complicated as there are (according to the SO answer) filenames that are forbidden although composed only of allowed characters (i.e. filenames cannot end with a space or dot, or there are some reserved filenames like COM1).
My view is that this library does not necessarily need to go very deep into the complications of OS file naming rules. But picking up the simple rules (forbidden characters) chould cover 99% of all cases, which is a good ratio to me.
What about creating a static list of forbidden characters that would be filled according to the platform, and adding parameter forbidden_chars_replacement
(do you have a better name?) with default value of _
? On Linux, the list of forbidden chars would be empty, so no names would be mangled. On Windows, the default behavior would create valid filenames out of the box, while keeping the option to specify the substitution character.
It's in the SO answer I linked.
Thanks, and my apologies; I should have looked there first.
My view is that this library does not necessarily need to go very deep into the complications of OS file naming rules. But picking up the simple rules (forbidden characters) chould cover 99% of all cases, which is a good ratio to me.
That makes sense to me.
What about creating a static list of forbidden characters that would be filled according to the platform, and adding parameter
forbidden_chars_replacement
(do you have a better name?) with default value of_
?
That seems reasonable. Maybe fallback_char
or replacement_char
?
What do you think about adding a param, chars_to_replace
, that lets the user adjust this list? Or over-engineering for this particular situation?
Thanks, I've updated the PR with the new ideas. I think configurability of the invalid chars is over-engineering.
I also only added the fallback_char
parameter to the download function. I'd leave its propagation up to the CLI on you as I don't know the structure of the program as well...
I would not apply replace_invalid_chars
to the final constructed filepath
as that can already contain a colon on Windows (C:\file
).
Thank you, @peci1! I haven't forgotten about this; just been busy and want to give it a careful read-through.
The URL netloc part can contain port number, e.g.
doma.in:80
. When this is translated into filesystem path, Windows has a problem. Replacing the colon with underscore resolves it.