jsvine / waybackpack

Download the entire Wayback Machine archive for a given URL.
MIT License
2.8k stars 189 forks source link

Handle port number in URL #58

Closed peci1 closed 1 year ago

peci1 commented 1 year ago

The URL netloc part can contain port number, e.g. doma.in:80. When this is translated into filesystem path, Windows has a problem. Replacing the colon with underscore resolves it.

jsvine commented 1 year ago

Hi @peci1, and thanks for your interest in this library. Unfortunately, I'm not convinced that this change is an ideal solution here, since it is a Windows-specific change that would have unexpected side-effects for other platforms. Can you clarify what you mean by "Windows has a problem"? Perhaps that will allow us to find a different fix.

peci1 commented 1 year ago

Windows does not allow colon in filenames and filepaths (there are more, see https://stackoverflow.com/a/31976060/1076564). On linux, colon is allowed, but has to be escaped when used in shell, which also causes inconvenience.

jsvine commented 1 year ago

Thanks for clarifying that, @peci1. Makes sense. In this case, I'm leaning toward this solution:

What do you think?

I should also note: Aside from the port number separate, colons might also appear in the filename or file path. So I think this replacement should be performed on the final path destination.

peci1 commented 1 year ago

I'm not sure it makes sense to allow False on Windows. The generated filename will be invalid. And what about the other forbidden characters?

jsvine commented 1 year ago

I'm not sure it makes sense to allow False on Windows.

If you were particularly concerned about False on Windows (which someone would have to intentionally/explicitly set, which seems unlikely), then you could add a ValueError exception in those scenarios.

And what about the other forbidden characters?

It's a good question. Do you have a list of those? If they are impossible to anticipate, or too variable by platform, I suppose another approach would be to let the user define a regular expression, whitelist, or blacklist of characters that they wanted to treat differently.

peci1 commented 1 year ago

If you were particularly concerned about False on Windows (which someone would have to intentionally/explicitly set, which seems unlikely), then you could add a ValueError exception in those scenarios.

Sounds good.

Do you have a list of those?

It's in the SO answer I linked. On Windows, it is 9 printable ASCII chars + all non-printable ASCII chars. On Linux, the only forbidden character in file/directory name is forward slash. So that's not a problem. On Windows, it is more complicated as there are (according to the SO answer) filenames that are forbidden although composed only of allowed characters (i.e. filenames cannot end with a space or dot, or there are some reserved filenames like COM1).

My view is that this library does not necessarily need to go very deep into the complications of OS file naming rules. But picking up the simple rules (forbidden characters) chould cover 99% of all cases, which is a good ratio to me.

What about creating a static list of forbidden characters that would be filled according to the platform, and adding parameter forbidden_chars_replacement (do you have a better name?) with default value of _? On Linux, the list of forbidden chars would be empty, so no names would be mangled. On Windows, the default behavior would create valid filenames out of the box, while keeping the option to specify the substitution character.

jsvine commented 1 year ago

It's in the SO answer I linked.

Thanks, and my apologies; I should have looked there first.

My view is that this library does not necessarily need to go very deep into the complications of OS file naming rules. But picking up the simple rules (forbidden characters) chould cover 99% of all cases, which is a good ratio to me.

That makes sense to me.

What about creating a static list of forbidden characters that would be filled according to the platform, and adding parameter forbidden_chars_replacement (do you have a better name?) with default value of _?

That seems reasonable. Maybe fallback_char or replacement_char?

What do you think about adding a param, chars_to_replace, that lets the user adjust this list? Or over-engineering for this particular situation?

peci1 commented 1 year ago

Thanks, I've updated the PR with the new ideas. I think configurability of the invalid chars is over-engineering.

I also only added the fallback_char parameter to the download function. I'd leave its propagation up to the CLI on you as I don't know the structure of the program as well...

I would not apply replace_invalid_chars to the final constructed filepath as that can already contain a colon on Windows (C:\file).

jsvine commented 1 year ago

Thank you, @peci1! I haven't forgotten about this; just been busy and want to give it a careful read-through.