mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0
11.8k stars 967 forks

Provide a reliable "filename"-like keyword for filenaming that is always based on URI (or/and header) #6132

Open fireattack opened 2 months ago

fireattack commented 2 months ago

Currently filename works for most sites but not always.

For example, imgbox: https://imgbox.com/SvCSMjtI

I want to name the downloaded file SvCSMjtI_o.jpg (taken from the direct image URL, https://images2.imgbox.com/5d/90/SvCSMjtI_o.jpg), but the filename keyword is actually the site-specific value yuka_s1113-1830971240661819757-20240903220856-03 (the original filename from when the uploader uploaded the file).

URLs with headers like Content-Disposition: attachment; filename="a_different_name.jpg" are trickier, since both names can reasonably be considered the "real" filename, but that does not apply to the case above because it has no such header.

Also, a suggestion for the doc:

Provide a list of commonly used keywords in readme.

I understand that you can use -K to obtain the full list of keywords, and that lots of keywords are site-specific, but there are also lots of common ones (like extension, title, etc.). I think a list of the most commonly used ones would be useful for a new user. At least, I personally was very confused when I used gallery-dl for the first time and tried to find the keyword/field conventions for naming my files.

Furthermore, I think the link in "and powerful filenaming capabilities" should go to a document that contains such a keyword list together with the formatting info we currently have (basically, similar to yt-dlp's approach: https://github.com/yt-dlp/yt-dlp?tab=readme-ov-file#output-template).

a84r7a3rga76fg commented 2 months ago

image_key and extension will save that picture as SvCSMjtI_o.jpg.

fireattack commented 2 months ago

That's just an example; the point is to get the same filename that wget, a browser, etc. would get. There is no guarantee it would always be {image_key}_o.{extension}.

Hrxn commented 2 months ago

I agree with the idea. But first, we'll have to agree on an approach here.

For example, we could emulate curl with its -J -O options.

It uses the name from Content-Disposition if and only if there's a valid Content-Disposition header, otherwise it takes the filename from the last segment of the URL.
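A rough Python sketch of that fallback, for discussion only. This is not curl's actual implementation and not gallery-dl code; the function names are made up for illustration, and header parsing is delegated to the standard library's email.message.Message.

```python
from email.message import Message
from urllib.parse import unquote, urlsplit


def name_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() returns None when the header has no filename parameter.
    return msg.get_filename() or None


def name_from_url(url):
    """Return the percent-decoded last path segment of a URL."""
    return unquote(urlsplit(url).path.rsplit("/", 1)[-1])


def remote_name(url, content_disposition=None):
    """Prefer the header filename; otherwise fall back to the URL."""
    return name_from_content_disposition(content_disposition) or name_from_url(url)


url = "https://images2.imgbox.com/5d/90/SvCSMjtI_o.jpg"
print(remote_name(url))  # SvCSMjtI_o.jpg
print(remote_name(url, 'attachment; filename="a_different_name.jpg"'))  # a_different_name.jpg
```

Note that the returned header filename is used as-is here, so it would still need to be sanitized before touching the filesystem.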

But none of this is foolproof. Content-Disposition may contain path information etc., and it may result in a filename that already exists in the current directory, leading to an undesired skip or, even worse, an overwrite. This should be handled, for example by always stripping anything that could be considered a path component. The header is completely controlled by the server, and there might be encoding/character-set issues etc., so you could end up with percent-encoded filenames that you maybe don't want.
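To make the path-stripping idea concrete, here is a minimal sketch. The function name and the "download" fallback are hypothetical, and decoding percent-escapes unconditionally is an assumption that may not be wanted for every site. It does not deal with collisions against files that already exist.

```python
from urllib.parse import unquote


def sanitize_filename(name, fallback="download"):
    """Reduce a server-supplied name to a single, safe path component."""
    # Decode percent-escapes first so an encoded "%2F" cannot hide a separator.
    name = unquote(name)
    # Treat both "/" and "\" as separators and keep only the last component.
    name = name.replace("\\", "/").rsplit("/", 1)[-1].strip()
    # Reject names that are empty or refer to a directory.
    if name in ("", ".", ".."):
        return fallback
    return name


print(sanitize_filename("../../etc/passwd"))        # passwd
print(sanitize_filename("..%2F..%2Fpicture.png"))   # picture.png
print(sanitize_filename("C:\\Users\\x\\photo.jpg"))  # photo.jpg
```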

How to handle redirects? If you use curl -O, it always takes the filename from the respective segment of the input URL, regardless of redirects.

fireattack commented 1 month ago

Ideally, I would prefer that we use (in this order) the filename provided by Content-Disposition if any, the filename in the redirected URL, and then the filename in the original URL. (There could also be a filename extension auto-fix based on the Content-Type header -- like browsers' "save as" usually does.)
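Something like the following is what I have in mind, as a sketch only. The three names are assumed to be extracted already, the function names are made up, and the MIME table is a tiny hypothetical mapping. For simplicity it only appends an extension when the name has none, which is less than what browsers do.

```python
import os

# Hypothetical, deliberately tiny mapping; a real one would cover far more types.
EXTENSION_FOR_TYPE = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def choose_filename(header_name, redirected_name, original_name):
    """Return the first non-empty candidate in the proposed priority order."""
    for candidate in (header_name, redirected_name, original_name):
        if candidate:
            return candidate
    return None


def fix_extension(name, content_type):
    """Append an extension derived from Content-Type if the name has none."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    wanted = EXTENSION_FOR_TYPE.get(mime)
    if wanted is None or os.path.splitext(name)[1]:
        return name
    return name + wanted


name = choose_filename(None, "SvCSMjtI_o", "SvCSMjtI")
print(fix_extension(name, "image/jpeg"))  # SvCSMjtI_o.jpg
```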

But I assume our current dupe/skip check kicks in entirely before the actual HTTP request is made?

If so, yeah, that will indeed make things more complicated. In that case, I think a static filename based purely on the original input URL (something as simple as unquote(url.split('?')[0].split('#')[0].split('/')[-1])) is more than enough.
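Spelled out as a small function (the name filename_from_url is just for illustration, and unquote is assumed to be urllib.parse.unquote):

```python
from urllib.parse import unquote


def filename_from_url(url):
    """Return the percent-decoded last path segment of the input URL."""
    # Drop the query string and the fragment, then keep the last segment.
    path = url.split("?")[0].split("#")[0]
    return unquote(path.split("/")[-1])


print(filename_from_url("https://images2.imgbox.com/5d/90/SvCSMjtI_o.jpg"))
# SvCSMjtI_o.jpg
```

One caveat: a URL that ends in "/" gives an empty string, so a fallback name would still be needed for that case.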