mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0
11.86k stars 976 forks

How do you incorporate metadata into a file? #4086

Open Deejay85 opened 1 year ago

Deejay85 commented 1 year ago

I tried to use the metadata settings as displayed in the --help menu. Unfortunately, all that would happen is that I would either be given multiple files to accompany each downloaded file, or one big file with all the metadata information in it…neither of these is what I am looking for. If at all possible, how do I integrate the metadata into the image/video file itself? And additionally, instead of having the following structure:

...tag1 ...tag2 ...tag3 ...etc.

what program could I use to sort the images by their metadata if I were trying to find images that had only a given combination of tags (example: mother, son, picnic, sunny, etc.)?

AlttiRi commented 1 year ago

Gallery-dl writes metadata into a separate text file, not into EXIF (and only some file formats support embedded EXIF anyway).

BTW, I find it questionable to write extra metadata directly into the downloaded files, since that makes it impossible to recalculate the original hashes (MD5, for example), which can be useful for searching on some sites, or for de-duplication.
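
For reference, gallery-dl's metadata post processor is what writes those sidecar files. A minimal config sketch that writes one tag per line into a metadata/ sub-folder next to the downloads (the directory option assumes a reasonably recent gallery-dl; the values here are just one plausible setup):

{
    "name": "metadata",
    "mode": "tags",
    "directory": "metadata",
    "extension": "txt"
}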


Off-topic:

BTW2, there is the alternate data stream (ADS) on NTFS. Technically, it's a perfect place for storing a file's metadata, but it's very unpopular.

Browsers use it to store the download URL of a file. For example, if you download this URL (https://i.imgur.com/Xulubox.jpeg) from a browser, you can open the metadata stream just by appending :Zone.Identifier to the filename.

A command to open the Zone.Identifier stream with Notepad from cmd:

notepad %USERPROFILE%\Downloads\Xulubox.jpeg:Zone.Identifier

The result is a text file with the following metadata content:

[ZoneTransfer]
ZoneId=3
ReferrerUrl=https://github.com/mikf/gallery-dl/issues/4086
HostUrl=https://i.imgur.com/Xulubox.jpeg
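
To see which alternate data streams a file carries in the first place, cmd's dir with the /R switch lists them (same example path as above):

dir /R %USERPROFILE%\Downloads\Xulubox.jpeg
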
Deejay85 commented 1 year ago

The thing is, I am trying to conserve disk space. As you can imagine, when you have around 250 folders (most of them artists), there are going to be a lot of duplicates and wasted space...especially if it's a giant 500MB file that needs to be compressed down using Handbrake. The point is, if I could just dump everything into one folder and search the metadata that way, it would have a lot more versatility, and I seriously doubt I would be the only one jumping for joy, because that would mean several TBs' worth of space saved.

taskhawk commented 1 year ago

I don't understand how adding the metadata to the files, assuming it's doable, is going to save disk space.

AlttiRi commented 1 year ago

Moreover, if you modify the original file (by writing the additional metadata directly into it), you will not be able to de-duplicate files based on hash comparison when you download the same image from different services (different services, different metadata).


The files above (1, 2, 3) are the same; you can just de-duplicate them with any software. For example, you can simply replace the duplicates with hard links. (Note: the downside of hard links is that the files will share the same mtime/btime/...)
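
A minimal shell sketch of that replacement, assuming the two files really are byte-identical (file names are placeholders):

# cmp -s exits 0 only if the files match; ln -f then replaces the duplicate with a hard link
cmp -s original.png duplicate.png && ln -f original.png duplicate.png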


I just write the required metadata to a formatted .html or a plain .txt file and place it in a sub-folder (metadata) nearby, with the same filename except for the extension. If I have the filename of an image, I can easily find the metadata file since it has almost the same filename.


UPD: I would also note the obvious: after any editing, the file is no longer the original file. For some collectors it is important to have exactly the same file as was posted by the original author, without any third-party modification.

Hrxn commented 1 year ago

^ excellent point

taskhawk commented 1 year ago

I download artists' works from several services as well and it's not always the case that the files have matching hashes. They could have modified metadata (in the file itself) for whatever reason, or if they are PNGs they could have different compression levels, for example.

I have found it best to use software that de-duplicates using visual similarity. For images I use findimagedupes (Perl version) with good results. Now I just need to find something for videos.

I do almost the same thing with metadata files, creating an additional file containing just tags from the services that have them, in case I want to find something specific. I just search those files from the command line using grep in my system. If I did it more frequently I would probably write a script to make it smoother.
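
A rough sketch of that kind of search, assuming one tag per line in .txt sidecar files under a metadata/ directory (tags and paths here are placeholders):

# keep only the files that contain every one of the wanted tags
grep -rlxZ 'mother' metadata/ | xargs -0 -r grep -lxZ 'son' | xargs -0 -r grep -lx 'picnic'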

Twi-Hard commented 1 year ago

I process both PNGs and JPGs losslessly since it makes the files much smaller. The biggest benefit, though, is that PNGs that are the same image but posted on different sites become 100% identical after processing (because the format is lossless but the compression is different on each site). This saves me literally over 10TB of space because my file system doesn't store duplicate data (ZFS). Every site processes images differently, which changes the hash. I have 5 boorus based on Philomena; 4 are clones of the 1st (but they're all a bit different). If I download an image that was uploaded to the 1st site and then imported to the other 4, the hashes are usually completely different. After processing, the PNGs become identical (and most images on the boorus are PNG). In case anybody cares, this is how I do it:

{
    "name": "exec",
    "async": false,
    "archive": "/path/to/archive/booru-optimization-png.sqlite3",
    "archive-prefix": "{category}, ",
    "archive-format": "{id}.{extension!l}",
    "archive-pragma":
    [
        "journal_mode=WAL",
        "synchronous=NORMAL"
    ],
    "command": "oxipng -o max --strip safe --force --preserve {}",
    "event": "after",
    "filter": "extension == 'png'",
    "mtime": true,
    "whitelist":
    [
        "derpibooru",
        "furbooru",
        "manebooru",
        "ponerpics",
        "ponybooru",
        "twibooru"
    ],
    "blacklist": null
},
{
    "name": "exec",
    "async": false,
    "archive": "/path/to/archive/booru-optimization-jpg.sqlite3",
    "archive-prefix": "{category}, ",
    "archive-format": "{id}.{extension!l}",
    "archive-pragma":
    [
        "journal_mode=WAL",
        "synchronous=NORMAL"
    ],
    "command": "exiftran -ai {} && jpegtran -copy none -perfect -optimize -outfile {} {}",
    "event": "after",
    "filter": "extension == 'jpg'",
    "mtime": true,
    "whitelist":
    [
        "derpibooru",
        "furbooru",
        "manebooru",
        "ponerpics",
        "ponybooru",
        "twibooru"
    ],
    "blacklist": null
},

different versions of jpegtran are better than others**

a84r7a3rga76fg commented 1 year ago

@Twi-Hard Why not use JPEG XL cjxl -j 1 -d 0 -e 7? It does lossless transcoding, saves space, is much faster than PNG/JPG compression tools, gives you the same hash for differently compressed but pixel-identical PNG files, and it even lets you losslessly revert the conversion. PNG/JPG compression tools don't always do lossless despite claiming otherwise (use ImageMagick to see for yourself: magick compare -verbose -metric mae rose.jpg reconstruct.jpg difference.png). I've transcoded hundreds of thousands of pictures to JPEG XL and compared each file with the original, and I've never encountered a single pixel difference.
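
For the JPEG case specifically, the reversibility is easy to check with libjxl's own tools (a sketch; file names are placeholders, and -j 1 is the lossless JPEG recompression mode the flags above already use):

# recompress losslessly, reconstruct the original JPEG, compare hashes
cjxl -j 1 -d 0 -e 7 original.jpg original.jxl
djxl original.jxl reconstructed.jpg
sha256sum original.jpg reconstructed.jpg   # identical hashes if the round trip is truly lossless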


kattjevfel commented 1 year ago

@a84r7a3rga76fg As much as I want JXL to be a success, a lot of software currently doesn't support it. Neither browsers (without enabling debug features) nor, afaik, any phones support it. And even if your device happens to support it, you can't easily share the files with others.

a84r7a3rga76fg commented 1 year ago

You can send any file on a phone. I haven't tried it, but I'm pretty sure mpv on Android supports it, and there are file players on iOS such as Outplayer that use mpv as a backend and may also support it. I use JPEGView, which is popular and supports it out of the box. NeeView is another popular open-source image viewer too. The only thing missing is mainstream browsers supporting it, because of Google's monopoly, which is not an excuse to dismiss JXL.

Like I said, you can losslessly revert the conversion when you want the previous file format back, you're not losing anything.

taskhawk commented 1 year ago

"command": "oxipng -o max --strip safe --force --preserve {}"

@Twi-Hard it seems you could further optimize by adding --alpha and --interlace 0. Edit: Nevermind, --interlace is 0 by default.

Also, isn't --force counterproductive?

--force           Write the output even if it is larger than the input

different versions of jpegtran are better than others**

Could you expand on this?

Twi-Hard commented 1 year ago

I use --force because the main reason for using oxipng is to make images identical so they dedupe properly (so duplicate files don't take space). If an image on one site is compressed better than oxipng is capable of, and the same image is on another site but not compressed as well as oxipng can manage, then --force makes them both the same. I archive well over 100 sites, so this probably happens. It also helps me know which of my images are exactly the same image.

When I first started using these tools I did a ton of testing and research, and for jpegtran specifically I read that the Mozilla version is better at compression, so I stuck with that. There are several other versions out there, like a "normal" one and one by Cloudflare. I can't remember if I compared the differences between them or not, but they're definitely different: https://github.com/cloudflare/jpegtran https://github.com/mozilla/mozjpeg

Also, thanks for the --alpha suggestion. I should try that

I did a little looking and this seems to be one of the things that affected my choice: "There exists a "regular" jpegtran and a MozJPEG version of jpegtran. That's the same program, but the MozJPEG version has different default settings and performs extra work to compress better." I remember looking at graphs comparing the tools too. I think I remember the mozilla one having more options.

"command": "exiftran -ai {} && jpegtran -copy none -perfect -optimize -outfile {} {}",

Something important to note in this command is the exiftran -ai part. When I was testing, I noticed that images with metadata specifying which orientation they should be viewed at would lose that metadata after running jpegtran, so the images would end up in the wrong orientation. exiftran -ai rotates the image according to that metadata first.

taskhawk commented 1 year ago

Heads up, I just learned that the --alpha option is technically lossy and could cause some issues depending on the image and application.

https://github.com/shssoichiro/oxipng/issues/164 https://github.com/shssoichiro/oxipng/pull/187

This was not stated on their README page or in the command's help message, and I only found out because I was looking through old issues.

Also, while learning about PNG ancillary chunks, I noticed that the --safe option removes chunks that do affect the rendering of the image, like the gAMA chunk. They were originally on the safe list, but I don't know why that changed.

Blocks that should never be stripped: IHDR, IDAT, tRNS, PLTE, IEND. Blocks that should be preserved with safe: All of the above, plus cHRM, gAMA, iCCP, sBIT, sRGB, bKGD, hIST, pHYs, sPLT

Currently the --safe option only preserves cICP (new one), iCCP, sRGB and pHYs.

Depending on what viewing software you use, it could really affect how the image is shown.

https://github.com/shssoichiro/oxipng/issues/22 https://github.com/shssoichiro/oxipng/issues/24

taskhawk commented 1 year ago

@Deejay85 have you figured out what are you going to do about the metadata?

Twi-Hard commented 1 year ago

Also, while learning about PNG ancillary chunks, I noticed that the --safe option removes chunks that do affect the rendering of the image, like the gAMA chunk. They were originally on the safe list, but I don't know why that changed.

I don't really know how to read code but doesn't this say it doesn't strip gAMA? https://github.com/shssoichiro/oxipng/blob/067f344823b8648ab3b97e193149429753d76752/src/lib.rs#LL238C1-L240C91

taskhawk commented 1 year ago

Yeah, but that was the code 8 years ago. The current code, from 4 days ago, seems to be this one:

https://github.com/shssoichiro/oxipng/blob/88b930b5b196f44ce103903acf12d2bbe1eda1ec/src/headers.rs#L89

I don't know how to read Rust either, but I downloaded the code, grepped for "gAMA" and didn't find any mention of it.

I also tested the --safe option on a PNG with gAMA and it was indeed removed. I checked the chunks using pngcheck.
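
For anyone who wants to reproduce that check, something along these lines should do it (a sketch; file names are placeholders, and pngcheck -v prints every chunk):

pngcheck -v before.png                                   # lists gAMA among the chunks
oxipng -o max --strip safe --out after.png before.png    # optimize to a separate output file
pngcheck -v after.png                                    # gAMA is no longer listed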

Twi-Hard commented 1 year ago

What command should I use then? They removed a lot more than just gAMA since then. By "command" I mean which options to use for --strip or --keep.

taskhawk commented 1 year ago

Leaving --strip and --keep out should leave all ancillary chunks intact.

If you want to remove chunks that won't affect how the image is rendered, you could remove the textual ones and a couple of others with --strip:

--strip tEXt,zTXt,iTXt,eXIf,tIME

Check https://w3c.github.io/PNG-spec/#11Ancillary-chunks for more information on each chunk.
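
Plugged into the exec postprocessor shown earlier in the thread, that would look roughly like this (a sketch; it just swaps the explicit chunk list in for safe and keeps the rest of Twi-Hard's command unchanged):

"command": "oxipng -o max --strip tEXt,zTXt,iTXt,eXIf,tIME --force --preserve {}",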