jhc13 / taggui

Tag manager and captioner for image datasets
GNU General Public License v3.0

add archive reading support #192

Open yggdrasil75 opened 4 weeks ago

yggdrasil75 commented 4 weeks ago

This serves two purposes: saving space, and adding general support for comics. This is a draft, as I first want to figure out how calibre stores comic metadata and then add that properly when the embed option is enabled.

In addition, this changes the settings screen quite a bit, because I thought I was adding too many options to leave it as one screen. I can probably split the two changes if you want better tracking of them. With the settings, I also plan to move some auto-tagging options there that won't change more than once a session (e.g., the captioning device won't be changed often).

jhc13 commented 4 weeks ago

Splitting the settings into multiple tabs could be useful in the future.

However, support for comics is definitely out of scope for the program.

yggdrasil75 commented 4 weeks ago

I will split the PRs then.

However, on comics: would a more generic "index archives" option be more in scope, so that users could include comic book archives in their list of indexed archive formats (since cbz is a renamed zip, etc.)? This request is for two reasons: archives of comics (especially ones with consistent formatting) are much smaller on disk even with minimal compression, and it would help quite a bit with organizing a calibre comic library or a similar ebook library.

yggdrasil75 commented 4 weeks ago

Also, I definitely grabbed the wrong source revision when making this PR. So many conflicts.

jhc13 commented 4 weeks ago

> however on comics: would a more generic "index archives" option be more in the scope, and users could include comic book archives in their list of index archive formats? (since cbz is zip renamed, etc).

If they are simple zip files containing images (without special processing required for comic formats), it could be in scope. But I'm not sure if displaying the images, adding caption text files, etc., without decompressing the archives will be easy to achieve while also not being too slow.
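To make the caption-writing concern concrete, here is a minimal sketch using only Python's standard `zipfile` module. The function name `add_caption_to_archive` is hypothetical and not part of TagGUI. It assumes a plain zip-based archive and relies on append mode, which adds a new member without recompressing the existing ones.

```python
import zipfile
from pathlib import PurePosixPath


def add_caption_to_archive(archive_path, image_name, caption):
    """Append a caption .txt file next to an image inside a zip archive.

    Hypothetical helper for illustration only; not TagGUI's actual code.
    """
    # Member names inside zip files always use forward slashes.
    caption_name = str(PurePosixPath(image_name).with_suffix('.txt'))
    with zipfile.ZipFile(archive_path, mode='a',
                         compression=zipfile.ZIP_DEFLATED) as archive:
        if caption_name in archive.namelist():
            # zipfile cannot replace a member in place; changing an existing
            # caption would mean rewriting the whole archive.
            raise FileExistsError(caption_name)
        archive.writestr(caption_name, caption)
    return caption_name
```

Adding a first caption is cheap in this sketch, but editing an existing one is not, since the standard library offers no in-place update. That is likely where most of the slowdown would come from.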

> and it will help quite a bit organizing a calibre comic library or similar ebook library.

As mentioned in the README, TagGUI is designed for managing image datasets for generative AI models. Other use cases are not specifically supported.

yggdrasil75 commented 4 weeks ago

Comics require no special processing. It's literally just a zip archive with a .cbz extension instead of .zip (the same goes for cbr and rar, cb7 and 7z, and cbt and tar). The archive could be decompressed in memory (depending on available RAM and the size of the archive) with minimal writing to disk, if I use the proper library, and all extra files (e.g., an unsupported ComicInfo.xml file, Thumbs.db, and so on) would then be dropped from memory by automatic garbage collection.

The only real issue would be writing, especially if the archive is solid or heavily compressed. If it's a standard Windows .zip, anything newer than a 4th-gen i5 can probably handle it at nearly the same rate as uncompressed files, but solid-block archive methods (manual settings on a 7z) would start costing processing time. Leaving it as an option would allow users to disable it if it slows everything down.
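A rough sketch of the in-memory reading idea, covering only zip-based archives (.zip and .cbz) through Python's standard `zipfile` module. The name `iter_archive_images` and the `IMAGE_SUFFIXES` set are assumptions for illustration, not TagGUI code. The rar and 7z variants would need third-party libraries, which are not shown here.

```python
import zipfile
from pathlib import PurePosixPath

# Assumed set of image extensions; TagGUI's real list may differ.
IMAGE_SUFFIXES = {'.bmp', '.gif', '.jpg', '.jpeg', '.png', '.webp'}


def iter_archive_images(archive):
    """Yield (member_name, image_bytes) for each image in a zip archive.

    `archive` may be a path or a seekable file-like object. Members that
    are not images (metadata files, thumbnail caches, directories) are
    skipped without being decompressed.
    """
    with zipfile.ZipFile(archive) as zip_file:
        for info in zip_file.infolist():
            if info.is_dir():
                continue
            suffix = PurePosixPath(info.filename).suffix.lower()
            if suffix not in IMAGE_SUFFIXES:
                continue
            # read() decompresses this single member into memory only.
            yield info.filename, zip_file.read(info)
```

Because this is a generator, only one decompressed image is held at a time unless the caller keeps references, which keeps memory use bounded for large archives.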

jhc13 commented 4 weeks ago

Alright. If everything works smoothly, it could be a useful addition.

I'm just slightly worried that it might end up being too slow and it would be a lot of wasted effort.