simulot / immich-go

An alternative to the immich-CLI command that doesn't depend on a nodejs installation. It tries its best to import Google Photos takeout archives.

Feature request/bug report for .tgz files #36

Closed · Jurrer closed this issue 11 months ago

Jurrer commented 11 months ago

I was trying to use this great tool on a .tgz file generated by Google Takeout, but all I got was OK

Server status: OK
Get server's assets... 713 received
Browsing google take out archive...Done.
0 media scanned, 0 uploaded.
Done.

I don't really know if you officially support non-.zip files or not, but it would nicely complete the whole experience.

The upload command worked flawlessly on the unpacked directory though. 😁

simulot commented 11 months ago

Thanks for using the project. I'm glad you have successfully imported your photos.

The takeout archive is not very well organized. You need to know where all the files are located before you can work with them.

Usually, you have to unzip all the parts and then process the resulting folders one by one. Fortunately, the list of files can be obtained before you unzip them. It's easy to read individual files without unpacking the entire archive. Immich-go takes advantage of this to process all zip files without first unzipping everything. This is why the program is so fast at handling zip files.
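
A minimal sketch (not immich-go's actual code) of why ZIP parts can be handled this way: Go's archive/zip reads the central directory, so every file name is known immediately and any single entry can be opened on its own. The file name is just an example.

    package main

    import (
    	"archive/zip"
    	"fmt"
    	"log"
    )

    func main() {
    	zr, err := zip.OpenReader("takeout-001.zip") // example file name
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer zr.Close()

    	// The full file list is available up front, without reading any file data.
    	for _, f := range zr.File {
    		fmt.Println(f.Name)
    	}

    	// Any individual entry can be opened directly, e.g. the first one.
    	if len(zr.File) > 0 {
    		rc, err := zr.File[0].Open()
    		if err != nil {
    			log.Fatal(err)
    		}
    		rc.Close()
    	}
    }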

For tar files (and tgz files are similar), you can only read them sequentially. The list of their files becomes known only after unpacking the entire archive. Because files are scattered across all the parts, all the parts must be unpacked before starting the process.
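
By contrast, a .tgz can only be walked front to back: archive/tar hands out one header at a time as the compressed stream is read, so a name only becomes known once the stream reaches it. Again a minimal sketch with an example file name, not immich-go's code.

    package main

    import (
    	"archive/tar"
    	"compress/gzip"
    	"fmt"
    	"io"
    	"log"
    	"os"
    )

    func main() {
    	f, err := os.Open("takeout-001.tgz") // example file name
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	gz, err := gzip.NewReader(f)
    	if err != nil {
    		log.Fatal(err)
    	}
    	tr := tar.NewReader(gz)

    	for {
    		hdr, err := tr.Next() // each name appears only when the stream reaches it
    		if err == io.EOF {
    			break
    		}
    		if err != nil {
    			log.Fatal(err)
    		}
    		fmt.Println(hdr.Name)
    	}
    }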

The idea of opening all parts of the archive simultaneously and checking the files can't be used here: Immich-go would still need to unpack all the parts in any case.

So I chose not to proceed with implementing this request, knowing there are alternatives:

  1. Unpack all the tgz files and then work with the resulting folder.
  2. Request Google to create zip files.

stevenwalton commented 9 months ago

For tar files (and tgz files are similar), you can only read them sequentially. The list of their files becomes known only after unpacking the entire archive. Because files are scattered across all the parts, all the parts must be unpacked before starting the process.

Is this true? I can perfectly read tar and gzipped tar file names with vim or simply using tar -t or tar --list (you actually need tar -tf but you get the point). You can perfectly extract individual files out of a tar bundle (zipped or not), so I'm not understanding what the issue is. The first hit on Google is this SO post. Just for clarity, tar is simply a bundle of files; -z enables gzip, which is the compression, otherwise a tar bundle isn't compressed (sometimes people use bzip but that's not as common anymore). I don't know the first thing about Go so I can't really submit a PR, but I do know nix systems. It looks like this SO post is performing the exact task of reading tar contents without unzipping.

I think this would be a big benefit to users, since tar bundles are the best way to download large amounts of photos: they have a higher 50 GB limit, where zip is 20.

simulot commented 9 months ago

The takeout archive isn't well organized. Albums may be missing some images, images that can be found in another folder... A sequential read doesn't allow for this.

A tar file is a sequential file. This comes from the fact that it was designed to be written to magnetic tapes back in the day. To know where a file is and read it, you have to read the tar until you find it. If the file is missing, you have read the whole archive for nothing. And a takeout file may contain thousands of files.

I know that the standard Go library can read a tar file, or a gzipped tar file. But the fundamental problem remains: a tar file is a purely sequential file. The code on SO is just reading the whole file, from the beginning to the end.

The solution is as simple as decompressing the tgz file into a folder and using that folder as input. I prefer to let the user do it, to choose where to decompress the file, etc.

stevenwalton commented 9 months ago

I guess I'm not understanding then (I'm trying to; sometimes things aren't as clear over text. I would like to learn and correct my understanding). Maybe my misunderstanding is about takeout? I'm not quite getting what you mean by a sequential file and the role that plays here. Doesn't the uploader create albums if they don't exist and add to them if they do? If so, why could we not extract one file at a time (this can be parallelized), place it in the right location, and create the right file structure? Is this not how you do it for zip? If we need to understand the whole structure before extracting (this is what confuses me), then we could just read all the tar bundles provided, place the directories into a graph, and now we have our full structure (fwiw, if I open the tarball with vim the system actually calls gzip -dc, is this what you're referring to?). The motivation seems to be the same as with zip: that we don't have to take the time to decompress, copy, delete. (Can you clarify, is immich-go performing a copy action or a move action? I'm in the process, so I'll find out soon enough, but my poor pi lol.)

The confusion is that when I'm looking at the takeout I see a POSIX-like structure. For example Takeout/Google Photos/Photos from 2021/PXL_12345678_123456789.jpg.json or Takeout/Google Photos/ALBUM NAME/IMG_1234.JPG. Takeout/archive_browser.html seems to have all the folder information within it, including folders that are not in that tarball (under extracted-folder-name). I think they create an html page that has the directory structure. I believe it is only in the first tar file btw, because I don't see it in the others (in those I exclusively see media files and jsons, like your note in the readme discusses).

simulot commented 9 months ago

Granted, I hadn't paid attention to the archive_browser.html.

I did the exercise of requesting my takeout as tgz... Like zips, you are limited to 50 GB archives. So for me, 2 files.

The index file Takeout/archive_browser.html is the 44153rd file of the 1st archive:

time tar -xzf ~/Downloads/takeout-20231129T080411Z-001.tgz Takeout/archive_browser.html

real    3m45,718s
user    3m18,786s
sys     0m31,602s

3m45s on my beefy Intel Core i7, 12 cores, 16 GB of RAM, with an SSD... just to get the index file...

So it's not a big win. Worse: images and json files are scattered across folders... and even worse, across tgz files...

So I see no way to use the tgz archives directly.

stevenwalton commented 9 months ago

Sure, but that's a one-time hit, it doesn't require the extra disk writes, it only requires extracting one takeout (I have 6), and it's only needed if you have to know all the files and their locations a priori.

I think where I'm confused, and where we're talking past one another, is why you need to have that structure in the first place (I intended the archive_browser comment as an offhand remark, not a real solution). So here's why I'm a bit lost. I believe you are probably considering factors that I am not, so I want to figure out what those are. Allow me to explain how I'm seeing the issue and hopefully that can clarify what I've missed.

From my best understanding, you have 99% of the data needed simply from the tarball and the image's exif data. I notice in your notes that you mention you are using the json files to repair album names and some metadata. I dug into the documentation and the tarballs (I have time while my pi uploads files) and my best understanding is that album names are only bad if they include unicode characters in their names (e.g. album "One & Two" becomes "One _ Two").

As far as I'm noticing though, the metadata from exiftool appears far more accurate than the jsons. In fact, I personally can't find a json file with GPS data, but I have plenty of files with GPS data in the exif metadata. After my upload completed I have a few dozen images located at Null Island. I also find the accurate creation date in the exif; I've become more aware that Immich doesn't always look for the right one. The tags Create Date and Modify Date sometimes show up twice, but they also seem to occur with a unique Date/Time Original tag where these are at the end (presumably appended by Google) (I'm going to update that issue with a clearer explanation). The metadata photoTakenTime variable exactly matches the latter values btw (I don't see anything that matches creationTime). Most importantly, we should use whatever timestamp data matches the filename. As far as I can tell, the json is composed of some extracted exif data and post-processed information like people (maybe we care?), description (we can autofill it for Immich), and the number of views (I don't think this matters).
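
A sketch of reading the timestamps discussed above from a takeout sidecar json. The field names beyond photoTakenTime/creationTime are taken from my own takeout and may differ between exports; treat the struct as an illustration, not a spec, and the file path as an example.

    package main

    import (
    	"encoding/json"
    	"fmt"
    	"log"
    	"os"
    	"strconv"
    	"time"
    )

    type sidecar struct {
    	Title          string `json:"title"`
    	Description    string `json:"description"`
    	PhotoTakenTime struct {
    		Timestamp string `json:"timestamp"` // Unix epoch, stored as a string
    		Formatted string `json:"formatted"`
    	} `json:"photoTakenTime"`
    	CreationTime struct {
    		Timestamp string `json:"timestamp"`
    		Formatted string `json:"formatted"`
    	} `json:"creationTime"`
    }

    func main() {
    	raw, err := os.ReadFile("PXL_12345678_123456789.jpg.json") // example path
    	if err != nil {
    		log.Fatal(err)
    	}
    	var sc sidecar
    	if err := json.Unmarshal(raw, &sc); err != nil {
    		log.Fatal(err)
    	}
    	secs, err := strconv.ParseInt(sc.PhotoTakenTime.Timestamp, 10, 64)
    	if err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println(sc.Title, time.Unix(secs, 0).UTC())
    }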

So the process I see is: extract the file, grab the exif data, and upload according to the directory's album name. The file can then be discarded (ideally upload is a move action, not a copy action). If we run across a json file, we can store it in a temporary location for a post-processing pass, or just extract what we need into a post-processing file. The album metadata.json can usually be told apart from the per-image jsons, since most of the per-image ones have numbers, underscores, image/video extensions, or keywords like "BURST", "PXL", "PANO", "NIGHT", etc. in their names (we can't fully rely on that: downloaded images won't always have such a marker, e.g. if you download an Imgur file you get jEkVxub.gif with a corresponding json. I don't know if that's guaranteed, but it helps us narrow down the possible metadata files, and metadata.json isn't guaranteed to be in every folder anyway. At least my picture jsons have different keys, so reading the file should tell you which kind you have). Either address the album renaming at that point, or build a map that we'll use in the post-processing step or as soon as we find that file during extraction.

The motivation to upload and extract together? Fewer disk writes, which are the main bottleneck here (especially on something like a pi). If we extract and then upload (as the program appears to do), we have to do a lot of waiting, so why not automate it (at worst, the quick solution is just to add functionality that extracts the tarball for the user). We can be far more efficient in our processing, since the extraction process is neither disk nor CPU heavy. Essentially you're parallelizing the operation, turning the bursty operations into a smoother load. It saves users a lot of time and compute. There's no reason to walk through the file structure multiple times; one pass is enough, and that pass happens during tar extraction.
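
A minimal sketch of that streaming idea: the file data is piped straight from an io.Reader (which could be a *tar.Reader positioned on an entry) into a multipart HTTP request, so nothing touches the disk. The URL and form field name below are made up for the illustration; they are not immich's actual API.

    package main

    import (
    	"io"
    	"log"
    	"mime/multipart"
    	"net/http"
    	"os"
    )

    // upload streams one file to a hypothetical endpoint without buffering it on disk.
    func upload(name string, r io.Reader) error {
    	pr, pw := io.Pipe()
    	mw := multipart.NewWriter(pw)

    	// Write the multipart body in the background while http.Post reads it.
    	go func() {
    		part, err := mw.CreateFormFile("assetData", name) // field name is a guess
    		if err == nil {
    			_, err = io.Copy(part, r)
    		}
    		if err == nil {
    			err = mw.Close()
    		}
    		pw.CloseWithError(err)
    	}()

    	resp, err := http.Post("http://immich.local/api/upload", mw.FormDataContentType(), pr)
    	if err != nil {
    		return err
    	}
    	defer resp.Body.Close()
    	_, err = io.Copy(io.Discard, resp.Body)
    	return err
    }

    func main() {
    	// A plain file stands in for the tar entry, just to keep the sketch runnable.
    	f, err := os.Open("IMG_1234.JPG")
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()
    	if err := upload("IMG_1234.JPG", f); err != nil {
    		log.Fatal(err)
    	}
    }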

So am I missing or misunderstanding something?

===================

Extracting Side Note:

For extracting the tar files I recommend pv takeout-* | tar xzif -. You'll need to install pv, but that lets you see progress. I did about 270 GB in an hour on a PCIe 3 nvme. CPU isn't important, or even disk that much. I'd get bursts at 180M/s but generally the 50-90M/s range because of the smaller files and my CPU (a Ryzen 9 5900X barely noticed it). My I/O rate could have been handled by 1 PCIe lane. So for most people, something like this might be more logical: find /path/to/tgz/files -type f -name "*.tgz" -print0 | xargs -0 -n 1 -P <num proc> tar xzif | pv. For comparison, I used -P 6 and went from ~50 minutes to ~39 minutes. PCIe 3.0x2 can handle this, and you should have 4 lanes available. RAM was never an issue; the process's usage never hit 500 MB (nowhere near my 64) and was not meaningful compared to idle. I probably should use a lower parallelization, but I was just playing around while I was waiting on the pi.

Uploading Side Note:

My Immich-go upload had a lot of issues (that's done on a pi 4 btw). Most of my videos (they're usually 4K and about a minute, so around half a gig) failed due to timeouts. I'm not sure what the best solution is here, but some retries could help. If you do end up going the parallel route, some async operations would nicely handle this. Another issue I saw was failures on "MP" files. Idk what Google is doing here, but the exif data shows that these are mp4 files, and if you change the extension to MP4 (if you're not using linux, because linux doesn't care about extensions) it'll play just fine. These are the motion photos. If you do use the exif call on each item when uploading to get the metadata, you could extract the MIME Type key (it'll show as video/mp4, or you can pull File Type to get just "mp4"). This could be a nice check if you hit an unknown filetype, since I'm sure Google will pull weird things like this again and this gives a clean fallback.
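
A sketch of that fallback idea: when the extension is unknown (e.g. the ".MP" motion-photo files), sniff the MIME type from the first bytes instead of trusting the name. Go's http.DetectContentType recognises video/mp4 among other types; the path and the expected output in the comment are examples, not guarantees about every .MP file.

    package main

    import (
    	"fmt"
    	"log"
    	"net/http"
    	"os"
    )

    func main() {
    	f, err := os.Open("PXL_12345678.MP") // example path
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	buf := make([]byte, 512) // DetectContentType looks at no more than 512 bytes
    	n, err := f.Read(buf)
    	if err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println(http.DetectContentType(buf[:n])) // e.g. "video/mp4"
    }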

simulot commented 9 months ago

Answering your question gives me an opportunity to reassess the whole procedure.

TGZ direct reading

The takeout archive isn't usable directly. Back then, the import used regular folders filled by extracting the archive. It was easy to peek at files when needed. Implementing this directly on a set of ZIP files wasn't a big deal. But it wasn't possible with TGZ files.

Since then, I have reworked the code deeply to cope with the problems of Google Photos takeout files. The current version reads the content of all json files present in the archive and builds an index of the regular files. Once the takeout puzzle is solved, the files are uploaded.

I think TGZ direct import can be done in 2 passes and zero temporary writes:

  1. Build a map of files, and read all the JSON.
  2. Rewind all the tgz files, reread them, and upload the files according to the plan built in pass 1.

Reading can be parallelized; this is where Go shines. A rough sketch of the idea follows below.
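
The sketch uses plain archive/tar and compress/gzip from the standard library; the upload decision and the uploader itself are placeholders, not the real immich-go logic, and the part names are examples.

    package main

    import (
    	"archive/tar"
    	"compress/gzip"
    	"io"
    	"log"
    	"os"
    	"strings"
    )

    // plan records what pass 1 learned: which entries pass 2 should upload.
    type plan struct {
    	upload map[string]bool
    }

    // walk streams one .tgz part and calls fn for every regular file it contains.
    func walk(part string, fn func(hdr *tar.Header, r io.Reader)) {
    	f, err := os.Open(part)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()
    	gz, err := gzip.NewReader(f)
    	if err != nil {
    		log.Fatal(err)
    	}
    	tr := tar.NewReader(gz)
    	for {
    		hdr, err := tr.Next()
    		if err == io.EOF {
    			break
    		}
    		if err != nil {
    			log.Fatal(err)
    		}
    		if hdr.Typeflag == tar.TypeReg {
    			fn(hdr, tr)
    		}
    	}
    }

    func main() {
    	parts := []string{"takeout-001.tgz", "takeout-002.tgz"} // example part names

    	// Pass 1: read every part once, decode the json sidecars, decide what to upload.
    	p := plan{upload: map[string]bool{}}
    	for _, part := range parts {
    		walk(part, func(hdr *tar.Header, r io.Reader) {
    			if strings.HasSuffix(hdr.Name, ".json") {
    				// decode the sidecar here and record album/date decisions
    				return
    			}
    			p.upload[hdr.Name] = true // placeholder decision
    		})
    	}

    	// Pass 2: read every part again and hand the selected entries to the uploader.
    	for _, part := range parts {
    		walk(part, func(hdr *tar.Header, r io.Reader) {
    			if p.upload[hdr.Name] {
    				log.Printf("would upload %s", hdr.Name) // placeholder uploader
    			}
    		})
    	}
    }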

That said, getting ZIP files, or decompressing the TGZs, are easy workarounds. I'll give this a low priority compared to bugs and functionality. For regular archive import (non Google Photos), tgz could be implemented easily.

Why use JSON files?

JSON files are used for:

  - fixing album and file names
  - the date of capture when the image has no exif data (many users have no exif data in their photos)
  - the description (not used by immich-go, should it be?)
  - people names (no API on the immich server, nor use of them by the server)

So we can't ignore JSONs.

EXIF information embedded in the file is better than JSON metadata.

Immich reads the exif anyway at upload time or with the "Extract metadata" job. If the metadata aren't read correctly, file an issue on the Immich project.

I have a plan to supply immich with sidecar files to correct immich data or to add GPS coordinates taken from a position-tracking log. This project is on hold because of open issues and the lack of an immich API to manipulate the metadata.

Timeout when uploading 4K videos on a PI4

I'm not surprised. The timeout is due to server overload... nothing that immich-go can resolve. Multitasking would increase the problem.

MP files can be ignored

Read this

stevenwalton commented 9 months ago

Thank you, this really helps clarify things. I appreciate you taking the time to explain it. As I'm learning, I'm finding that image data is quite messy and it seems there are more exceptions to the rules than rules.

I think TGZ direct import can be done in 2 passes and zero temporary writes:

I think it may be more efficient to do the single pass on the tarballs, since they are going to be the slowest part by far. The temporary information you'd need to store for the single-pass option is quite small in comparison and likely could all fit in memory even on a raspberry pi. I mean, all the json files are <1 KB in size and most of that information can be discarded too. This is easily handled (though I'm pretty confident immich has a memory leak, but I haven't had time to debug it: my Pi's (4B+ with 4 GB) swap constantly fills up, will not release, and the whole system slows down. But this isn't your issue).

Unless I'm missing something, the "takeout puzzle" can still be solved simultaneously with the extraction. I agree that there will be some errors in naming but, imo, it makes more sense to do a repair operation than to do a full reread (even if you terminate the tar reads early, and I don't see how to do that consistently without essentially implementing the one-pass logic anyway). Just considering the big O of both compute and memory, the single pass is lower. The compute is the number of files (N) + the number of repairs (R), while memory is the json files (at most N). Our worst-case memory situation would be to read all the json files first. If we have a really heavy user with 50k images (my 6 takeouts are <40k files, which includes jsons) and they all had 1 KB, that's only 50 MB, and the processing should cut that by more than half. For compute, the tar-reading operation is far heavier than the json-reading operation even if R=N. Both our methods assume you read the json file, so that's equal; the difference is just a second pass on the tarball vs a second pass on processed json information.

The question to me is more which is better: upload all images and then repair or upload and interrupt as soon as repair is possible. Go's concurrency would seem beneficial to the latter. I suspect collisions would be quite rare.

So we can't ignore JSONs.

I agree that json can't be ignored; I was simply trying to say that much of the data within it is redundant with the exif data, and that at least for many of my files the exif data is far more reliable. But immich does have issues with this too. At minimum you need the json file to repair the album name. I don't know about filenames, but at least some testing on my mac shows that the exif data preserves the unicode if I rename my file to some emoji; I can understand if something weird happens with gphotos though.

Many users have no exif data in their photos

I'm going to have to trust you on this, but I'm actually surprised. I haven't run into such a photo; all phones add exif data, and as far as I'm aware the vast majority of cameras do, certainly the most popular ones (assuming digital, but most film-scanning software is going to add exif data too). Plus gphotos adds exif data to images as well. I certainly don't have a file that has a sidecar but doesn't have the minimal datetimes in the exif data.

My point is more that at least in my files, the json metadata is unreliable and leads to bad datetimes.

Description (not used by immich-go, should it be?)

¯\_(ツ)_/¯ personally I don't use the feature but I certainly am not an average user. Might not hurt to just add it to the feature list at low priority.

People name (no API on immich server, nor use of them by the server)

It is unfortunate that immich doesn't have an API for this, but there is a use for it. It could be used to auto-name the ML facial identification, which is a bit of an annoying process. I think just providing that data in the sidecar you're planning would be sufficient. Immich would have to handle it, because classification happens post-upload. But if you're providing that sidecar anyway, it's only a few more bits and gives them useful data.

EXIFTOOL requires the image to be stored in a file

I'm surprised; I was googling and not finding evidence for this. I mean, the data is part of the file, so yes, you need to read the file, but that data is in the stream and so can be processed concurrently. tar xzf test.tgz -O | exiftool - gets me the exif data without writing the file, or even tar xzf test.tgz -O | tee newfile.png | exiftool - to write it to a new file, and I'm operating on stdio. I know tar was built with uses like piping compressed streams through ssh in mind.
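
The same thing can be sketched in Go: EXIF can be read from any io.Reader, not just a file on disk. This uses the third-party github.com/rwcarlsen/goexif package purely as an illustration; immich-go may well use a different library, and the path is an example.

    package main

    import (
    	"fmt"
    	"log"
    	"os"

    	"github.com/rwcarlsen/goexif/exif"
    )

    func main() {
    	// Any io.Reader works here: a file, an HTTP body, or a *tar.Reader positioned
    	// on an image entry. A plain file is used only to keep the sketch runnable.
    	r, err := os.Open("IMG_1234.JPG") // example path
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer r.Close()

    	x, err := exif.Decode(r)
    	if err != nil {
    		log.Fatal(err)
    	}
    	taken, err := x.DateTime() // date of capture, with fallbacks
    	if err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println(taken)
    }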

timeout

I just think a retry loop and flag would be a nice thing to have. I'm still a little confused about the upstream process, but if it is rsync- or wget-like then a partial upload should be recoverable, and you can get the upload through after a retry or two; that also solves the issue of never getting the upload at all when immich times out.
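
A sketch of the kind of retry loop I mean: re-attempt a failed upload a few times with a growing delay. The doUpload function is a placeholder for whatever call performs the actual upload; nothing here is immich-go's real code.

    package main

    import (
    	"errors"
    	"fmt"
    	"time"
    )

    func uploadWithRetry(name string, attempts int, doUpload func(string) error) error {
    	var err error
    	for i := 0; i < attempts; i++ {
    		if err = doUpload(name); err == nil {
    			return nil
    		}
    		wait := time.Duration(i+1) * 5 * time.Second // simple linear backoff
    		fmt.Printf("upload of %s failed (%v), retrying in %s\n", name, err, wait)
    		time.Sleep(wait)
    	}
    	return fmt.Errorf("giving up on %s after %d attempts: %w", name, attempts, err)
    }

    func main() {
    	// Fake uploader that always times out, just to exercise the loop.
    	fail := func(string) error { return errors.New("timeout") }
    	if err := uploadWithRetry("VID_0001.mp4", 3, fail); err != nil {
    		fmt.Println(err)
    	}
    }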

I'm currently exploring some settings, but immich is unreasonably hard on my pi. I may not actually be able to get to this as other things are higher priority on my plate, but my swapping issue seems to be related to concurrency: reducing it has also reduced the swapping, though it does return. For now I'm probably just going to expand the swap space. It looks to be related to postgres and the python virtual environment but again, not your issue.

Multitasking would increase the problem.

Parallelism should always be a flag. I agree that it would increase the timeout problem. It's annoying to dynamically adjust parallelism and it's perfectly acceptable to rely on a user to specify the amount of parallelism and let them deal with the consequences.

simulot commented 9 months ago

I think it may be more efficient to do the single pass on the tarballs,

Of course. But reading the tgz is way faster than writing their content, which in turn is faster than uploading them. Anything else, like loops, lookups in maps, or sorting lists, is negligible.

So maybe this time is worth spending to solve the puzzle upfront, compared to:

the "takeout puzzle" can still be solved simultaneous to the extraction.

I don't think this will work. But you can give it a try.

names of photos

You should read again all the issues about names. It's a nightmare.

lack of people tags

Sure, but I don't know what I can do before this feature is available in immich.

description

Granted... when the API accepts it.

EXIFTOOL requires the image to be stored in a file

Here is the point:

tar xzf test.tgz -O | tee newfile.png | exiftool -

You have handled one file.

When you want to handle thousands of files, you should use the batch capabilities. EXIFTOOL is loaded once with the -stay_open option. Then it watches for file names added to a text file and processes them. So only files on the filesystem.
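
For illustration, a rough sketch of driving that batch mode from Go, under the assumption that exiftool is installed and that the args-file protocol works as documented (append file names, then -execute, and wait for the {ready} marker); the paths and tags are examples only.

    package main

    import (
    	"bufio"
    	"fmt"
    	"os"
    	"os/exec"
    	"strings"
    )

    func main() {
    	// Args file that exiftool will watch; the path is arbitrary for this sketch.
    	argsPath := "exiftool_args.txt"
    	args, err := os.Create(argsPath)
    	if err != nil {
    		panic(err)
    	}
    	defer os.Remove(argsPath)

    	// Start exiftool once; it stays resident and processes batches as they arrive.
    	cmd := exec.Command("exiftool", "-stay_open", "True", "-@", argsPath)
    	out, err := cmd.StdoutPipe()
    	if err != nil {
    		panic(err)
    	}
    	if err := cmd.Start(); err != nil {
    		panic(err)
    	}
    	reader := bufio.NewReader(out)

    	// Queue one file (an example path) and ask exiftool to execute the batch.
    	fmt.Fprintln(args, "-DateTimeOriginal")
    	fmt.Fprintln(args, "photo.jpg")
    	fmt.Fprintln(args, "-execute")
    	args.Sync()

    	// exiftool prints "{ready}" when the batch is done.
    	for {
    		line, err := reader.ReadString('\n')
    		if err != nil {
    			break
    		}
    		if strings.HasPrefix(line, "{ready") {
    			break
    		}
    		fmt.Print(line)
    	}

    	// Tell exiftool to shut down.
    	fmt.Fprintln(args, "-stay_open")
    	fmt.Fprintln(args, "False")
    	args.Sync()
    	cmd.Wait()
    }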

timeouts

When you look at what immich is doing when an asset arrives (metadata extraction, transcoding, the machine-learning jobs...), I'm not surprised that a PI4 is out of breath.

parallelism

The constraint is the server. Maybe some day, if I find a nice way to adapt the number of threads to the server capacity.

stevenwalton commented 9 months ago

But reading the tgz is way faster than writing their content, which in turn is faster than uploading them.

These are fair points. But my belief is that the only write is the one via upload, and thus it can be done through an asynchronous process. You make a good point, though, that with insufficient memory on the upstream machine this will still stall if the read and write operations get too far out of sync; it may not be an issue if the local machine has sufficient memory.

There isn't an API for changing date of capture

This is surprising to me. But they do seem to have date issues, so I understand why you'd want to process first given this limitation, the alternative being that the upstream machine would need to repair the exif data.

over 50000 photos, collisions happen when iphone photos

Thank you. This does clarify a case I did not see in the context of my Google takeout. I did not predict that there could be two files with the same name; I had assumed they needed to be under different directories (meaning a different full path) or have an indicator such as (n). This is indeed a surprising result.

You have handled one file.

Yes, my point was to demonstrate that you can read from a stream. This line can easily be extended to a larger tarball (and I have tested that too), but you're right that the extension needs the exiftool flag. But I assume you wouldn't use exiftool in your code anyway, as there are other Go libraries that read this metadata. It was just a demonstration of being able to read from a stream (via stdout), because you said you have to write to disk to read this data. You can take the tee out if you don't want the file; it was supposed to stand in for an upload. As in, the upstream machine need not process the exif data.

I still stand by the claim that exif data can be extracted from a stream, unless we want to consider stdout to be disk.

I'm not surprised that a PI4 is out of breath

The things you mention are surprisingly fine when concurrency is reduced or eliminated (transcoding is heavy, ML training is offloaded to an external machine (for some reason it's CPU-based via the conda lock file), and clip inference removed). The problem is that swap gets filled and does not release even though the maximum memory is not being used. RAM usage should not be low while swap is maxed out. This is still true when swappiness is reduced from the default values to recommended levels (though I haven't checked that this is set in the container). But this is an orthogonal problem, so it doesn't really need to be discussed here, and I'd like to pinpoint the issue a bit more before I bring it up as an official one. It's not an issue for immich-go, which does appear to allocate resources properly.

But as to timeouts, that's not a reason not to use retry loops, which are fairly standard in streaming processes due to the frequency of network disruptions for a variety of reasons. I can absolutely confirm I still get timeouts even when I heavily restrict the immich server. Retry loops just reduce user disappointment, given that you see partial success. Better logging would also help make explicit to the user exactly which files failed to upload, since they will likely only see the tail of the log, and it will be truncated by the terminal history if the output is not directed to a log file. It's a simple fix to a nasty and frequent problem. (It's also weird to see green text with a message about a file type not being supported, even if intended, as well as one that literally says Error. A sea of green is hard to read.)

Parallelism

Most tools leave this as a user flag for this reason.

simulot commented 8 months ago

I have studied in depth the possibility of handling tgz files directly, without a prior decompression.

I did a few tests with a takeout archive and archived folders... The order of the files isn't predictable. It happens that images are associated with a sidecar file (XMP). Both files must be uploaded into immich at the same time, but they can appear in any order, and some pairs of files aren't next to each other, or are even in different parts.

Solving the puzzle is not easy even when you have the data accessible upfront. It becomes harder if you solve it on the go. You have to juggle the immich state, what you have learned from previous uploads, and the current file. You may have to correct or delete what you have just uploaded... I'm not ready for that.

That leaves the 2-pass option. I have refactored the code in that direction. The collection of structure and metadata is done during pass one. During pass two, the decision to upload a given file is taken depending on its metadata, the CLI options, whether the file has been uploaded before, and so on. The resulting code is clearer, more robust, and ready for parallel reading of multiple archive parts.

I ended up with a working POC. On my own data archived as ZIPs, pass one and the puzzle resolution take 3 seconds before the upload starts. On my data as TGZs, pass one takes 300 seconds to read the full set of archives and get the puzzle solved.

The gain remains small compared to the code complexity. With more complex code comes the risk of bugs.

In the end I stand by my opinion: get ZIP files, or decompress the TGZs before the import.