simulot / immich-go

An alternative to the immich-CLI command that doesn't depend on nodejs installation. It tries its best for importing google photos takeout archives.
GNU Affero General Public License v3.0

A lot of images skipped from Google Photos Takeout #68

Closed JackBailey closed 11 months ago

JackBailey commented 12 months ago

Running on macOS 13.4 (22F66), immich-go 0.8.2 and immich v1.86.0.

Command:

./immich-go -log-level=INFO -server https://photos.****.*** -key ********** upload -create-albums -use-album-folder-as-name -keep-untitled-albums  -google-photos "/Users/jackbailey/Downloads/Takeout/Google Photos"

I unzipped the archive to make the files easier to check; the result is the same if it's zipped.

Result:

immich-go  0.8.2, commit 0792ade89f7f556c82225138e36fb9aa6f028576, built at 2023-11-11T20:07:29Z
Server status: OK
Connected, user: jack@*****
Ask for server's assets...
2563 asset(s) received
Browsing google take out archive...Done.

[UPLOAD LOG - No errors, all just uploads]

Managing albums
87 media scanned, 87 uploaded.
Done.

There are 2156 .jpg files and 164 .mp4 files so it should be uploading more than 87.

The INFO log only shows the 87 files being uploaded, no logs about skipping files, duplicates or anything.

Any idea?

simulot commented 12 months ago

Thanks for trying the project.

1/ The Google takeout has a lot of duplicates scattered across the year folders and albums.
2/ Immich-go checks whether files are already uploaded to the Immich server. If the server has the same files, or has them in better resolution, they are skipped.

Let me know if you have a different case.

JackBailey commented 12 months ago

1) This is takeout file 2 of 2, and I only selected the Photos from 20** albums, so there shouldn't be any duplicates.
2) I've seen a debug log for that before for my first takeout file, but it's not showing it this time.

simulot commented 12 months ago

Version 0.8.3 provides more log messages.

c0delama commented 11 months ago

I'm having the same problem. In absolute numbers only about half of my photos got imported. When i compare albums side by side on Google Photos and Immich, albums on Google Photos have more.

I also noticed that there are quite a few photos that can't be displayed, but that might be another problem. If needed, i can provide more information.

simulot commented 11 months ago

Sure. I'm working on improving the log and the report. The idea is to give full visibility into the choices the program makes.

Today I fixed a problem with live photos that broke some photos.

I'll be happy to examine your data. You can DM me on Discord @ simulot to share details.

simulot commented 11 months ago

I have found a possible reason for the missing images. The problem comes from the de-duplication method, which is a bit too aggressive. I'm working on a correction for this.

c0delama commented 11 months ago

Sorry that i couldn't provide more information so far, i had a few busy workdays. Is there any information that i can share that would help you fix the issue? I could do that today.

Btw: I ran the import again (on top of the result of the first import) and it imported 0 photos. So for me the behaviour seems to be deterministic at least.

simulot commented 11 months ago

Thanks for your offer. You may check whether my assumptions are correct:

I have found a problem with these kinds of files when they are placed in the same folder:

/Photos from Year 2016:
DSC_0238_1.JPG
DSC_0238.JPG
DSC_0238(1).JPG

They are all different, but the program discards one of them because of a rule meant to detect edited files such as:

PXL_20220405_090123740.PORTRAIT.jpg
PXL_20220405_090123740.PORTRAIT-modifié.jpg
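
To illustrate how such a rule can collide with ordinary file names, here is a minimal Python sketch. The regular expression `EDIT_SUFFIX` and both function names are hypothetical; this is not the rule used in immich-go, only an example of an over-eager "strip the suffix to find the original" rule.

```python
import re
from collections import defaultdict

# Hypothetical "edited file" rule: strip a trailing "(n)", "_n" or "-modifié"
# from the base name to find the presumed original.
# This is NOT immich-go's actual rule, only an illustration.
EDIT_SUFFIX = re.compile(r"(\(\d+\)|_\d{1,2}|-modifié)$")


def presumed_original(filename: str) -> str:
    """Return the name the rule would treat as the unedited original."""
    base, dot, ext = filename.rpartition(".")
    return EDIT_SUFFIX.sub("", base) + dot + ext


def group_by_original(filenames):
    """Group file names by the original the rule assigns to them."""
    groups = defaultdict(list)
    for name in filenames:
        groups[presumed_original(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    names = ["DSC_0238_1.JPG", "DSC_0238.JPG", "DSC_0238(1).JPG"]
    for original, members in group_by_original(names).items():
        print(original, "<-", members)
```

With this rule the three distinct `DSC_0238` files end up in one group, so a de-duplication step that keeps one file per group would drop the other two.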

I'll have a busy day. I'll work on this later this week.

c0delama commented 11 months ago

I don’t think this is the case for me. Here is an example: I have an album that contains six pictures and one video. The bold ones got imported, the others were skipped.

PXL_20230401_062926045.jpg
PXL_20230401_063637690.jpg
PXL_20230401_052119532.jpg
PXL_20230401_063119390.TS.mp4
PXL_20230401_062929337.jpg
PXL_20230401_063116440.jpg
PXL_20230401_052111882.jpg

In addition to the images, there are the following files in the folder:

PXL_20230401_062926045.jpg.json
PXL_20230401_063637690.jpg.json
PXL_20230401_052119532.jpg.json
PXL_20230401_063119390.TS.mp4.json
PXL_20230401_062929337.jpg.json
PXL_20230401_063116440.jpg.json
PXL_20230401_052111882.jpg.json
Metadaten.json

Metadaten.json contains

{
  "title": "Ana 34er",
  "description": "",
  "access": "protected",
  "date": {
    "timestamp": "1680378442",
    "formatted": "01.04.2023, 19:47:22 UTC"
  },
  "location": "",
  "geoData": {
    "latitude": 0.0,
    "longitude": 0.0,
    "altitude": 0.0,
    "latitudeSpan": 0.0,
    "longitudeSpan": 0.0
  }
}

I was using v0.8.3 of the Windows client with the following options:

immich-go -log-level INFO -server=http://xxx.xxx.xxx.xxx:xxxx -key=xxxxxxxxx upload -create-albums -google-photos X:/xxxxx/takeout-*.zip

In total 33529 items were scanned and 12952 have been uploaded. Let me know if that helps.

simulot commented 11 months ago

Thanks for this information. I'll check how this could fail.

In parallel, I'm working on a better import report that will explain how each file has been handled.

simulot commented 11 months ago

@c0delama

Google Albums with photos shared by other users are not in the takeout file

That's the way it is. To get photos shared by other users, you have to go to the album page in Google Photos and download everything. But you'll get low-resolution photos.

Your example

I don’t think this is the case for me. Here is an example: I have an album that contains six pictures and one video. The bold ones got imported, the others were skipped.

PXL_20230401_062926045.jpg PXL_20230401_063637690.jpg PXL_20230401_052119532.jpg PXL_20230401_063119390.TS.mp4 PXL_20230401_062929337.jpg PXL_20230401_063116440.jpg PXL_20230401_052111882.jpg

In addition to the images, there are the following files in the folder: PXL_20230401_062926045.jpg.json PXL_20230401_063637690.jpg.json PXL_20230401_052119532.jpg.json PXL_20230401_063119390.TS.mp4.json PXL_20230401_062929337.jpg.json PXL_20230401_063116440.jpg.json PXL_20230401_052111882.jpg.json Metadaten.json

I can't tell what is happening on this basis. I'm working on generating a journal that will list the actions taken for every file.

My own example

I'm working on the journal tool to make problems visible

c0delama commented 11 months ago

As soon as you have something, i'm happy to try it out and give feedback.

simulot commented 11 months ago

@c0delama the version 0.8.7 has better logs, and can write everything to a file. Use

-key .... -server ...  -log-level=INFO -log-file=immich-go.log upload ...

c0delama commented 11 months ago

Okay, so i ran it only on the month that had the pictures of the album i've described in my previous post.

I've used the following command:

immich-go -log-level INFO -server=http://xxx.xxx.xxx.xxx:xxxx -key=xxxxxxxxxxxxxxxxxxxxxxxxxx -log-file=immich-go.log upload -create-albums -google-photos -date=2023-04 X:/xxxx/takeout-*.zip

Here is the output (stripped to the logs referring to the album in question):

immich-go  0.8.7, commit 875d965251a1fcd1fc8ab87f2a2cbe0b13ec3f64, built at 2023-11-26T14:59:12Z

Server status: OK
Connected, user: xxxxx@xxxx.xxx
Ask for server's assets...
12988 asset(s) received
Browsing google take out archive...
Scanning the Google Photos takeout
Scanned                  : Takeout/Archiv_Übersicht.html: 

...

Metadata files           : Takeout/Google Fotos/Ana 34er/Metadaten.json: Album title: Ana 34er
Scanned                  : Takeout/Google Fotos/Ana 34er/PXL_20230401_052111882.jpg: 
Metadata files           : Takeout/Google Fotos/Ana 34er/PXL_20230401_052111882.jpg.json: Title: PXL_20230401_052111882.jpg
Scanned                  : Takeout/Google Fotos/Ana 34er/PXL_20230401_052119532.jpg: 
Metadata files           : Takeout/Google Fotos/Ana 34er/PXL_20230401_052119532.jpg.json: Title: PXL_20230401_052119532.jpg
Scanned                  : Takeout/Google Fotos/Ana 34er/PXL_20230401_062926045.jpg: 
Metadata files           : Takeout/Google Fotos/Ana 34er/PXL_20230401_062926045.jpg.json: Title: PXL_20230401_062926045.jpg
Scanned                  : Takeout/Google Fotos/Ana 34er/PXL_20230401_062929337.jpg: 
Metadata files           : Takeout/Google Fotos/Ana 34er/PXL_20230401_062929337.jpg.json: Title: PXL_20230401_062929337.jpg
Scanned                  : Takeout/Google Fotos/Ana 34er/PXL_20230401_063116440.jpg: 
Metadata files           : Takeout/Google Fotos/Ana 34er/PXL_20230401_063116440.jpg.json: Title: PXL_20230401_063116440.jpg
Scanned                  : Takeout/Google Fotos/Ana 34er/PXL_20230401_063119390.TS.mp4: 
Metadata files           : Takeout/Google Fotos/Ana 34er/PXL_20230401_063119390.TS.mp4.json: Title: PXL_20230401_063119390.TS.mp4
Scanned                  : Takeout/Google Fotos/Ana 34er/PXL_20230401_063637690.jpg: 
Metadata files           : Takeout/Google Fotos/Ana 34er/PXL_20230401_063637690.jpg.json: Title: PXL_20230401_063637690.jpg

...

Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_052119532.jpg: An asset with the same name:"PXL_20230401_052119532", date:"2023-04-01 07:21:19" and size:2.1 MB exists on the server. No need to upload.
Info                     : Takeout/Google Fotos/Ana 34er/PXL_20230401_052119532.jpg: Added to album: Ana 34er
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_063637690.jpg: PXL_20230401_063637690.jpg
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_063637690.jpg: An asset with the same name:"PXL_20230401_063637690", date:"2023-04-01 08:36:37" and size:2.8 MB exists on the server. No need to upload.
Info                     : Takeout/Google Fotos/Ana 34er/PXL_20230401_063637690.jpg: Added to album: Ana 34er
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_062926045.jpg: PXL_20230401_062926045.jpg
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_062926045.jpg: An asset with the same name:"PXL_20230401_062926045", date:"2023-04-01 08:29:26" and size:4.0 MB exists on the server. No need to upload.
Info                     : Takeout/Google Fotos/Ana 34er/PXL_20230401_062926045.jpg: Added to album: Ana 34er
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_063119390.TS.mp4: PXL_20230401_063119390.TS.mp4
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_063119390.TS.mp4: An asset with the same name:"PXL_20230401_063119390.TS", date:"2023-04-01 08:31:42" and size:20.8 MB exists on the server. No need to upload.
Info                     : Takeout/Google Fotos/Ana 34er/PXL_20230401_063119390.TS.mp4: Added to album: Ana 34er
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_052111882.jpg: PXL_20230401_052111882.jpg
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_052111882.jpg: already on the server
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_062929337.jpg: PXL_20230401_062929337.jpg
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_062929337.jpg: already on the server
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_063116440.jpg: PXL_20230401_063116440.jpg
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_063116440.jpg: already on the server
...
Managing albums
Update the album Ana 34er
...
Upload report:
103466 scanned files
103317 handled files
 48089 metadata files
     0 uploaded files on the server
     0 upgraded files on the server
 19345 duplicated files in the input
     9 files already on the server
     0 discarded files because in folder failed videos
 33649 discarded files because of options
     0 discarded files because server has a better image
  2225 files type not supported
     0 errors
   149 files without metadata file
2374 files can't be handled
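
As a sanity check, the counters in this report reconcile with each other. The short Python sketch below copies the numbers from the report above; how the counters are grouped is inferred from the arithmetic, not from the immich-go source, so treat the grouping as an assumption.

```python
# Counters copied from the upload report above.
report = {
    "scanned files": 103466,
    "handled files": 103317,
    "metadata files": 48089,
    "uploaded files on the server": 0,
    "upgraded files on the server": 0,
    "duplicated files in the input": 19345,
    "files already on the server": 9,
    "discarded files because in folder failed videos": 0,
    "discarded files because of options": 33649,
    "discarded files because server has a better image": 0,
    "files type not supported": 2225,
    "errors": 0,
    "files without metadata file": 149,
    "files can't be handled": 2374,
}

# Assumed grouping: these categories together make up the "handled" files.
HANDLED_PARTS = [
    "metadata files",
    "uploaded files on the server",
    "upgraded files on the server",
    "duplicated files in the input",
    "files already on the server",
    "discarded files because in folder failed videos",
    "discarded files because of options",
    "discarded files because server has a better image",
    "files type not supported",
]


def handled_total(r):
    """Sum of the categories assumed to make up 'handled files'."""
    return sum(r[k] for k in HANDLED_PARTS)


def unhandled_total(r):
    """Unsupported types plus files lacking a metadata file."""
    return r["files type not supported"] + r["files without metadata file"]


if __name__ == "__main__":
    print(handled_total(report), report["handled files"])
    print(report["scanned files"] - report["handled files"])
    print(unhandled_total(report), report["files can't be handled"])
```

Read this way, almost all of the 103466 scanned files are metadata files, input duplicates, or files excluded by the `-date=2023-04` option, and nothing was uploaded in this run.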

So the interesting ones are these:

...
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_052111882.jpg: already on the server
...
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_062929337.jpg: already on the server
...
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_063116440.jpg: already on the server
...

I can confirm that these images are in the file system (so they must have been uploaded), but Immich doesn't show them. Is this an issue with the upload or with Immich? Unfortunately i don't know if they had been in the file system before today's upload, but considering the log, i'd say so.

I have then deleted all pictures of that album including the album from Immich and the file system. I've started the script again, but now it only uploaded 3 images:

Uploaded                 : Takeout/Google Fotos/Ana 34er/PXL_20230401_052111882.jpg: PXL_20230401_052111882.jpg
Added to an album        : Takeout/Google Fotos/Ana 34er/PXL_20230401_052111882.jpg: Ana 34er
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_052119532.jpg: PXL_20230401_052119532.jpg
Uploaded                 : Takeout/Google Fotos/Ana 34er/PXL_20230401_052119532.jpg: PXL_20230401_052119532.jpg
Added to an album        : Takeout/Google Fotos/Ana 34er/PXL_20230401_052119532.jpg: Ana 34er
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_063116440.jpg: PXL_20230401_063116440.jpg
Uploaded                 : Takeout/Google Fotos/Ana 34er/PXL_20230401_063116440.jpg: PXL_20230401_063116440.jpg
Added to an album        : Takeout/Google Fotos/Ana 34er/PXL_20230401_063116440.jpg: Ana 34er
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_063119390.TS.mp4: PXL_20230401_063119390.TS.mp4
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_063119390.TS.mp4: already on the server
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_062926045.jpg: PXL_20230401_062926045.jpg
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_062926045.jpg: already on the server
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_062929337.jpg: PXL_20230401_062929337.jpg
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_062929337.jpg: already on the server
Local duplicate          : Takeout/Google Fotos/Photos from 2023/PXL_20230401_063637690.jpg: PXL_20230401_063637690.jpg
Server has photo         : Takeout/Google Fotos/Ana 34er/PXL_20230401_063637690.jpg: already on the server

Also on the filesystem, there are just 3 files now.

Probably unrelated to this problem, but i also get a lot of these:

File: Takeout/Google Fotos/Photos from 2021/PXL_20210103_152410869.MP
    File type not supported

Can you support copying .mp files?

And these:

File: Takeout/Google Fotos/xxxxx/original_448db788-ce38-479d-b6b4-2c1e0d5118d9_2(1).JPG
    File unhandled, missing JSON

Is it possible to use the folder name only for those images that don't have a JSON, but use the JSON as a primary source for metadata? That's photos that i didn't make myself, but uploaded manually to a Google Photos album.

simulot commented 11 months ago

Thank you for your report.

Files present in your server

There is an easy check: drop one of the missing files on the Immich web page. If it is accepted, this means there is a flaw in immich-go.

Immich-go starts its work by fetching the full list of the server's assets. It keeps in memory the title of each asset, its date of capture and its file size in bytes.

Then it scans for JSON files. For each of them, it tries to find the related asset file(s). The actual name and the date of capture are taken from the JSON, the file size from the files. Then the tuple File Name + Date of capture is searched among the server's assets, and the file sizes are compared. When name, date and size all match, immich-go declares that the server already has the file.
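
The check described above can be sketched in a few lines of Python. This is only an illustration of the name + date lookup followed by a size comparison; the `ServerAsset` class, the function names and the byte size used in the example are made up for the sketch and are not taken from immich-go.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ServerAsset:
    name: str        # asset title (hypothetical field names)
    taken: datetime  # date of capture
    size: int        # file size in bytes


def build_index(assets):
    """Index the server's assets by (name, date of capture)."""
    index = {}
    for asset in assets:
        index.setdefault((asset.name, asset.taken), []).append(asset.size)
    return index


def server_has(index, name, taken, size):
    """True when name, date of capture and size all match a server asset."""
    return size in index.get((name, taken), [])


if __name__ == "__main__":
    # The byte size below is invented for the example.
    on_server = [
        ServerAsset("PXL_20230401_052119532", datetime(2023, 4, 1, 7, 21, 19), 2_100_000),
    ]
    index = build_index(on_server)
    print(server_has(index, "PXL_20230401_052119532", datetime(2023, 4, 1, 7, 21, 19), 2_100_000))
    print(server_has(index, "PXL_20230401_052119532", datetime(2023, 4, 1, 7, 21, 19), 1_000_000))
```

A consequence of this design is that an asset the server still lists, even if it is not displayed, is treated as already present and is not uploaded again.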

Deleting files from the album

In Immich, you have to remove the images from the album, then delete the images, and then empty the trash. I have been bitten by this several times. Is that your case?

MP support

MP files are not recognized by the Immich server, and are useless for live photos. Issue #82 explains the problem.

Files without JSON

This is the trickiest part of the google photos import. I have tried 2 different approaches so far:

1- Scan assets (photos and movies), then search for a suitable JSON. But some albums have the JSON while the asset is in the Year folder. Albums were missing those assets after the upload.

2- Scan JSON files, then search for assets that could match each of them. It seems to be the reasonable thing to do. At least albums are complete.
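
A minimal Python sketch of the second approach, under simplifying assumptions: the asset name is derived by stripping `.json` from the JSON file name (the real tool reads the title from inside the JSON), and `members_json_first` is a hypothetical name. It shows why scanning JSON first keeps albums complete even when the asset itself only exists in the Year folder.

```python
import posixpath


def members_json_first(paths):
    """Map each folder to the assets its JSON files refer to.

    Simplification: a JSON named "x.jpg.json" is assumed to describe "x.jpg".
    """
    paths = list(paths)

    # Index every non-JSON file by its base name, across all folders.
    assets_by_name = {}
    for path in paths:
        if not path.lower().endswith(".json"):
            assets_by_name.setdefault(posixpath.basename(path), []).append(path)

    members = {}
    for path in paths:
        if not path.lower().endswith(".json"):
            continue
        folder, json_name = posixpath.split(path)
        candidates = assets_by_name.get(json_name[: -len(".json")], [])
        if not candidates:
            continue  # album-level metadata, or an asset that is missing
        # Prefer an asset in the same folder, else fall back to any other folder.
        same_folder = [c for c in candidates if posixpath.dirname(c) == folder]
        members.setdefault(folder, []).append((same_folder or candidates)[0])
    return members


if __name__ == "__main__":
    archive = [
        "Takeout/Google Fotos/Ana 34er/a.jpg.json",
        "Takeout/Google Fotos/Photos from 2023/a.jpg",
        "Takeout/Google Fotos/Photos from 2023/a.jpg.json",
    ]
    for folder, assets in members_json_first(archive).items():
        print(folder, "->", assets)
```

An asset-first scan of the same archive would never visit the album folder, because it contains no asset, which matches the incomplete albums described for the first approach.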

But the rules that link the JSON with the actual files aren't unambiguous. There are a few basic rules that work for most of them. But there are weird cases where the name is shorter by one char, or where images have been edited in Google Photos. And this becomes totally insane with long file names... Alas, the provided JSON doesn't give enough information for finding all files.

I have added a lot of rules to address basic cases, exceptions and corner cases. Sometimes the rule for handling the exception of the exception interferes with a basic one, leading to a wrong association.

I'm wondering if I should give up finding the json for them, and just upload those files.

c0delama commented 11 months ago

ad) Files present in your server It might be an issue with Immich and not with the upload after all. The problem is, that i can't see those files in Immich. When i manually upload the photos, i'll end up with the same 5 (instead of 7) photos again. Interestingly the album claims that it contains 7 photos, while just 5 are shown.

ad) Files present in your server Yes, i have deleted them in Immich, emptied the trash can, deleted the album and removed the images from the file system (from the library folder).

ad) MP support If .mp is simply changed to .mp4, Immich recognises them as short videos. Maybe that could be worth a try?

ad) Files without JSON To me, not uploading them is not a solution. The images are there for a reason. Regarding the rules, it sounds a bit as if you've started coding without laying out a plan first. Instead of patching your existing code, maybe it would make sense to take a step back and look at it from a distance, come up with pseudo code and only then implement it, e.g.

- For every file:
-- does it already exist in the map?
---- no -> use the name without extension as a map key
-- if it is an image/video
---- is image/video path already stored?
------ no -> store image/video path
------ yes -> skip (here you could also change the path if the image is better?!)
-- if it is a json
---- is json path already stored?
------ no -> store json path
------ yes -> skip
-- is a metadata.json available
---- yes -> store album name

- For every entry in the map:
-- is album name stored?
---- no -> use folder name

- For every entry in the map:
-- upload to immich

I'm aware that in reality it is more complex, but couldn't it work if you lay out such an approach?
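
The outline above can be turned into a small runnable Python sketch. Everything here is illustrative: `build_map`, the `MEDIA_EXT` subset and the `album_titles` argument (folder path mapped to the title read from that folder's metadata file) are assumptions made for the example, and the upload step is left out.

```python
import posixpath

MEDIA_EXT = {".jpg", ".jpeg", ".png", ".mp4"}  # illustrative subset only


def build_map(paths, album_titles):
    """Build {name without extension: {"media", "json", "album"}}.

    album_titles maps a folder path to the album title found in that
    folder's metadata file (assumed to be parsed elsewhere).
    """
    entries = {}
    for path in paths:
        folder, name = posixpath.split(path)
        stem, ext = posixpath.splitext(name)
        is_json = ext.lower() == ".json"
        if is_json:
            # "x.jpg.json" -> key "x"; JSON files that do not wrap a media
            # name (such as the album metadata file) are not sidecars.
            stem, inner_ext = posixpath.splitext(stem)
            if inner_ext.lower() not in MEDIA_EXT:
                continue
        elif ext.lower() not in MEDIA_EXT:
            continue  # unsupported file type

        entry = entries.setdefault(stem, {"media": None, "json": None, "album": None})
        slot = "json" if is_json else "media"
        if entry[slot] is None:  # first path wins, later duplicates are skipped
            entry[slot] = path
        if entry["album"] is None and folder in album_titles:
            entry["album"] = album_titles[folder]

    # Fall back to the folder name when no album title was stored.
    for entry in entries.values():
        if entry["album"] is None and entry["media"] is not None:
            entry["album"] = posixpath.basename(posixpath.dirname(entry["media"]))
    return entries


if __name__ == "__main__":
    files = [
        "T/Ana 34er/PXL_1.jpg",
        "T/Ana 34er/PXL_1.jpg.json",
        "T/Ana 34er/Metadaten.json",
        "T/Photos from 2023/PXL_1.jpg",
        "T/Photos from 2023/PXL_2.jpg",
    ]
    for key, entry in build_map(files, {"T/Ana 34er": "Ana 34er"}).items():
        print(key, entry)
```

Note that keying on the bare name inherits a weakness already seen in this thread: two different photos that share a base name collapse into a single entry.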

simulot commented 11 months ago

ad) Files present in your server It might be an issue with Immich and not with the upload after all. The problem is, that i can't see those files in Immich. When i manually upload the photos, i'll end up with the same 5 (instead of 7) photos again. Interestingly the album claims that it contains 7 photos, while just 5 are shown.

Have you tried the repair procedure ? (admin page)

ad) Files present in your server Yes, i have deleted them in Immich, emptied the trash can, deleted the album and removed the images from the file system (from the library folder).

The movie is actually embedded in the JPG, and Immich uses it to animate the photo. We can safely ignore the .mp files.

ad) Files without JSON To me, not uploading them is not a solution. The images are there for a reason. Regarding the rules, it sounds a bit as if you've started coding without laying out a plan first. Instead of patching your existing code, maybe it would make sense to take a step back and look at it from a distance, come up with pseudo code and only then implement it, e.g.

This is probably true for the first lines of code. The problem looks simple: folders full of images and JSON files.

Until you discover what a mess the takeout archive is. Here are some of the problems with takeout archives:

Big archives are split into several zip files, and related files may reside in different files.

Same photos are in Year folder and album folders... but not always...

Your iPhone saves images as IMG_NNNN.JPG... and the counter restarts at 0001 after 9999 images. You end up with 2 images with the same name to be saved into the same Year folder: IMG_0565.JPG and IMG_0565.JPG.JSON for the first, and IMG_0565(1).JPG and IMG_0565.JPG(1).JSON for the second...
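
The counter ends up in a different position in the two names, so a matcher has to move it. A minimal Python sketch of that single rule follows; the function name and the regular expression are written for this example and are not taken from immich-go.

```python
import re

# Matches sidecar names such as "IMG_0565.JPG(1).JSON":
# stem, media extension, "(n)" counter, then ".json".
_COUNTER_SIDECAR = re.compile(
    r"^(?P<stem>.+)(?P<ext>\.[^.()]+)(?P<n>\(\d+\))\.json$", re.IGNORECASE
)


def asset_name_for_sidecar(sidecar: str):
    """Return the asset file name a JSON sidecar is expected to describe."""
    match = _COUNTER_SIDECAR.match(sidecar)
    if match:
        # Move the counter in front of the media extension.
        return f"{match['stem']}{match['n']}{match['ext']}"
    if sidecar.lower().endswith(".json"):
        return sidecar[: -len(".json")]
    return None  # not a JSON file


if __name__ == "__main__":
    print(asset_name_for_sidecar("IMG_0565.JPG.JSON"))
    print(asset_name_for_sidecar("IMG_0565.JPG(1).JSON"))
```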

File names are changed to avoid characters that are forbidden on certain platforms. The real names are in the JSON. File names are shortened to a limit of 46 UTF-16 chars (I like this one!), while file names are encoded in UTF-8 in the zip archive. But the JSON name is one UTF-16 char shorter than the image name. If the last char is an emoji...

Sometimes the JSON has a shortened extension: longfilename.jp.json is associated with longfilename.jpg.
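
Two small helpers illustrate these length effects in Python. `utf16_len` counts UTF-16 code units, which is why an emoji counts as two chars, and `sidecar_may_match` is a hypothetical prefix test that tolerates a JSON name a little shorter than the asset name. The `max_missing` tolerance is an assumption for the sketch, not a documented Google Takeout rule.

```python
def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


def sidecar_may_match(sidecar: str, asset: str, max_missing: int = 1) -> bool:
    """True when the JSON name (minus ".json") is a prefix of the asset name
    and is at most max_missing UTF-16 code units shorter."""
    if not sidecar.lower().endswith(".json"):
        return False
    prefix = sidecar[: -len(".json")]
    if not asset.startswith(prefix):
        return False
    return utf16_len(asset) - utf16_len(prefix) <= max_missing


if __name__ == "__main__":
    print(utf16_len("IMG_0565.JPG"), utf16_len("\U0001F600"))
    print(sidecar_may_match("longfilename.jp.json", "longfilename.jpg"))
```

A prefix test like this is deliberately loose, so it should only be tried after an exact match has failed.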

One user has 400+ photos of his wedding, all named like: Backyard_ceremony_wedding_photography_johndoe_photostudios-NNN.jpg. They have all been shortened into Backyard_ceremony_wedding_photography_j(XXX).jpg, where XXX isn't the NNN of the correct matching name in the JSON's title. Note that this doesn't follow the rule for iPhone names...

Even metadata.json has problems. In German it is metadaten.json, in Spanish metadades.json, in Galician metadatos.json, and I have no idea what it is in Korean or Hungarian...

Untitled albums are named "Sans Titre" in French, "Senza titolo" in Italian...

Edited images are suffixed with -modifié in French... and share the same JSON as the original one...
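
One way to keep these language-dependent names manageable is to hold them in lookup tables. The Python sketch below only contains the values mentioned in this thread; the table and function names are hypothetical, and a real list would need many more languages.

```python
import posixpath

# Values collected from this thread only; far from exhaustive.
ALBUM_METADATA_NAMES = {"metadata.json", "metadaten.json", "metadades.json", "metadatos.json"}
UNTITLED_TITLES = {"sans titre", "senza titolo"}
EDITED_SUFFIXES = ("-modifié",)


def is_album_metadata(path: str) -> bool:
    """True when the file name is a known album-level metadata file."""
    return posixpath.basename(path).lower() in ALBUM_METADATA_NAMES


def is_untitled_album(title: str) -> bool:
    """True when the album title is a known localized 'untitled' label."""
    return title.strip().lower() in UNTITLED_TITLES


def original_of_edited(filename: str):
    """Return the original's file name if filename carries a known
    'edited' suffix, else None."""
    stem, ext = posixpath.splitext(filename)
    for suffix in EDITED_SUFFIXES:
        if stem.endswith(suffix):
            return stem[: -len(suffix)] + ext
    return None


if __name__ == "__main__":
    print(is_album_metadata("Takeout/Google Fotos/Ana 34er/Metadaten.json"))
    print(is_untitled_album("Sans Titre"))
    print(original_of_edited("PXL_20220405_090123740.PORTRAIT-modifié.jpg"))
```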

And there are few more of that kind.

I have completely rewritten the google photo code twice since the 1st version. Each time was a great improvement, but there are still problems. Remember that takeout files don't provide all we need for doing the job.

I'll be happy to merge a pull request that fixes all of them.

simulot commented 11 months ago

@c0delama it took some time, but I believe I have fixed it.

c0delama commented 11 months ago

I'm sorry for not being able to answer earlier.

Have you tried the repair procedure ? (admin page)

Nice one! I didn't see that before. Unfortunately it didn't help in my situation.

The movie is actually embedded in the JPG, and Immich uses it to animate the photo. We can safely ignore the .mp files.

Cool, i didn't know that. Thanks!

This is probably true for the first lines of code. The problem looks simple:

I didn't mean to make it look simple. I fully acknowledge that this is a complex problem and you're doing a great job! What maybe could improve the process is an interactive mode that lets the user help you find the best strategy for their respective library, e.g. names of metadata files, choosing an album name if ambiguous, maybe even confirming whether XXX and NNN refer to the same image. It is a tedious process, especially for big libraries, but since users go through the work of self-hosting Immich, they might also be patient enough to answer questions about their library if the result is an improved migration. It is just an idea, and the "interactive mode" could always be just an additional option to the already existing "automatic mode".

Awesome that you have fixed it! I've just completely nuked my Immich instance and started a new migration with the newest version of immich-go. It always takes a day, but i'll let you know the result when it's done.

c0delama commented 11 months ago

The import worked much better this time, great 🙌

simulot commented 11 months ago

Google takeout files are full of traps...

Let me know how photos from other users are stored in the takeout file.

The album name is determined by the title field of the metadata.json file in the album folder. If you have 2 albums with the same title, their photos will be assigned to the same album in Immich. But there is an option for that.
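
A small Python sketch of the merging effect described above. `group_albums` and its `use_folder_name` switch are hypothetical; the switch mirrors the idea behind the `-use-album-folder-as-name` flag seen in the first command of this thread, but whether that is the option meant here is an assumption.

```python
import posixpath
from collections import defaultdict


def group_albums(folders, use_folder_name=False):
    """Group assets into albums.

    folders maps a folder path to (title from its metadata file, asset names).
    By default albums are keyed by title, so equal titles are merged.
    """
    albums = defaultdict(list)
    for folder, (title, assets) in folders.items():
        key = posixpath.basename(folder) if use_folder_name else title
        albums[key].extend(assets)
    return dict(albums)


if __name__ == "__main__":
    takeout = {
        "Takeout/Google Photos/Trip": ("Holiday", ["a.jpg"]),
        "Takeout/Google Photos/Trip(1)": ("Holiday", ["b.jpg"]),
    }
    print(group_albums(takeout))
    print(group_albums(takeout, use_folder_name=True))
```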

JackBailey commented 11 months ago

Looks like way more of my photos are imported now. Thank you very much for your hard work.

simulot commented 11 months ago

I'd be interested to know which photos are still missed by immich-go.