immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0

[BUG] Android App duplicate with external library #4413

Closed toxic0berliner closed 12 months ago

toxic0berliner commented 1 year ago

The bug

Given the warning not to use Immich as the sole backup app for your pictures, I still use an external app that backs up all my pictures from my Android phone to my NAS. I just moved from a custom importer script to the external library feature.

But now, Immich no longer recognizes that the same picture exists on both my phone and the server. I get a duplicate for each picture: one with a cloud-only icon for the copy on the server, and one with a crossed-out cloud for the copy on my phone.

In the past I used to get proper deduplication, with a single picture and a checkmark inside the little cloud icon.

Maybe something broke and external libraries are no longer matched against the local Android pictures?

The OS that Immich Server is running on

Docker image running on ubuntu 22.04

Version of Immich Server

v1.81.1

Version of Immich Mobile App

1.80.0 build.104

Platform with the issue

Your docker-compose.yml content

version: "3.8"

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: [ "start.sh", "immich" ]
    volumes:
      - immich-upload:/usr/src/app/upload
      - orion-photo:/mnt/orion/photo
    env_file:
      - stack.env
    depends_on:
      - redis
      - database
      - typesense
    restart: always
    networks:
      immichnet:
        aliases: 
          - immich-server

  immich-microservices:
    container_name: immich_microservices
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: [ "start.sh", "microservices" ]
    volumes:
      - immich-upload:/usr/src/app/upload
      - orion-photo:/mnt/orion/photo
    env_file:
      - stack.env
    depends_on:
      - redis
      - database
      - typesense
    restart: always
    networks:
      immichnet:
        aliases: 
          - immich-microservices

  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    volumes:
      - model-cache:/cache
    env_file:
      - stack.env
    restart: always
    networks:
      immichnet:
        aliases: 
          - immich-machine-learning

  immich-web:
    container_name: immich_web
    image: ghcr.io/immich-app/immich-web:${IMMICH_VERSION:-release}
    env_file:
      - stack.env
    restart: always
    networks:
      immichnet:
        aliases: 
          - immich-web

  typesense:
    container_name: immich_typesense
    image: typesense/typesense:0.24.1@sha256:9bcff2b829f12074426ca044b56160ca9d777a0c488303469143dd9f8259d4dd
    environment:
      - TYPESENSE_API_KEY=${TYPESENSE_API_KEY}
      - TYPESENSE_DATA_DIR=/data
    logging:
      driver: none
    volumes:
      - tsdata:/data
    restart: always
    networks:
      immichnet:
        aliases: 
          - typesense

  redis:
    container_name: immich_redis
    image: redis:6.2-alpine@sha256:70a7a5b641117670beae0d80658430853896b5ef269ccf00d1827427e3263fa3
    restart: always
    networks:
      immichnet:
        aliases: 
          - redis

  database:
    container_name: immich_postgres
    image: postgres:14-alpine@sha256:28407a9961e76f2d285dc6991e8e48893503cc3836a4755bbc2d40bcc272a441
    env_file:
      - stack.env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      PG_DATA: /var/lib/postgresql/data
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always
    networks:
      immichnet:
        aliases: 
          - database

  photo:
    container_name: immich_proxy
    image: ghcr.io/immich-app/immich-proxy:${IMMICH_VERSION:-release}
    labels:
      traefik.http.services.photo.loadbalancer.server.port: 8080
      traefik.docker.network: traefiknetwork
      subdomain: photo
    environment:
      # Make sure these values get passed through from the env file
      - IMMICH_SERVER_URL=${IMMICH_SERVER_URL}
      - IMMICH_WEB_URL=${IMMICH_WEB_URL}
    #ports:
    #  - 2283:8080
    depends_on:
      - immich-server
      - immich-web
    restart: always
    networks:
      traefiknetwork:
        aliases: 
          - photo
      immichnet:
        aliases: 
          - immich_proxy

volumes:
  pgdata:
    driver: local-persist
    driver_opts:
      mountpoint: ${BASE_VOLUMES}/${STACKNAME}/pgdata
  model-cache:
    driver: local-persist
    driver_opts:
      mountpoint: ${BASE_VOLUMES}/${STACKNAME}/model-cache
  tsdata:
    driver: local-persist
    driver_opts:
      mountpoint: ${BASE_VOLUMES}/${STACKNAME}/tsdata
  orion-photo:
    driver: local-persist
    driver_opts:
      mountpoint: ${BASE_ORION}/photo
  immich-upload:
    driver: local-persist
    driver_opts:
      # mountpoint: ${BASE_ORION}/photo/immich
      mountpoint: ${BASE_ORION}/docker/volumes/immich
networks:
  traefiknetwork:
    name: traefiknetwork
    driver: bridge
    external: true
  immichnet:
    name: immichnet
    driver: bridge
    external: false
    attachable: true

Your .env content

STACKNAME=photo
BASE_VOLUMES=/var/lib/docker/local-persist
BASE_ORION=/mnt/orion
PUID=5678
PGID=100
TZ=Europe/Paris
UMASK=0
LOCAL_NETWORK=192.168.0.0/16
REALHOST=myhostname
DB_HOSTNAME=immich_postgres
DB_USERNAME=myuser
DB_PASSWORD=mypassword
DB_DATABASE_NAME=immich
REDIS_HOSTNAME=immich_redis
UPLOAD_LOCATION=${BASE_ORION}/photo/immich
TYPESENSE_API_KEY=myAPIKey
PUBLIC_LOGIN_PAGE_MESSAGE=
IMMICH_WEB_URL=http://immich-web:3000
IMMICH_SERVER_URL=http://immich-server:3001
IMMICH_MACHINE_LEARNING_URL=http://immich-machine-learning:3003
LOG_LEVEL=debug

Reproduction steps

0. Back up your Android pictures to the future external library folder.
1. Spin up the stack, add a user and the external library, and let it discover all pictures.
2. Start the Android app, log in, and let it scan local pictures.
3. All pictures are shown twice.

Additional information

No response

alextran1502 commented 1 year ago

I think this is not the intended use case. The external library is meant for existing collections, while uploaded assets go into the default library.

toxic0berliner commented 1 year ago

Damn, it would be a bit sad if that's the case. I trashed my previous install... 50k pictures with many faces take over 3 days to scan, and it took several weeks to ignore the over 40k faces and rename all my friends....

I'm really not sure I'm ready to, or even should, move everything to the Immich primary library... Is it really difficult to add the dedup algorithm to external libraries?

It was working fine in the past with my custom script that imported into the library with an external path... But that started to fail as well in mid-September (not importing new ones), so I thought the external library would be best.

Even if I were to switch to Immich as my primary app, including for backup, I have over 250 GB of pictures on my phone, and I'm not really looking forward to moving them on my NAS from where they are into Immich....

toxic0berliner commented 1 year ago

I tried not granting the app permission to access Android pictures, but it keeps asking. So I can't use an external library at all as long as there is any overlap with the content of the phone, and I can't use the app without it seeing the local Android pictures... That makes it unusable for me. I'm thankfully not your only user and you don't really need me, sure, but I fail to see why the external library shouldn't be treated as the primary library when a picture on the phone is already on the server in an external library...

I was liking the face recognition, places, timeline, and overall swiftness of the UI. I can't believe I'm the only one with such a need, but I'm also not ready to fork or PR a fix myself, as I'm a bad dev, so I hope I can convince you 😁

alextran1502 commented 1 year ago

I am not sure what you are trying to achieve, from my POV you can

toxic0berliner commented 1 year ago

I have 250 GB of pictures already on my phone and already on the NAS where I run Immich. I'm just trying to use Immich without moving all my existing pictures. The NAS also stores some pictures and movies that I have since removed from my phone. So ideally I'd import all existing files AND enable backup, all into the primary library, but that would mean moving or duplicating over 500 GB of pictures and videos...

So I'd really like instead to keep the existing files where they are and not enable the backup, since the one I already have works fine, but still be able to use Immich to see and analyze all my pictures and share them with friends.

This is why I would need the external library AND the Android photos to work together and not show up twice; otherwise I won't use Immich on my phone, won't invest time in "maintaining" it, and ultimately it'll end up fully unused.

mattjmeier commented 1 year ago

I think this is an important issue. I am also experiencing it (while loving Immich overall!) and fully agree.

I'm sure many people have duplicated photos in their external libraries for a variety of reasons. Some of those reasons may be vestigial or even superfluous; in my personal case, even the result of laziness.

Obviously, there are other deduplication methods that could take care of things like the duplicated folders. But for people with larger photo collections (mine is ~100k), that is a lot to manage and go through. I love the idea of having the Immich UI put all the photos into a timeline for me without too much intervention. It is working so incredibly well!!

As I have pointed out (https://github.com/immich-app/immich/discussions/4240#discussioncomment-7180105) I think there is a relatively simple solution to this: don't display two images in the timeline that share the same file checksum. Why would this ever be the desired behavior? If they are identical images, then I am confident that no one would want them displayed adjacent to each other in the timeline. If there are reasons someone would want this, I am very curious to hear it.

How could a solution be implemented? I propose that they could either be considered a type of 'stack' (i.e., keep the assets tracked separately, but displayed as one), or alternatively, subjected to the same checksum searching that already applies to the "Upload" library (i.e., consider it a single asset). The former option could give users more flexibility, the latter may be easier to implement.
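The 'stack' option above can be sketched roughly as follows. This is illustrative Python only, not Immich's actual implementation (which is TypeScript); the asset dictionaries, field names, and function names are invented for the example:

```python
import hashlib
from collections import defaultdict

def sha1_of_file(path: str) -> str:
    """Compute the SHA-1 checksum of a file, reading in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def stack_by_checksum(assets):
    """Group assets that share a checksum. The first asset in each
    group becomes the visible 'primary'; the rest stay tracked in
    the database but are hidden from the timeline."""
    groups = defaultdict(list)
    for asset in assets:
        groups[asset["checksum"]].append(asset)
    timeline = []
    for checksum, members in groups.items():
        primary, *hidden = members
        timeline.append({"primary": primary, "hidden": hidden})
    return timeline
```

A unique-checksum group with one member just displays normally with an empty `hidden` list, so non-duplicated assets are unaffected.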

I love Immich and hope to continue using it! I really feel strongly about this though. I would be willing to help out with a PR, although the learning curve would be really steep for me as I am not familiar with the languages used in Immich.

Thanks for everyone's continued efforts on this amazing project!!

jrasm91 commented 1 year ago

Libraries don't currently use checksums, since the file system is the "source of truth" and there is a significant negative performance impact to generating hashes for large libraries. Even if we had them, checksums have to be unique in the database, and you'd still have the complexity of deciding which file to keep and which to ignore, how to handle that on rescan or file moves, etc. There are also higher priorities for libraries, like automatic album creation. Long story short: this probably won't be addressed anytime soon, and you are better off using a proper dedupe tool instead.

mattjmeier commented 1 year ago

Ahh, thanks for the insight and taking the time to reply.

So if I understand correctly, the upload library is specially designated to calculate the sha1 hash for the assets in it, but external libraries are not.

The part I am not understanding is how the resources required would be any different if I uploaded 100k photos from my phone. If I did this, hypothetically, the hashes would be calculated and presumably recorded in the db. But this isn't possible for the external libraries?

And I guess what you are saying about being unique in the database means that two assets cannot share a checksum because it's a primary key. This makes sense*. I suppose it would make sense to me intuitively that two duplicate photos (with the same checksum) could be represented by a single asset in the database (since it essentially is). Perhaps it would also start to violate other rules about fields in the database - e.g., can't have more than one file path per asset, likely? I can see how problems would start to pile up.

I can also definitely understand that people running this on a raspberry pi wouldn't find it desirable to run checksum calculations for days on end.

I'm curious how photoprism implements this feature (https://docs.photoprism.app/user-guide/library/duplicates/ - they are checking sha1 for every file on import to detect). It is one of the few things it does better - automatically stacking assets when it makes sense to do so (i.e., raw + jpg version; identical images; etc.). I understand this is getting outside the scope of what Immich was designed to do. It's just that it's so awesome at doing everything else it is so tempting to integrate this feature.

I also share @toxic0berliner's concerns regarding dropping other backup methods. I am currently using Nextcloud for auto backups from mobile. I would be happy to lose this method, but it works and is stable for now. So, perhaps something for the future.

I get the impression that there are many users facing the same issue though, because a lot of people are going to be using external libraries like this, and many people WILL have duplicates as I've described, and many will have other methods of backups too. I'm not trying to put more on the current developers' shoulders, just sharing my experience.

I still come back to the same question: why would any user want duplicate images sharing a sha1 hash displayed in the timeline? It seems as simple (ha... I know, is it ever simple) as offering the option to calculate hashes, recording them in a table in the database, and picking one as the primary asset to display and generate thumbs for (the first one by mtime? it literally doesn't matter).

*(EDIT: actually I'm not sure anymore how this is possible, because I do have duplicates in the timeline, meaning they would have the same hash... I obviously do not have a good grasp of how this is all working in the back end, although it's clear that hashes are not calculated for both duplicates)

jrasm91 commented 1 year ago

External libraries are quite different than upload libraries and we have separate implementations, which reflect each use case.

Upload libraries have immich as the source of truth and it manages creating and deleting files and deduping them.

External libraries have the file system as the source of truth, so we leave creating, deleting, and deduping files to the user. Deduping has different semantics in this context, and the implementation would be quite different. We realized that skipping hashing makes importing an external library significantly faster, so we didn't add it.

It is not to say hashing and other dedupe checks cannot be done; it is more that it is not as trivial as it seems, and specifically because there were benefits to excluding it (a simpler implementation) we didn't include it originally.

Checksum is a required field, but for external library files the value is just a hash of the file path instead.
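The difference between the two checksum strategies can be illustrated with a small sketch (Python for brevity; not the actual Immich code, and the function names are invented — the point is only content hash vs. path hash):

```python
import hashlib

def upload_checksum(path: str) -> str:
    """Upload library style: checksum of the file *contents*.
    Identical photos at different paths collide, enabling dedupe,
    but every byte of every file must be read."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def external_checksum(path: str) -> str:
    """External library style: checksum of the file *path*.
    Cheap to compute (no file I/O), which makes import fast,
    but identical contents at two paths never collide, so
    duplicates go undetected."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()
```

This is exactly why the duplicates in this issue appear: the asset uploaded from the phone has a content hash, while the same photo in the external library has a path hash, and the two never match.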

jrasm91 commented 1 year ago

I don't think any user wants duplicates in their external libraries, but they do want external libraries and they got them sooner at the expense of no dedupe checking.

mattjmeier commented 1 year ago

Totally fair! Happy to have it, because that is what drew me in as a user.

Pre-existing duplicates I agree are a separate problem with no easy answer. It was just a surprise that backing my photos up through a separate mechanism (which is indeed recommended upfront in the documentation) results in duplicate uploads from the mobile app to my library.

I would be really interested to learn more about how the current implementation works to check duplicates against images in the upload_location but the code base is massive and I didn't have any luck trying to search on my own. Any pointers on where to look?

Side note: why not use md5 rather than sha1 since it's a bit less computationally expensive? (EDIT: I guess the speed is fairly comparable, but you get more bits from sha1...)

jrasm91 commented 1 year ago

> Pre-existing duplicates I agree are a separate problem with no easy answer. It was just a surprise that backing my photos up through a separate mechanism (which is indeed recommended upfront in the documentation) results in duplicate uploads from the mobile app to my library.

Honestly, there seem to be two main types of users using Immich right now:

  1. I want immich to backup and organize my photos for me.
  2. I have my own collection of photos I'll give you read only access to them, don't touch them.

Immich was originally designed to work exactly like Google Photos. With Google Photos you don't have option 2 available in the first place. But there are lots of people looking for self-hosted photos with use case 2 in mind, so libraries were added (after the fact) to accommodate that user group. Upload libraries are really for group one, and external libraries are really for group two.

While we want to support more use cases, photo management software is indeed complicated. I'd say that, currently at least, using the upload library and external libraries in tandem is not a great experience, and I think most people are only using one or the other right now. I'm sure it will improve in the future, but it is a current limitation. It's still unclear exactly how they should/will be integrated in the future. There are talks of migrating "partner sharing" to be library based and other stuff like that.

> I would be really interested to learn more about how the current implementation works to check duplicates against images in the upload_location but the code base is massive and I didn't have any luck trying to search on my own. Any pointers on where to look?

> Side note: why not use md5 rather than sha1 since it's a bit less computationally expensive? (EDIT: I guess the speed is fairly comparable, but you get more bits from sha1...)

Long story short, it is the version Alex picked when he started building, probably because he is not a crypto expert and just made a decision and moved on. By the time more contributors started working on the project, sha1 was already widely incorporated, and it would take a fair bit of effort to migrate to another algorithm. The benefits of migrating were simply not worth the time and effort. Basically, migrating has minimal impact on the users of the system but delays other, more critical features that we've decided to build instead. So: do you want to migrate to md5, or get a better search system, a stacked photos implementation, a more robust dedupe implementation, automatic albums for external libraries, etc.? We've decided those features are more important than the hashing algorithm. Sha1 is still pretty performant, and on some hardware it has dedicated CPU instructions.
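The speed question is easy to check empirically. Here is an illustrative micro-benchmark using Python's `hashlib` (not anything from the Immich codebase, which hashes in Node.js; absolute numbers depend on the machine and OpenSSL build, but it shows how to compare the two, and why md5's edge, if any, is small):

```python
import hashlib
import time

def bench(algorithm: str, payload: bytes, rounds: int = 50) -> float:
    """Return seconds taken to hash `payload` `rounds` times."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(algorithm, payload).hexdigest()
    return time.perf_counter() - start

# 8 MiB of zeros stands in for a typical photo file
payload = b"\x00" * (8 * 1024 * 1024)
for algo in ("md5", "sha1", "sha256"):
    print(f"{algo}: {bench(algo, payload):.3f}s")
```

Note also that sha1 produces a 160-bit digest versus md5's 128 bits, which is the "more bits" point from the side note above.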

mattjmeier commented 1 year ago

Thanks so much for all the details. I really appreciate you taking the time! I understand the nuances a lot better now.

I would place myself somewhere between 1 and 2... I do want Immich to be my mobile backup and organization/UI/sharing solution (i.e., a replacement for Google Photos, obviously), but I also have a large collection of photos, and I like the granularity of being able to provide various volumes across various physical locations without worrying about the app destroying my collection while it is in development. I would happily accept a longer processing time to get duplicate detection (but I have a reasonably powerful server for this, which many users might not).

I guess the solution for me is to disable the Immich mobile upload entirely until there is progress on this front and rely on 3rd-party tools, then clean up the existing duplicates as required, which is easy enough to do (and well worth the effort to keep using this excellent application). I suppose that will work; thanks for helping me reach that conclusion. Hopefully this discussion helps others too.

I'm happy to continue the discussion if I think of anything productive.

alextran1502 commented 1 year ago

Thanks, @mattjmeier and @jrasm91, for a very productive conversation.

jrasm91 commented 1 year ago

I think that sounds like a good solution in the interim while we continue to work out the kinks around libraries and figure out how to tackle your use case. Thanks for being understanding as well, it is refreshing :pray:.

jrasm91 commented 1 year ago

I think adding an optional feature for "library hashing" could be something we look at in the future as well.

alextran1502 commented 12 months ago

Converting this to a discussion/feature request, as this is not a bug but the current intended behavior. Future optimization might address this issue.