ivandokov / phockup

Media sorting tool to organize photos and videos from your camera in folders by year, month and day.
MIT License

Make duplicate file assignment deterministic. #222

Open MiningMarsh opened 7 months ago

MiningMarsh commented 7 months ago

I've been using phockup to organize a set of camera photos and videos, which I then copy into a flat file structure. I then synchronize both the flat and the organized directories back against the source files using rsync. Lastly, I synchronize both of these directories to my phone's camera roll with FolderSync for Android.

I've noticed that some files get re-synchronized on every single transfer. rsync flags each of these files due to checksum and size differences, and when I compare two example copies, they differ by a single byte:

$ cmp /var/{tmp,share}/camera/20230613-174230-2.mp4
/var/tmp/camera/20230613-174230-2.mp4 /var/share/camera/20230613-174230-2.mp4 differ: byte 807014, line 3188

Tracking this down a bit more, I've noticed it only seems to happen to files that phockup detects as having the same timestamp, i.e. the ones that get a -# suffix appended to their filenames. Some of my files genuinely are different yet share a timestamp. As far as I can tell, the order in which phockup assigns the discriminating numbers to such duplicates is not stable: on one run a given file might be assigned -2, and on the next run a different file might be assigned -2.

Would it be possible to make the ordering of identified duplicates stable, so that the contents at a given path don't change between runs? I can work around this on my end by removing the duplicates, but it seems suboptimal that phockup can cause files to swap places on every run.
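For illustration, here is a minimal sketch of the kind of stable assignment I have in mind; the function and paths are hypothetical, not phockup's actual internals. Sorting the colliding sources by path before numbering them means the same file gets the same suffix on every run:

```python
import os

def assign_suffixes(target_name, candidates):
    # `candidates` are source paths that all collide on `target_name`.
    # Sorting them first makes the -2, -3, ... numbering deterministic.
    root, ext = os.path.splitext(target_name)
    mapping = {}
    for index, source in enumerate(sorted(candidates)):
        if index == 0:
            mapping[source] = target_name
        else:
            mapping[source] = f"{root}-{index + 1}{ext}"
    return mapping

print(assign_suffixes(
    "20230613-174230.mp4",
    ["/src/b/VID_0001.mp4", "/src/a/VID_0042.mp4"],
))
# {'/src/a/VID_0042.mp4': '20230613-174230.mp4',
#  '/src/b/VID_0001.mp4': '20230613-174230-2.mp4'}
```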

rob-miller commented 7 months ago

Are you using concurrency?

MiningMarsh commented 7 months ago

I did enable concurrency, 32 cores.

rob-miller commented 7 months ago

As @ivandokov has not commented, could you try without concurrency and see if the issue remains? I'm also curious how much of a speed impact running without it has for you.
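To illustrate why I'm asking: with a worker pool, the order in which files finish processing varies from run to run, so any numbering that depends on completion order can flip between runs. A generic sketch of the effect (not phockup's actual traversal code):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

def process(path):
    time.sleep(random.uniform(0, 0.01))  # simulate variable I/O time
    return path

files = [f"IMG_{i:04d}.jpg" for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(process, f) for f in files]
    # Completion order differs between runs; if -2/-3 suffixes were
    # assigned in this order, the numbering would not be stable.
    print([f.result() for f in as_completed(futures)])
```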

MiningMarsh commented 7 months ago

I can't test the stability issue right at the moment (I already removed the duplicates to resolve the issue on my end, since the constant re-syncing was causing me problems, so I'll need to create some test data and get back to you), but I can give you the speed difference right now. This is on a Ryzen 9 5950X processor.

With 32 cores:

[2023-11-18 20:02:35] - [INFO] - Processed 1796 files in 40.08 seconds. Average Throughput: 44.81 files/second
[2023-11-18 20:02:35] - [INFO] - Copied 1796 files.

With 1 core:

[2023-11-18 20:10:47] - [INFO] - Processed 1796 files in 306.53 seconds. Average Throughput: 5.86 files/second
[2023-11-18 20:10:47] - [INFO] - Copied 1796 files.
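That works out to roughly a 7.6x speedup from concurrency (44.81 / 5.86 ≈ 7.6, consistent with 306.53 s / 40.08 s ≈ 7.6).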

I'll try and test the issue without concurrency tomorrow.

ivandokov commented 6 months ago

The way the folders are traversed with concurrency is causing this issue. Unfortunately, I didn't build this feature and I am not really sure how to fix it.
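A possible direction, untested and using hypothetical helpers rather than phockup's actual internals, would be to resolve the name collisions serially over a sorted file list and parallelize only the copying, so worker scheduling can no longer affect which file gets which suffix:

```python
from concurrent.futures import ProcessPoolExecutor
import shutil

def plan(sources, target_for):
    # Number collisions in sorted-source order so the mapping is
    # identical on every run, regardless of worker scheduling.
    seen = {}
    mapping = {}
    for source in sorted(sources):
        target = target_for(source)  # e.g. the timestamp-based name
        n = seen.get(target, 0) + 1
        seen[target] = n
        if n > 1:
            root, dot, ext = target.rpartition(".")
            target = f"{root}-{n}.{ext}" if dot else f"{target}-{n}"
        mapping[source] = target
    return mapping

def _copy(pair):
    source, target = pair
    shutil.copy2(source, target)

def copy_all(mapping, workers=32):
    # Copies may finish in any order, but the names were fixed by
    # plan() above, so the output layout is deterministic.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(_copy, mapping.items()))
```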