Plugin potentially causing duplicates?

ajquick commented 1 year ago

I don't have concrete proof of this, but I believe there may be a bug causing some images to be assigned to the wrong entry in the database. This appears to happen when one of the following things are true:

Very large number of photos
Very large photo file sizes
No md5hash (I add the hashes before I run Virtualize now)

Before running Virtualize:

Photo 001 - File /galleries/album/photo-1.jpg Photo 002 - File /galleries/album/photo-2.jpg Photo 003 - File /galleries/album/photo-3.jpg Photo 004 - File /galleries/album/photo-4.jpg

After running Virtualize:

Photo 001 - File /upload/etc/photo-1.jpg Photo 002 - File /upload/etc/photo-1.jpg Photo 003 - File /upload/etc/photo-3.jpg Photo 004 - File /upload/etc/photo-3.jpg

Basically for whatever reason, some photos end up with another photo's file. I cannot confirm how this happened... but I don't think the images had their md5 hashes and I was doing it with a LOT of very large images. (It should also be noted that these images were not the same).

Ideally a process would be added to only do a few images at a time as per the suggestion in #4.

ajquick commented 1 year ago

Looking at the code. It is possible for this to happen in the while loop if for example, the md5_file function throws an error.

I believe the md5 hash from the previous file would then be used rather than exit with an error?

Should a new md5 be generated if one already exists in the database?

Ex:

$file_for_md5sum  = $row['oldpath'];
$md5sum = md5_file($file_for_md5sum);

Replace with:

...
  $query = '
SELECT
    path AS oldpath,
    date_available,
    representative_ext,
    id,
    md5sum
  FROM '.IMAGES_TABLE.'
  WHERE path NOT LIKE \'./upload/%\'
;';
...

if(!empty($row['md5sum']){
    $md5sum = $row['md5sum'];
}else{
    $md5sum = md5_file($row['oldpath']);
}

if(!empty($md5sum) && !is_null($md5sum) && $md5sum !== false){
  ...
 //continue

ajquick commented 1 year ago

I implemented the above changes and the code executed on 50G of files nearly instantly. Previously it was doing the md5 hash for every single file so it would have taken a minute or two.

plegall commented 3 weeks ago

Hi @ajquick

for whatever reason, some photos end up with another photo's file. I cannot confirm how this happened

I know why. First we need to understand how the path is build for a virtualized photo. It's upload/<year of date_available>/<month of date_available>/<day of date_available>/<digits of date_available>-<first 8 chars of the md5sum>.extension. The original filename has absolutely no impact on the path, except for the extension. So for a photo added by synchronization on 2014-09-16 21:26:47 with an md5sum=99ce38dc0f4c98d4de2328986f80819a, the path will be upload/2014/09/16/20140916212647-99ce38dc.jpg.

For a given file, you always get the same md5sum. The risk to have two differents files with the same md5sum is very low. But much less low if you take only the first quarter of its length. Let's say it's still very very low.

I often use Virtualize to move a Piwigo from self-hosting to Piwigo.com and here are the 2 most common situations:

the file is empty. filesize=0. Not a single byte. In that case all empty files will have the same md5sum. A part of the path will generate duplicates
the same photo has been added in 2 different directories during the very same synchronization. For example galleries/album1/img_0001.jpg and galleries/album2/img_0001.jpg (consider it's the exact same file, not just the same filename).

When 2 photos are added during the same sync, they get the same date_available. The uniqueness of the built of the path becomes very vulnerable.

Having the same images.path for 2 different photos is a real problem. One day you will use the Batch Manager to "deduplicate" photos, obviously find 2 photos sharing the same path (because they have the same md5sum), delete one of the duplicate, and the file will have been removed also for the legit remaining photo. That prevents me to sleep with appropriate comfort.

So what needs to be done to prevent duplicates on path? During the virtualization, we need to check. If the path already exists, we need to change it for the photo being virtualized. We can replace the first 8 chars of the md5sum by strtolower(generate_key(8)) and check again if it already exists...

About the pre-calcuation of the m5sum: yes that's a good idea to check if the md5sum is already available. An even better idea would be to add a blocking warning:

some md5sum are missing, please compute them first (with the appropriate link to the Batch Manager prefiltered on photos without checksum).

plegall commented 3 weeks ago

... and also: now that I know there are some duplicate paths in the wild, we have to find them and deduplicate them. I wrote a quick and dirty script for a client, but that should be an "advanced" maintenance task maybe.

plegall / Piwigo-virtualize

Plugin potentially causing duplicates? #8