jvirkki / dupd

CLI utility to find duplicate files
http://www.virkki.com/dupd
GNU General Public License v3.0

Consider ways to make the listing operations (ls, dups, uniques) more efficient #46

Open lispstudent opened 1 year ago

lispstudent commented 1 year ago

Hello,

I am very glad to have found dupd, as it offers the best workflow for my use-case.

I have run the following command on about 150TB of data. It took about 70 hours:

# dupd scan --path /path1 --path /path2
Files:  2420698                           0 errors                       1354 s

Total duplicates: 2108486 files in 690968 groups in    238110 s
Run 'dupd report' to list duplicates

Then, I did:

cd /path2
dupd uniques

dupd started listing the files which are unique to /path2, but it is taking a very long time, with the CPU pegged at about 50%.

Is this normal? I thought that since the files have been listed in a SQLite db, printing such a list would be fast?

jvirkki commented 1 year ago

Nice to hear it has worked satisfactorily on a 150TB data set. The largest I've run it on was about 4TB.

Are you running a development branch build or a released version?

The SQLite db only stores known duplicates, it does not store info about files which were unique at the time of the scan. So when you run dupd uniques it needs to walk the filesystem to look for files it doesn't know about.
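As an aside, the sqlite3 shell's .tables and .schema commands show what the db actually contains. This is just a generic sqlite3 sketch; /root/.dupd_sqlite stands in for wherever your db file lives:

echo '.tables' | sqlite3 /root/.dupd_sqlite
echo '.schema' | sqlite3 /root/.dupd_sqlite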

I have not looked at the uniques implementation in a long time, so I don't remember all the details; it may have some opportunities for improvement. I'll review it to see.

lispstudent commented 1 year ago

The use case: a group of students in the humanities needs to work on several collections of digital documents: audio, image, text and video. Some collections are copies of each other, some are not.

They need to assign taxonomies of their choice. They can add files to the collections, rename or move files and directories according to their own chosen taxonomy principles.

In the end, the collections are "frozen", with no more changes. Then we need to ascertain which taxonomy choice best served the initially given criteria.

Often, a collection will be almost 100% a duplicate of another.

Then, we need to copy the uniques of a collection, i.e. path2, to a common pool.

That is why dupd uniques would be so useful.

If it may be of any use:

# ls -l /root/.dupd_sqlite
-rw-r--r--  1 root  wheel  263409664 May 18 08:14 /root/.dupd_sqlite

I initially thought that the SQLite database held all entries, so that such a computation, in the case of frozen collections, could be done without rescanning.

Thank you for considering this!

lispstudent commented 1 year ago

Are you running a development branch build or a released version?

We need to use our university server, which is running FreeBSD 13.2. Latest release of dupd is installed from ports: https://www.freshports.org/sysutils/dupd/

If necessary, I can ask the sysadmin to install a development version.

Thank you again.

lispstudent commented 1 year ago

I have kept dupd uniques running since the time of my report.

It seems the ls command is also slow:

dupd ls --path /path2

WARNING: database is 113 hours old, may be stale!

(The stale warning in this case should be ignored, as I know no folder has been changed.)

jvirkki commented 1 year ago

The ls, uniques and dups operations are largely the same code; only what gets shown in the output varies. So it's not surprising the runtimes are similar. There might be room for improvement there.

Yes, you can ignore the database staleness warning if you know the data set is static.

I would not recommend updating the dupd build on the server to the development version as that one is a work in progress that may or may not work. The development version (what will eventually become the 2.0 release) will reduce memory consumption by quite a bit in many scenarios but if you're not running out of RAM in your 150TB data set you should be good with the latest release version.

I assume the files are all very large, if the data set only contains 2.4M files but total size is 150TB.

jvirkki commented 1 year ago

I had forgotten that the 1.7 release has the --uniques scan flag, so you could try it.

dupd scan --uniques (any other options)

This will cause it to save all unique files in the db as well. Given your use case where nearly all files are duplicates, this may be useful.
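For your data set, that would look something like this (the same two paths as your original scan):

dupd scan --uniques --path /path1 --path /path2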

lispstudent commented 1 year ago

I would not recommend updating the dupd build on the server to the development version as that one is a work in progress that may or may not work. The development version (what will eventually become the 2.0 release) will reduce memory consumption by quite a bit in many scenarios but if you're not running out of RAM in your 150TB data set you should be good with the latest release version.

Thank you, I will not use the development version then, and wait for release 2.0.

The server is a Dell PowerEdge with 128GB Ram, and I did not notice excessive Ram usage.

I assume the files are all very large, if the data set only contains 2.4M files but total size is 150TB.

Yes, most files are large, being archival-quality digitisation of film and audio reels.

I had forgotten that the 1.7 release has the --uniques scan flag, so you could try it.

Thank you, that is very good to know. I will wait for the current dupd uniques to complete (it is now on its 3rd day), then rescan the whole pool with the --uniques scan flag, and report back.

Thank you again!

jvirkki commented 1 year ago

I just made a 1.7.2 release which makes the dupd uniques operation faster if (only if) the scan was done with the --uniques option.

Since I just made the release, it won't be available in the FreeBSD ports for a while. However, you can pretty much replicate the behavior with the previous release by first running the scan with the --uniques option and then dumping the list of unique files straight from the database:

echo 'select path from files' | sqlite3 /root/.dupd_sqlite

(replace /root/.dupd_sqlite with the location of the db, if it is elsewhere)
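If you only need the uniques under one of the scanned paths, you can also filter in the query itself; a sketch, assuming the db stores absolute paths in the path column:

echo "select path from files where path like '/path2/%'" | sqlite3 /root/.dupd_sqlite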

jvirkki commented 1 year ago

The reason these listing operations are slow is that they do an SQLite query for every file and that's just slow.

For the case of uniques, it is easiest to work around by simply showing the list of previously identified unique files as-is. As long as no new files have been added since the scan, that should be the list.

The dups operation is more complex because it needs to validate whether a duplicate is still a duplicate, which requires a db query, so it can't be skipped. And the ls operation is just dups+uniques, so it also needs to do that. These will be trickier to make more efficient.
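To illustrate the difference, here is a rough shell sketch (not dupd's actual code; /tmp/all_paths is a hypothetical list of paths to look up). One query per file means millions of round trips, while a single bulk query streams everything in one pass:

# slow pattern: one query (and here, one sqlite3 process) per file
while read -r f; do
  echo "select path from files where path = '$f';" | sqlite3 /root/.dupd_sqlite
done < /tmp/all_paths

# fast pattern: one query returning every stored path at once
echo 'select path from files' | sqlite3 /root/.dupd_sqlite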

I will leave this ticket open for 2.0 to consider ways to improve these.

lispstudent commented 1 year ago

Thank you for releasing version 1.7.2; as soon as I see it on FreshPorts I will have it installed, redo the scan and report back. By that time we shall probably have added a major extra collection and reached 180TB, so it will be interesting to see.

Thank you also for the details on SQLite operations, very instructive for me. Is this somehow related to it?

Sometimes I wish I were in programming and not the humanities. Fascinating field!

Thank you very much for all your help. Much appreciated!

jvirkki commented 1 year ago

It may take a long while to show up in ports (not sure), but as noted you can get the same outcome with the current version, just a bit more cumbersome. So no need to wait. Re-run the scan with --uniques and then run the command shown above to print the list of unique files.

(I know your system has the sqlite3 libraries installed, because dupd requires them, but it may or may not have the sqlite3 command installed. If it doesn't, ask for it to be installed in order to run the query to export the list of files.)
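A quick way to check (just a shell one-liner sketch):

command -v sqlite3 || echo 'sqlite3 command-line tool not found'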

(This is unrelated to the SQLite DISTINCT link you mentioned. dupd just stores a list of all unique files in the database but only when --uniques option is specified. It is not the default behavior because for most operations it doesn't really help and just takes more space. But for your specific use case, it is very useful.)

lispstudent commented 1 year ago

Thank you for the heads up.

I have just launched dupd scan --uniques --path /path1 --path /path2 --path /path3

I will report back when it is done.

Am I correct in assuming this will store only unique paths, and not duplicates?

jvirkki commented 1 year ago

Duplicates are always stored in the db, so this run will store the pathnames of both duplicates and unique files.

lispstudent commented 1 year ago

Understood, thanks.

dupd is a wonderful tool. It is not just the speed: I tried (almost) all of the alternatives, and dupd's workflow is so well thought out. Truly a joy to use. Thank you again for creating it.

jvirkki commented 1 year ago

Thanks! I built it for my workflow since the existing tools didn't really match how I wanted to work. I'm glad it's useful to others.

lispstudent commented 1 year ago

Interim report:

# dupd scan --uniques --path /path1 --path /path2 --path /path3

Files:  3814032                           0 errors                       2886 s
Sets :   208213/  594333 3371408527K (  66639K/s)    0q  98%B 30093f    243943 s

I will be traveling over the week-end for a seminar. I will report back on Tuesday, hopefully it will be completed by then.

lispstudent commented 1 year ago

Sorry for the delay. The scan finished a few days ago, but I wanted to check some peculiarities first.

Doing it with 3 paths took about 144 hours:

# dupd scan --uniques --path /path1 --path /path2 --path /path3                                                                                                      
Files:  3814032                           0 errors                       2886 s

Total duplicates: 3487024 files in 879858 groups in    519447 s
Run 'dupd report' to list duplicates.

The --uniques flag increased the size of the SQLite db:

# ls -hal /root/.dupd_sqlite 
-rw-r--r--  1 root  wheel   454M May 28 17:16 /root/.dupd_sqlite

There are 323429 uniques across all 3 paths:

echo 'select path from files' | sqlite3 /root/.dupd_sqlite | wc -l
  323429

That is not very useful on its own, since we need to know the uniques from only 1 path at a time.

/path1 # echo 'select path from files' | sqlite3 /root/.dupd_sqlite | grep -i "/path2" | wc -l
  195853

So, I could now move those 195853 files to a new directory.

BUT:

If I run the former command (top of this issue),

cd /path2
dupd uniques

It has already printed a few files that are not in the list above.

So now I am unsure what to do next.

Shall I simply follow the normal way and remove all the files from /path2 which are duplicates?

Is there a way to debug this issue?

jvirkki commented 1 year ago

That seems odd, hard to tell without seeing the files and paths.

Some thoughts...

I know you said the file set is static, but is there any chance new files were added after the scan started? Since dupd uniques walks the file tree as it runs, it would notice any new files and, since it doesn't know about them (they're not in the db), show them as uniques. Check the timestamps on any files shown that you didn't expect.

Since the scan was run with --uniques, the db should contain an entry for every file in the scanned filesystems (except for new files added after scan started). Try this:

find /path1 -type f -size +0 | grep -v '/\.' > /tmp/found_by_find
find /path2 -type f -size +0 | grep -v '/\.' >> /tmp/found_by_find
find /path3 -type f -size +0 | grep -v '/\.' >> /tmp/found_by_find

Now /tmp/found_by_find should have all the files scanned by dupd (I excluded zero-sized files and hidden files since dupd skips those). So there should be 3814032 file paths there.

Now compare with the set of paths in the db:

dupd report | grep ' /' | awk '{print $1}' > /tmp/dupd_duplicates
echo 'select path from files' | sqlite3 /root/.dupd_sqlite > /tmp/dupd_uniques
cat /tmp/dupd_duplicates /tmp/dupd_uniques > /tmp/found_by_dupd

So now /tmp/found_by_dupd should have the same 3814032 file paths as in /tmp/found_by_find

If not, try to identify what's different about the files missing in one or the other list.
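One way to diff the two lists is with comm, which needs sorted input (a sketch; the only_in_* file names are just examples):

sort /tmp/found_by_find > /tmp/found_by_find.sorted
sort /tmp/found_by_dupd > /tmp/found_by_dupd.sorted
# files seen by find but missing from the dupd db:
comm -23 /tmp/found_by_find.sorted /tmp/found_by_dupd.sorted > /tmp/only_in_find
# files in the dupd db but not seen by find:
comm -13 /tmp/found_by_find.sorted /tmp/found_by_dupd.sorted > /tmp/only_in_dupd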

lispstudent commented 1 year ago

Thank you for taking so much time to assist. Your thoughts, suggestions and command line examples are much appreciated.

That seems odd, hard to tell without seeing the files and paths.

I thought of sharing the SQLite db and paths, but they contain personal data which belongs to others.

I know you said the file set is static, but is there any chance new files were added after the scan started? Since dupd uniques walks the file tree as it runs, it would notice any new files and, since it doesn't know about them (they're not in the db), show them as uniques. Check the timestamps on any files shown that you didn't expect.

/path2 and /path3 are static, forever frozen. /path1 is in the process of being added to.

That is why the number 3814032 will not match precisely when using those find commands.

# cat /tmp/found_by_dupd | wc -l
 3810453

3814032 − 3810453 = 3579

Something like,

/fonds/picture/main/1990-2007 Mark/x Mark (to be sorted) 2015/Masters from old iphoto/2010/Apr 9, 2010/Scan-1-100409-0002.jpg

Would that be relevant?

jvirkki commented 1 year ago

The truncated filenames with spaces in /tmp/found_by_dupd are just due to the awk command picking the first string bounded by spaces, so it's not a dupd issue. I put that in the example above to remove the leading whitespace, but didn't think of files with spaces in them. The full paths should be in the db correctly. So run this instead:

dupd report | grep ' /' > /tmp/dupd_duplicates
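If you also want to strip the leading whitespace without breaking paths that contain spaces, sed can do it; a sketch, assuming the indentation in the dupd report output appears only at the start of each line:

dupd report | grep ' /' | sed 's/^[[:space:]]*//' > /tmp/dupd_duplicates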

With that correction, try to identify what is different about the files not included in /tmp/found_by_dupd

I'm curious to find out what might be the cause.

lispstudent commented 1 year ago

Sorry for the delay. I was quite confused by the remark above,

I excluded zero-sized files and hidden files since dupd skips those

I could not figure out why some files are reported as uniques by dupd uniques but were not listed in,

echo 'select path from files' | sqlite3 /root/.dupd_sqlite | grep -i "/path2"

It turns out many of the files reported by the "slow" method,

cd /path2 
dupd uniques

are empty.

Could it be that dupd uniques has a slightly different behaviour than,

echo 'select path from files' | sqlite3 /root/.dupd_sqlite | grep -i "/path2"

so that the former method also considers zero-length files to be unique?

jvirkki commented 1 year ago

Yes, that's it! Thanks for identifying it.

I filed a separate bug #47 for this.

dupd scan ignores zero-length files, so they will not be included in the db. My find command in the discussion above mimics this behavior with the -size +0 option for consistency with what's in the db.

However, dupd uniques does show empty files in the output (see the sample run in #47), which is inconsistent with how scan and report work.

lispstudent commented 1 year ago

Excellent, thank you for explaining it.

As a side note on our admittedly very unusual corner case: filenames are considered an integral part of a file, not some "metadata" about its content.

So, for us, files with the same size but different filenames are to be considered different. I know this clashes with the Unix Weltanschauung. Now that I know of this aspect, I can always treat zero-length files with a separate script.
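For example, something like this (just a minimal sketch; the output file name is arbitrary) would list the zero-length files under /path2 for separate handling:

find /path2 -type f -size 0 > /tmp/path2_empty_files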

Thank you again so much for considering this and the tremendous help!