Open lispstudent opened 1 year ago
Nice to hear it has worked satisfactorily on a 150TB data set. The largest I've run it on was about 4TB.
Are you running a development branch build or a released version?
The SQLite db only stores known duplicates; it does not store info about files which were unique at the time of the scan. So when you run dupd uniques it needs to walk the filesystem to look for files it doesn't know about.
I have not looked at the uniques implementation in a long time so I don't remember all the details; it may have some opportunities for improvement. I'll review it to see.
The use-case: a group of students in the humanities need to work on several collections of digital documents: audio, image, text and video. Some collections are copies of each other, some are not.
They need to assign taxonomies of their choice. They can add files to the collections, rename or move files and directories according to their own chosen taxonomy principles.
In the end, the collections are "frozen": no more changes. Then we need to ascertain which taxonomy choices best served the initially given criteria.
Often, a collection will be almost 100% a duplicate of another.
Then, we need to copy the uniques of a collection, i.e. path2, to a common pool. That is why dupd uniques would be so useful.
If it may be of any use:
# ls -l /root/.dupd_sqlite
-rw-r--r-- 1 root wheel 263409664 May 18 08:14 /root/.dupd_sqlite
I initially thought that the SQLite database was holding all entries, so that such computation, in case of frozen collections, could be done without rescanning.
Thank you for considering this!
> Are you running a development branch build or a released version?
We need to use our university server, which is running FreeBSD 13.2. The latest release of dupd is installed from ports: https://www.freshports.org/sysutils/dupd/
If necessary, I can ask the sysadmin to install a development version.
Thank you again.
I have kept dupd uniques running since the time of my report.
It seems the ls command is also slow:
dupd ls --path /path2
WARNING: database is 113 hours old, may be stale!
(The stale warning in this case should be ignored, as I know no folder has been changed.)
The ls, uniques and dups operations are largely the same code; only what gets shown in the output varies. So it's not surprising the runtimes are similar. There might be room for improvement there.
Yes, you can ignore the database staleness warning if you know the data set is static.
I would not recommend updating the dupd build on the server to the development version as that one is a work in progress that may or may not work. The development version (what will eventually become the 2.0 release) will reduce memory consumption by quite a bit in many scenarios but if you're not running out of RAM in your 150TB data set you should be good with the latest release version.
I assume the files are all very large, if the data set only contains 2.4M files but total size is 150TB.
I had forgotten that the 1.7 release has the --uniques scan flag, so could try it.
dupd scan --uniques (any other options)
This will cause it to save all unique files in the db as well. Given your use case where nearly all files are duplicates, this may be useful.
> I would not recommend updating the dupd build on the server to the development version as that one is a work in progress that may or may not work. The development version (what will eventually become the 2.0 release) will reduce memory consumption by quite a bit in many scenarios but if you're not running out of RAM in your 150TB data set you should be good with the latest release version.
Thank you, I will not use the development version then, and wait for release 2.0.
The server is a Dell PowerEdge with 128GB RAM, and I did not notice excessive RAM usage.
> I assume the files are all very large, if the data set only contains 2.4M files but total size is 150TB.
Yes, most files are large, being archival-quality digitisation of film and audio reels.
> I had forgotten that the 1.7 release has the --uniques scan flag, so could try it.
Thank you, that is very good to know. I will wait for the current dupd uniques run to complete (it is now on its 3rd day), then rescan the whole pool with the --uniques scan flag and report back.
Thank you again!
I just made a 1.7.2 release which makes the dupd uniques operation faster if (only if) the scan was done with the --uniques option.
Since I just made the release, it won't be available in the FreeBSD ports for a while. However, you can pretty much replicate the behavior with the previous release by first running the scan with the --uniques option and then dumping the list of unique files straight from the database:
echo 'select path from files' | sqlite3 /root/.dupd_sqlite
(replace /root/.dupd_sqlite with the location of the db, if it is elsewhere)
The reason these listing operations are slow is that they do an SQLite query for every file and that's just slow.
For the case of uniques, it is easiest to work around by simply showing the list of previously identified unique files as-is. As long as no new files have been added since the scan, that should be the list.
The dups operation is more complex because it needs to validate whether a duplicate is still a duplicate, which will require a db query so it can't be skipped. And the ls operation is just dups+uniques so it also needs to do that. These will be more tricky to make more efficient.
I will leave this ticket open for 2.0 to consider ways to improve these.
Thank you for releasing version 1.7.2; as soon as I see it on FreshPorts I will have it installed, redo the scan and report back. By that time we shall probably have added a major extra collection and reached 180TB, so it will be interesting to see.
Thank you also for the details on the SQLite operations, very instructive for me. Is this somehow related to SQLite's DISTINCT?
Sometimes I wish I were in programming and not the humanities. Fascinating field!
Thank you very much for all your help. Much appreciated!
It may take a long while to show up in ports (not sure), but as noted you can get the same outcome with the current version, just a bit more cumbersome. So no need to wait. Re-run the scan with --uniques and then run the command shown above to print the list of unique files.
(I know your system has the sqlite3 libraries installed, because dupd requires them, but it may or may not have the sqlite3 command installed. If it doesn't, ask for it to be installed in order to run the query to export the list of files.)
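A quick way to verify this (a generic POSIX-shell check, not a dupd feature) is:

```shell
#!/bin/sh
# Check whether the sqlite3 command-line shell is on PATH; having the
# SQLite libraries installed (which dupd requires) is not enough to run
# the export query shown above.
if command -v sqlite3 >/dev/null 2>&1; then
  echo "sqlite3 shell available"
else
  echo "sqlite3 shell missing: ask the sysadmin to install it"
fi
```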
(This is unrelated to the SQLite DISTINCT link you mentioned. dupd just stores a list of all unique files in the database, but only when the --uniques option is specified. It is not the default behavior because for most operations it doesn't really help and just takes more space. But for your specific use case, it is very useful.)
Thank you for the heads up.
I have just launched dupd scan --uniques --path /path1 --path /path2 --path /path3
I will report back when it is done.
Am I correct in assuming this will store only the paths of uniques, and not duplicates?
Duplicates are always stored in the db, so this run will store the pathnames of both duplicates and unique files.
Understood, thanks.
dupd is a wonderful tool. It is not just the speed: I tried (almost) all such tools, and the dupd workflow is so well thought-out. Truly a joy to use. Thank you again for creating it.
Thanks! I built it for my workflow since the existing tools didn't really match how I wanted to work. I'm glad it's useful to others.
Interim report:
# dupd scan --uniques --path /path1 --path /path2 --path /path3
Files: 3814032 0 errors 2886 s
Sets : 208213/ 594333 3371408527K ( 66639K/s) 0q 98%B 30093f 243943 s
I will be traveling over the week-end for a seminar. I will report back on Tuesday, hopefully it will be completed by then.
Sorry for the delay. The scan finished a few days ago, but I wanted to check some peculiarities first.
Doing it with 3 paths took about 144 hours:
# dupd scan --uniques --path /path1 --path /path2 --path /path3
Files: 3814032 0 errors 2886 s
Total duplicates: 3487024 files in 879858 groups in 519447 s
Run 'dupd report' to list duplicates.
The --uniques flag increased the size of the SQLite db:
# ls -hal /root/.dupd_sqlite
-rw-r--r-- 1 root wheel 454M May 28 17:16 /root/.dupd_sqlite
There are 323429 uniques in all 3 paths.
echo 'select path from files' | sqlite3 /root/.dupd_sqlite | wc -l
323429
That is not very useful by itself, since we need to know the uniques of only one path at a time.
/path1 # echo 'select path from files' | sqlite3 /root/.dupd_sqlite | grep -i "/path2" | wc -l
195853
So, I could now move those 195853 files to a new directory.
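A hedged sketch of that step, assuming the db stores absolute paths (as the grep above suggests): pull the list for one collection, then copy each listed file into the pool while preserving the relative layout and handling names with spaces. The directories below are toy stand-ins created under mktemp, not the real /path2 or pool location.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
mkdir -p "$work/path2/sub dir"
echo hello > "$work/path2/sub dir/a file.txt"

# In the real setup the list would come straight from the database,
# filtered in SQL rather than with grep:
#   echo "select path from files where path like '/path2/%';" \
#     | sqlite3 /root/.dupd_sqlite > /tmp/path2_uniques
# Here we build an equivalent toy list:
printf '%s\n' "$work/path2/sub dir/a file.txt" > "$work/path2_uniques"

# Copy each listed file into a pool, preserving relative layout and
# filenames containing spaces:
pool="$work/pool"
while IFS= read -r f; do
  rel=${f#"$work/path2/"}
  mkdir -p "$pool/$(dirname "$rel")"
  cp -p "$f" "$pool/$rel"
done < "$work/path2_uniques"
```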
BUT:
If I run the former command (top of this issue),
cd /path2
dupd uniques
It has already printed a few files that are not in the list above.
So now I am unsure what to do next.
Shall I simply follow the normal way, to remove all files from /path2 which are duplicates?
Is there a way to debug this issue?
That seems odd, hard to tell without seeing the files and paths.
Some thoughts...
I know you said the file set is static, but any chance new files were added after the scan started? Since dupd uniques is walking the file tree as it runs, it would notice any new files and, since it doesn't know about them (they are not in the db), show them as uniques. Check the timestamps on these files that are getting shown that you didn't expect.
Since the scan was run with --uniques, the db should contain an entry for every file in the scanned filesystems (except for new files added after scan started). Try this:
find /path1 -type f -size +0 | grep -v '/\.' > /tmp/found_by_find
find /path2 -type f -size +0 | grep -v '/\.' >> /tmp/found_by_find
find /path3 -type f -size +0 | grep -v '/\.' >> /tmp/found_by_find
Now /tmp/found_by_find should have all the files scanned by dupd (I excluded zero-sized files and hidden files since dupd skips those). So there should be 3814032 file paths there.
Now compare with the set of paths in the db:
dupd report | grep ' /' | awk '{print $1}' > /tmp/dupd_duplicates
echo 'select path from files' | sqlite3 /root/.dupd_sqlite > /tmp/dupd_uniques
cat /tmp/dupd_duplicates /tmp/dupd_uniques > /tmp/found_by_dupd
So now /tmp/found_by_dupd should have the same 3814032 file paths as /tmp/found_by_find.
If not, try to identify what's different about the files missing in one or the other list.
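One way to do that comparison is the generic sort/comm recipe below, shown here on toy three-line lists standing in for the real /tmp/found_by_find and /tmp/found_by_dupd (comm requires sorted input):

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)

# Toy stand-ins for the two lists of file paths:
printf '%s\n' /a /b /c > "$work/found_by_find"
printf '%s\n' /a /c /d > "$work/found_by_dupd"
sort "$work/found_by_find" > "$work/find.sorted"
sort "$work/found_by_dupd" > "$work/dupd.sorted"

# Paths seen by find but missing from the dupd lists:
comm -23 "$work/find.sorted" "$work/dupd.sorted" > "$work/only_in_find"
# Paths in the dupd lists that find did not produce:
comm -13 "$work/find.sorted" "$work/dupd.sorted" > "$work/only_in_dupd"
cat "$work/only_in_find"   # here: /b
```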
Thank you for taking so much time to assist. Your thoughts, suggestions and command line examples are much appreciated.
> That seems odd, hard to tell without seeing the files and paths.
I thought of sharing the SQLite db and paths, but they contain personal data which belongs to others.
> I know you said the file set is static, but any chance new files were added after the scan started? Since dupd uniques is walking the file tree as it runs, it would notice any new files and since it doesn't know about them (not in the db), show them as uniques. Check timestamps on these files that are getting shown that you didn't expect.
/path2 and /path3 are static, forever frozen. /path1 is in the process of being added to. That is why the number 3814032 will not match precisely using those find commands.
/tmp/found_by_dupd contains 3579 fewer files than the number initially reported by dupd:
# cat /tmp/found_by_dupd | wc -l
3810453
3814032 − 3810453 = 3579
Something like,
/fonds/picture/main/1990-2007 Mark/x Mark (to be sorted) 2015/Masters from old iphoto/2010/Apr 9, 2010/Scan-1-100409-0002.jpg
Would that be relevant?
The truncated filenames with spaces in /tmp/found_by_dupd are just due to the awk command picking the first string bounded by spaces, so it's not a dupd issue. I put that in the example above to remove the leading whitespace but didn't think of files with spaces in them. The full paths should be in the db correctly. So run this instead:
dupd report | grep ' /' > /tmp/dupd_duplicates
With that correction, try to identify what is different about the files not included in /tmp/found_by_dupd.
I'm curious to find out what might be the cause.
Sorry for the delay. I was quite confused by the remark above:
> I excluded zero-sized files and hidden files since dupd skips those
I could not figure out why some files are reported as unique by dupd uniques but were not listed by:
echo 'select path from files' | sqlite3 /root/.dupd_sqlite | grep -i "/path2"
It turns out many of the files reported by the "slow" method,
cd /path2
dupd uniques
are empty.
Could it be that dupd uniques has a slightly different behaviour than
echo 'select path from files' | sqlite3 /root/.dupd_sqlite | grep -i "/path2"
so that the former method also considers zero-length files unique?
Yes, that's it! Thanks for identifying it.
I filed a separate bug #47 for this.
dupd scan ignores zero-length files, so they will not be included in the db. My find command in the discussion above mimics this behavior with the -size +0 option, for consistency with what's in the db.
However, dupd uniques does show empty files in the output (see the sample run in #47), which is inconsistent with how scan and report work.
Excellent, thank you for explaining it.
As a side note: in our admittedly very unusual corner case, filenames are considered an integral part of a file, not some "metadata" about its content.
So, for us, files with the same size and different filenames are to be considered different. I know this clashes with the Unix Weltanschauung. Now that I know of this aspect, I can always treat zero-length files with a separate script.
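Such a separate pass could be as simple as a find over zero-length files. A hypothetical sketch, demonstrated on a toy directory standing in for /path2 (the files are only listed, not deduplicated, since filenames matter in this use-case):

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)

# Toy collection: one empty file and one non-empty file.
mkdir -p "$work/path2"
: > "$work/path2/empty.txt"
echo data > "$work/path2/full.txt"

# List zero-length files, which dupd scan ignores, for separate handling:
find "$work/path2" -type f -size 0 > "$work/path2_empty"
cat "$work/path2_empty"
```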
Thank you again so much for considering this and the tremendous help!
Hello,
I am very glad to have found dupd, as it offers the best workflow for my use-case. I have run the following command on about 150TB of data. It took about 70 hours:
Then, I did:
dupd started listing the files which are unique to /path2, but it is taking a very long time, with the CPU pegged at about 50%. Is this normal? I thought that since the files have been listed in an SQLite db, printing such a list would be fast?