Closed richardlehane closed 6 years ago
The command I'm using to test the -multi flag is:
sf -multi 16 -log t,o DIR
The `-log t,o` asks sf to log the time taken to stdout. Because I'm logging to stdout, the results (which we don't care about) are omitted.
If you keep running sf against the same directory in sequence you'll probably notice some caching effects on speed (i.e. subsequent runs are a lot faster because your file system is caching data and re-using it). In order to avoid these effects, suggest doing one of the following between runs:
macOS: `sudo purge`
Linux: `sudo sync && echo 3 > /proc/sys/vm/drop_caches`
If you're really unsure you can also power cycle your computer between runs!
Please share information about each run (what command flags you used e.g. whether you used the -hash flag; what multi setting you used; and the time taken), about your system (operating system, type of HDD, amount of RAM, CPU), about your version of siegfried (including what PRONOM sigs you are using); and about the corpus you are testing (e.g. number of files, total volume). For extra points you can upload your results file to the new charting tool.
i7-3770 CPU @ 3.40GHz 8GB RAM 5400 RPM spindle disk
I have a test set of data which is 46400 files, 13GB.
I'm running sf -log t -json -hash md5 data > sf-test.json
We're comparing Siegfried to running `hashdeep` followed by `file`. Siegfried blows them out of the water. It took hashdeep & file 14 minutes; Siegfried, with no `-multi` flag, took 3m58s. Using different `-multi` options, it was only faster using 2.
-multi 256 = 7m5s (and crippled my machine!)
-multi 128 = 6m26s
-multi 64 = 5m39s
-multi 32 = 5m13s
-multi 16 = 4m36s
-multi 8 = 4m31s
-multi 4 = 3m58s
-multi 2 = 3m25s
On a VM with 8CPUs, 8GB RAM and RAIDed SSDs on the host, same dataset, the results were very different.
No multi = 6m5s
-multi 256 = 32s
-multi 128 = 32s
-multi 64 = 32s
-multi 32 = 32s
-multi 16 = 33s
-multi 8 = 38s
-multi 4 = 1m8s
-multi 2 = 1m57s
Seems 32s is my maximum!
Thanks very much for this data @fozboz (& you're the first! I've been remiss in not yet posting any results myself, but hope to do so soon)
Interesting things:
Another option may be to leave the -multi default as it is but allow for some kind of configuration file where users could set their own defaults (and potentially change other defaults too like output format). This might be an sf.conf file in the user's sf home directory.
With the configuration file, I'm assuming that's when running Siegfried in server mode? I haven't explored that much yet but that's a must-have for us. I will have a look and contribute the change if I can find time.
At some point we plan to start using Siegfried as a local web microservice and will test different settings running from spindles, SSDs, NFS and Swift.
What are the default multi-threading options? It looked to me like it was spawning two parent threads and then one thread-per-core. How is multi-threading implemented? Is it one thread per file?
for the configuration file, if implemented I'd expect it would work both in server and non-server mode. At the moment all the configuration needs to be passed in as command line flags (e.g. `-json`, `-multi x`, `-log x,y,z` etc.). A configuration file would just let you store those flag options in a file to save typing them each time. I've hesitated implementing this as users who want it can just do a bash alias anyway - this would really just be an extra convenience. I expect it would be especially useful for the `-multi` flag as that's really a flag you'd just want to set once for a particular PC or server, i.e. it isn't the type of option you swap and change.
re. default multi-threading options: the go runtime manages all that transparently. The -multi flag doesn't operate at that level, it just sets an upper limit on the number of goroutines (lightweight threads - https://blog.nindalf.com/posts/how-goroutines-work/) that will spawn during a file walk. The default is currently one. Even if multi is set to one, sf will still spawn a small number of other goroutines to do other bits of work & the go runtime may parallelize those i.e. having -multi set to one won't make sf run in a single thread necessarily.
I don't think too many people are running sf in server mode so you may uncover bugs... please don't hesitate to let me know if anything goes wrong!
I've implemented a first go at a configuration file on the develop branch: https://github.com/richardlehane/siegfried/tree/develop
The new configuration system works like this:
- If you include the `-setconf` flag, sf saves any flags used to a config file (e.g. `sf -csv -multi 64 -log warn,error,chart,progress -setconf`)
- Use the `-conf` flag to set the config file/path name (e.g. `sf -multi 64 -serve localhost:5153 -conf serve.conf -setconf` and then `sf -conf serve.conf`)

This should make using `-multi` much easier for users. E.g. in the example below I set my default to 16 (tripling my speed over the existing default of 1).
the latest release (v1.7.9) includes a configuration feature which I hope will cause users to set sensible -multi defaults appropriate for their environments. The config feature is described in this post: https://www.itforarchivists.com/post/sf179/
Closing this issue as I will leave the default at -multi 1 and let users set their own custom default.
The `-multi` flag puts sf into a parallel mode where it scans more than one file at once (up to the limit set by the flag, e.g. `-multi 16`). The default value is 1.

Using the `-multi` flag can give a nice speed boost if your drive is an SSD (e.g. if I use `-multi 256` I get a 3X speed boost on my laptop). Importantly, this speed boost is free: there are no impacts on the order or quality of results. But using the flag potentially means more processor and memory load, and it can actually cause marginal slowdowns if you have a slow PC with a spinning HDD. Furthermore, the flag doesn't stack with unzipping (the `-z` flag): using `-z` automatically drops the multi value to 1.
Qs: should the default be changed? Would a majority of users benefit? If changed, what is the optimum setting to increase speed but not impact users with slower PCs? E.g. suggest one of 256, 128, 64, 32, 16, 8, 4, 2?
I'm not going to rush a change on this one & I'm very interested to hear from users willing to test different -multi settings on their machines.