richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

[Call for comment] Change default -multi setting #106

Closed richardlehane closed 6 years ago

richardlehane commented 6 years ago

The -multi flag puts sf into a parallel mode where it scans more than one file at once (up to the limit set by the flag e.g. -multi 16). The default value is 1.

Using the -multi flag can give a nice speed boost if your drive is an SSD (e.g. with -multi 256 I get a 3X speed boost on my laptop). Importantly, this speed boost is free: there are no impacts on the order or quality of results. But using the flag potentially means more processor and memory load, and it can actually cause marginal slowdowns if you have a slow PC with a spinning HDD. Furthermore, the flag doesn't stack with unzipping (-z flag): using the -z flag automatically drops the multi value to 1.

Qs: should the default be changed? Would a majority of users benefit? If changed, what is the optimum setting to increase speed but not impact users with slower PCs? E.g. suggest one of 256, 128, 64, 32, 16, 8, 4, 2?

I'm not going to rush a change on this one & I'm very interested to hear from users willing to test different -multi settings on their machines.

richardlehane commented 6 years ago

A recommended process for testing -multi flag

Sample command

The command I'm using to test -multi flag is:

sf -multi 16 -log t,o DIR

The -log t,o asks sf to log the time taken to stdout. Because I'm logging to stdout the results (which we don't care about) are omitted.

Avoid caching effects

If you keep running sf against the same directory in sequence you'll probably notice some caching effects on speed (i.e. subsequent runs are a lot faster because your file system is caching data and re-using it). To avoid these effects, I suggest doing one of the following between runs:

If you're really unsure you can also power cycle your computer between runs!

Documenting your results

Please share information about each run (what command flags you used e.g. whether you used the -hash flag; what multi setting you used; and the time taken), about your system (operating system, type of HDD, amount of RAM, CPU), about your version of siegfried (including what PRONOM sigs you are using); and about the corpus you are testing (e.g. number of files, total volume). For extra points you can upload your results file to the new charting tool.

fozboz commented 6 years ago

i7-3770 CPU @ 3.40GHz, 8GB RAM, 5400 RPM spindle disk

I have a test set of data which is 46400 files, 13GB.

I'm running sf -log t -json -hash md5 data > sf-test.json

We're comparing Siegfried to running hashdeep followed by file. Siegfried blows them out of the water: hashdeep & file took 14 minutes, while Siegfried, with no -multi flag, took 3m58s.

Of the -multi values I tried, only 2 was faster than the default.

-multi 256 = 7m5s (and crippled my machine!)
-multi 128 = 6m26s
-multi 64 = 5m39s
-multi 32 = 5m13s
-multi 16 = 4m36s
-multi 8 = 4m31s
-multi 4 = 3m58s
-multi 2 = 3m25s

On a VM with 8CPUs, 8GB RAM and RAIDed SSDs on the host, same dataset, the results were very different.

No multi = 6m5s

-multi 256 = 32s
-multi 128 = 32s
-multi 64 = 32s
-multi 32 = 32s
-multi 16 = 33s
-multi 8 = 38s
-multi 4 = 1m8s
-multi 2 = 1m57s

Seems 32s is my maximum!
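To compare runs like these as ratios rather than raw times, the speedup is simply the baseline time divided by the run time. A minimal sketch, assuming the timings are written in the `3m58s` style used in this thread (`speedup` is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"time"
)

// speedup returns baseline/run; values above 1 mean the run beat the baseline.
func speedup(baseline, run string) (float64, error) {
	b, err := time.ParseDuration(baseline)
	if err != nil {
		return 0, err
	}
	r, err := time.ParseDuration(run)
	if err != nil {
		return 0, err
	}
	if r <= 0 {
		return 0, fmt.Errorf("run time must be positive, got %q", run)
	}
	return b.Seconds() / r.Seconds(), nil
}

func main() {
	// Timings taken from the two machines reported above.
	cases := []struct{ label, baseline, run string }{
		{"spindle disk, -multi 2", "3m58s", "3m25s"},
		{"spindle disk, -multi 256", "3m58s", "7m5s"},
		{"VM with SSDs, -multi 32", "6m5s", "32s"},
	}
	for _, c := range cases {
		s, err := speedup(c.baseline, c.run)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %.2fx\n", c.label, s)
	}
}
```

On these numbers the spindle disk gains a little at -multi 2 and loses badly at -multi 256, while the SSD-backed VM is more than ten times faster at -multi 32 than with no -multi flag.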

richardlehane commented 6 years ago

Thanks very much for this data @fozboz (& you're the first! - I've been remiss in not posting any results myself yet but hope to do so soon)

Interesting things:

Another option may be to leave the -multi default as it is but allow for some kind of configuration file where users could set their own defaults (and potentially change other defaults too like output format). This might be an sf.conf file in the user's sf home directory.

fozboz commented 6 years ago

With the configuration file, I'm assuming that's when running Siegfried in server mode? I haven't explored that much yet but that's a must-have for us. I will have a look and contribute the change if I can find time.

At some point we plan to start using Siegfried as a local web microservice and will test different settings running from spindles, SSDs, NFS and Swift.

What are the default multi-threading options? It looked to me like it was spawning two parent threads and then one thread-per-core. How is multi-threading implemented? Is it one thread per file?

richardlehane commented 6 years ago

For the configuration file, if implemented I'd expect it would work both in server and non-server mode. At the moment all the configuration needs to be passed in as command line flags (e.g. -json, -multi x, -log x,y,z etc.). A configuration file would just let you store those flag options in a file to save typing each time. I've hesitated to implement it as users who want this can just do a bash alias anyway - this would really just be an extra convenience. I expect it would be especially useful for the -multi flag as that's really a flag you'd just want to set once for a particular PC or server i.e. it isn't the type of option you swap and change.

Re. default multi-threading options: the Go runtime manages all that transparently. The -multi flag doesn't operate at that level, it just sets an upper limit on the number of goroutines (lightweight threads - https://blog.nindalf.com/posts/how-goroutines-work/) that will spawn during a file walk. The default is currently one. Even if multi is set to one, sf will still spawn a small number of other goroutines to do other bits of work & the Go runtime may parallelize those i.e. having -multi set to one won't necessarily make sf run in a single thread.
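As a generic illustration of that kind of upper limit (not siegfried's actual code), a common Go pattern is a buffered channel used as a counting semaphore. In this sketch `processAll`, `limit` and `work` are hypothetical names, and the peak counter exists only to show that the cap holds. It makes no attempt to keep results in order.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// processAll calls work for every path with at most limit goroutines running
// at once, and returns the highest number of concurrent workers it observed.
func processAll(paths []string, limit int, work func(string)) int64 {
	sem := make(chan struct{}, limit) // one slot per allowed goroutine
	var wg sync.WaitGroup
	var active, peak int64

	for _, p := range paths {
		sem <- struct{}{} // blocks while limit workers are already in flight
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this worker ends

			n := atomic.AddInt64(&active, 1)
			for { // record a new peak if this is the busiest moment so far
				old := atomic.LoadInt64(&peak)
				if n <= old || atomic.CompareAndSwapInt64(&peak, old, n) {
					break
				}
			}
			work(p)
			atomic.AddInt64(&active, -1)
		}(p)
	}
	wg.Wait()
	return peak
}

func main() {
	paths := []string{"a", "b", "c", "d", "e", "f"}
	peak := processAll(paths, 2, func(string) {})
	fmt.Println("peak never exceeded the limit:", peak <= 2)
}
```

The limit only bounds the workers started by this loop. Any other goroutines in the program, and how the runtime schedules them onto OS threads, are unaffected, which matches the point above that a limit of one does not guarantee a single thread.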

I don't think too many people are running sf in server mode so you may uncover bugs... please don't hesitate to let me know if anything goes wrong!

richardlehane commented 6 years ago

I've implemented a first go at a configuration file on the develop branch: https://github.com/richardlehane/siegfried/tree/develop

The new configuration system works like this:

This should make using -multi much easier for users. E.g. in the example below I set my default to 16 (tripling my speed over the existing default of 1).

image

richardlehane commented 6 years ago

The latest release (v1.7.9) includes a configuration feature which I hope will encourage users to set sensible -multi defaults appropriate for their environments. The config feature is described in this post: https://www.itforarchivists.com/post/sf179/

Closing this issue as I will leave the default at -multi 1 and let users set their own custom default.