tjt263 / fits

Automatically exported from code.google.com/p/fits
GNU Lesser General Public License v3.0

Processing Performance #20

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I don't have a specific bug to report, but I was wondering whether any performance
profiling has been done on the FITS tool to identify bottlenecks we might be able
to eliminate.  In my experience so far, FITSifying a file takes on average about
7 seconds.  For a collection of 30,000 files, this means it would take approximately
(30,000 x 7) / 3600 ≈ 58 hours to process everything.  30k files is a fairly small
number for the large datasets a tool like FITS would benefit, i.e. large
scientific/cultural/historical datasets that need to be preserved for the long
term.  Perhaps there are some trivial performance optimizations already
available of which I am unaware.

Thanks

DW

Original issue reported on code.google.com by david.wa...@gmail.com on 4 Apr 2011 at 9:03

GoogleCodeExporter commented 8 years ago
I don't have any specific performance metrics, but 7 seconds does seem like a
long time.  The speed of FITS can vary based on the types of files being
analyzed and which tools are being invoked.  If you know certain tools are not
useful for the files you are processing, you can disable them, either
through the API or by commenting them out of the fits.xml configuration file.
In my experience, Jhove and the NLNZ Metadata Extractor usually take the longest
amount of time and read in much more data from the file than the other tools.
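For reference, disabling a tool this way is just a matter of commenting out its entry in fits.xml. The excerpt below is a hypothetical sketch only: the exact element names and tool class names vary between FITS releases, so check the fits.xml that ships with your copy before editing.

```xml
<!-- Hypothetical excerpt of fits.xml; element and class names are
     illustrative and may differ in your FITS release. -->
<tools>
  <tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool" />
  <!-- Commented out to skip the slower extractors:
  <tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove" />
  <tool class="edu.harvard.hul.ois.fits.tools.nlnz.MetadataExtractor" />
  -->
</tools>
```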

Original comment by spencer_...@harvard.edu on 5 Apr 2011 at 2:47

GoogleCodeExporter commented 8 years ago
I am also interested in this issue.
I haven't looked through the code much, so I'm sorry if my questions seem
obvious, but could you please elaborate on the following:

As far as I know, Jhove needs a lot of time for its configuration, but once that
is done, extracting data from different files does not take very long. So perhaps
if there were a way to invoke FITS on a set of files instead of a single file,
it would improve performance.

Thanks in advance!
P.

Original comment by PePet...@gmail.com on 5 Apr 2011 at 4:14

GoogleCodeExporter commented 8 years ago
That's a good point.  If you are using the FITS command line to process files
one at a time, every tool gets re-initialized for every file that is processed.
If you're using the Java API, the tools are initialized once and you can then
pass in as many files as you want.

A feature to run FITS against a directory of files, without that
re-initialization, would be a good enhancement.
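The cost difference being described can be sketched generically. None of the class or method names below are real FITS API names; this is a self-contained illustration of why a one-shot CLI run pays the initialization cost per file, while a long-lived object pays it once.

```java
// Sketch: per-file initialization (CLI style) vs. initialize-once (API style).
// All names here are hypothetical; this is not FITS code.
public class InitOnceSketch {
    static int initCount = 0;

    static class Analyzer {
        Analyzer() { initCount++; }              // stand-in for expensive tool setup
        String examine(String path) { return "ok:" + path; }
    }

    // One-shot style: a fresh Analyzer per file, like invoking the CLI per file.
    static int oneShot(String[] files) {
        initCount = 0;
        for (String f : files) new Analyzer().examine(f);
        return initCount;
    }

    // API style: initialize once, reuse for every file.
    static int reuse(String[] files) {
        initCount = 0;
        Analyzer a = new Analyzer();
        for (String f : files) a.examine(f);
        return initCount;
    }

    public static void main(String[] args) {
        String[] files = {"a.tif", "b.pdf", "c.xml"};
        System.out.println(oneShot(files));      // one initialization per file
        System.out.println(reuse(files));        // a single initialization
    }
}
```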

Original comment by spencer_...@harvard.edu on 5 Apr 2011 at 5:27

GoogleCodeExporter commented 8 years ago
Hello from the archivematica project.

We've noticed this potential improvement as well.
Something I'd like to point out is that there are some similarities with clamscan.
Each time it's initialized, it loads its rule set, so scanning individual files
with clamscan takes a long time. Their solution, which I like, was to create a
daemon (clamdscan) that acts as a local server to scan the files and holds
the rules in memory. The command-line call to clamdscan uses the same
parameters as clamscan and sends the request to the daemon. This way, the user
sees very little change in their implementation.
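The clamdscan pattern described above can be sketched in plain Java (standard library only; nothing here is FITS or ClamAV code): a daemon thread loads its "rules" once and keeps them in memory, and a thin client forwards each request over a local socket, so repeated scans skip the expensive setup.

```java
import java.io.*;
import java.net.*;

// Hypothetical sketch of the daemon pattern: one expensive load, many cheap requests.
public class ScanDaemonSketch {
    // Stand-in for a rule set that is slow to load; loaded once, held in memory.
    static final String RULES = "rules-v1";

    // What the daemon answers for one request, factored out for clarity.
    static String respond(String path) {
        return "scanned " + path + " with " + RULES;
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {   // daemon side
            int port = server.getLocalPort();
            Thread daemon = new Thread(() -> {
                try {
                    while (true) {
                        try (Socket s = server.accept();
                             BufferedReader in = new BufferedReader(
                                     new InputStreamReader(s.getInputStream()));
                             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                            out.println(respond(in.readLine()));
                        }
                    }
                } catch (IOException closed) { /* server shut down */ }
            });
            daemon.setDaemon(true);
            daemon.start();

            // Client side: takes the same "parameters" as a one-shot scanner,
            // but each request reuses the already-loaded rules.
            for (String f : new String[]{"a.tif", "b.pdf"}) {
                try (Socket s = new Socket("localhost", port);
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()))) {
                    out.println(f);
                    System.out.println(in.readLine());
                }
            }
        }
    }
}
```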

Original comment by josephPe...@gmail.com on 21 Jul 2011 at 8:26

GoogleCodeExporter commented 8 years ago
FITS now includes a -r option to recursively process directories of files.  
Each tool is also invoked in a separate thread which has improved performance.

Original comment by spencer_...@harvard.edu on 25 Apr 2012 at 1:15