openpreserve / jpylyzer

JP2 (JPEG 2000 Part 1) validator and properties extractor. Jpylyzer was specifically created to check that a JP2 file really conforms to the format's specifications. Additionally, jpylyzer is able to extract technical characteristics.
http://jpylyzer.openpreservation.org/

"Slow" validation of several thousands JP2 files #122

Closed: sviscapi closed this issue 5 years ago

sviscapi commented 5 years ago

Hi all,

I'm quite new to jpylyzer (v1.18) so please excuse me if the question is really stupid :)

We recently had to validate an archive of ~3600 JP2 images, each about 2-3 MB. Validation failed because a timeout, set to 1 hour in our workflow, was reached. If my math is correct, that means we would need to process ~60 images per minute; I don't know whether that's good or bad, performance-wise. The server running the application sports 8 vCPUs and 12 GB of RAM, by the way. I had a quick look at jpylyzer.py, and if I understand correctly, the checkFiles function processes files sequentially:

https://github.com/openpreserve/jpylyzer/blob/master/jpylyzer/jpylyzer.py

    # Process the input files
    for path in existingFiles:

Could it be the root cause of this issue ? I.e: there were just too many files in that archive for the validation to complete in one hour ? This is not a blocking issue though, we just had to raise the timeout value for that particular archive.
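For reference, here is a minimal sketch of how we could estimate the expected runtime up front next time, assuming the jpylyzer command is on the PATH; the sample file name and file count below are just placeholders for our archive:

    import subprocess
    import time

    SAMPLE = "sample.jp2"   # placeholder: one representative 2-3 MB image
    N_FILES = 3621          # number of images in the archive

    # Time one jpylyzer run and extrapolate linearly, since checkFiles
    # handles the input files one after the other.
    start = time.time()
    subprocess.run(["jpylyzer", SAMPLE], stdout=subprocess.DEVNULL, check=True)
    per_file = time.time() - start

    print("estimated total: %.1f minutes" % (per_file * N_FILES / 60))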

Best, Samuel from CINES

https://www.cines.fr/en/

bitsgalore commented 5 years ago

Hi Samuel,

I don't have any recent performance statistics handy, but some 5 years ago colleagues at the Danish SB did performance tests as part of the SCAPE project. These are described at the links below:

The platform they used was a Hadoop cluster made up of 4 physical servers. Full details here:

http://wiki.opf-labs.org/display/SP/SB+Hadoop+Platform

Dividing the 20,000 files/hr result of that experiment by 4 (one share per physical server in the cluster) yields 5,000 files/hr, which is not too far off the 3,600/hr you're getting. (Correction: they actually reached 65,000 files/hr.)

Also, I don't know how the data set they used compares to your situation, but 60 files per minute doesn't strike me as too bad. So yes, I think the root cause here is that your timeout value is simply too restrictive.

I'm closing the issue now, as this is not really a problem with jpylyzer. However, feel free to add any additional questions you may have to this thread!

Cheers,

Johan

sviscapi commented 5 years ago

Hi Johan,

Thanks for your quick reply and for the links to that presentation; that's really appreciated. Actually, I think our throughput was less than 60 files per minute: there were 3621 files in the archive and the timeout was set to one hour, so we would have needed about 60.35 files per minute, and we didn't make it, hence the timeout issue.

I did not monitor the server when that happened, but I think we still have the original archive handy, so maybe we could just replay the whole process in order to get more precise figures. I'll browse the logs for more information though.

Cheers,

Samuel

P.S.: is jpylyzer / Python capable of using the 8 cores on the server?

bitsgalore commented 5 years ago

P.S.: is jpylyzer / Python capable of using the 8 cores on the server?

Not 100% sure how Python handles this internally, but I think not. Come to think of it, this probably explains the relatively poor performance you're getting. If I understand correctly, your workflow uses a single jpylyzer call for all 3621 files in the archive. You might want to try using separate jpylyzer calls for each individual image. If you're running your workflow in a Linux-based environment, you might be able to speed things up further by using GNU Parallel (see also this article).
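To illustrate the "one jpylyzer call per image" idea, here is a minimal sketch (not part of jpylyzer itself) that spreads the individual calls over several workers from Python; the input directory, output directory and worker count are placeholders you would adapt to your setup:

    import glob
    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    IN_DIR = "archive"      # placeholder: directory holding the ~3600 JP2 images
    OUT_DIR = Path("xml")   # placeholder: one XML report per image ends up here
    WORKERS = 8             # e.g. one worker per vCPU

    OUT_DIR.mkdir(exist_ok=True)

    def validate(jp2):
        # One jpylyzer call per image; the XML report goes to stdout,
        # so capture it and save it under the image's name.
        result = subprocess.run(["jpylyzer", jp2], capture_output=True, check=True)
        (OUT_DIR / (Path(jp2).stem + ".xml")).write_bytes(result.stdout)

    # Threads are enough here: the real work is done by the child jpylyzer
    # processes, so the single Python process is no longer the bottleneck.
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        list(pool.map(validate, sorted(glob.glob(IN_DIR + "/*.jp2"))))

GNU Parallel gets you the same effect with a one-line shell command, which is probably the simpler route in a Linux-based workflow.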