hubmapconsortium / ingest-validation-tests


Plugin optimization #48

Closed gesinaphillips closed 9 months ago

gesinaphillips commented 9 months ago

Validator parent class (in IVT) accepts a list of data_paths rather than a single path at a time. Files are then collected and processed in parallel in the various plugin subclasses. Testing indicates that this cuts fastq processing time in half (at least for a <10GB upload). EDIT: or maybe way less!
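Rough sketch of the pattern, for anyone skimming: a plugin takes several data paths, collects matching files across all of them, and checks the files concurrently. This is not the actual IVT code. `SketchValidator`, `collect_files`, `check_file`, the `*.fastq` glob, and the empty-file check are all made up for illustration, and a standard-library thread pool stands in for whatever parallel mechanism the plugins really use.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, List, Optional


class SketchValidator:
    """Hypothetical stand-in for a plugin subclass; not the IVT Validator API."""

    pattern = "*.fastq"  # assumed glob for the files this plugin handles

    def __init__(self, data_paths: Iterable[Path], threads: int = 4):
        # Accept a list of paths up front instead of one path per call.
        self.data_paths = [Path(p) for p in data_paths]
        self.threads = threads

    def collect_files(self) -> List[Path]:
        """Gather matching files from every data path before any checking."""
        files: List[Path] = []
        for base in self.data_paths:
            files.extend(sorted(base.rglob(self.pattern)))
        return files

    def check_file(self, path: Path) -> Optional[str]:
        """Placeholder per-file check: report empty files, else return None."""
        if path.stat().st_size == 0:
            return f"{path.name} is empty"
        return None

    def collect_errors(self) -> List[str]:
        """Run the per-file check over all collected files concurrently."""
        files = self.collect_files()
        with ThreadPoolExecutor(max_workers=self.threads) as pool:
            # map() yields results in input order, so output is deterministic.
            results = list(pool.map(self.check_file, files))
        return [r for r in results if r is not None]
```

Collecting the full file list first means the pool sees all the work at once, rather than being spun up once per path.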

The CODEX and Publication plugins accept data_path lists but are not currently parallelized; they can be updated to be. Testing for these plugins is already quite fast, so I focused on the large-file plugins (fastq, gz, ome.tiff, tiff).

Parallelizing made redundancy checking in fastq_validator_logic unreliable. I moved this logic outside of the Engine call that processes files in parallel.
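To illustrate the shape of that change (a sketch only, not the code in fastq_validator_logic): per-file checks run in the pool, and the cross-file check runs once, serially, on the gathered results. I am assuming here that "redundancy" means the same filename showing up more than once. `validate_one`, `find_duplicate_names`, and `validate_all` are hypothetical names, and a thread pool again stands in for the Engine call.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Tuple


def validate_one(path: Path) -> Tuple[Path, List[str]]:
    """Per-file work that needs no shared state (placeholder check)."""
    errors: List[str] = []
    if path.suffix != ".fastq":
        errors.append(f"{path.name}: unexpected extension")
    return path, errors


def find_duplicate_names(paths: List[Path]) -> List[str]:
    """Cross-file check: runs serially, after all workers have finished."""
    counts = Counter(p.name for p in paths)
    return [
        f"{name} appears {n} times"
        for name, n in sorted(counts.items())
        if n > 1
    ]


def validate_all(paths: List[Path], threads: int = 4) -> List[str]:
    # Parallel phase: each worker only looks at its own file.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(validate_one, paths))
    errors = [e for _, errs in results for e in errs]
    # Serial phase: the check that needs to see every file at once.
    errors.extend(find_duplicate_names([p for p, _ in results]))
    return errors
```

Keeping the cross-file check out of the workers means it never depends on state that individual workers may not share or may update in an unpredictable order.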

Tested:

Note