nanoporetech / pod5-file-format

Pod5: a high performance file format for nanopore reads.
https://pod5-file-format.readthedocs.io/

pod5 subset/filter in preparation for dorado duplex is slow #112

Closed wilsonte-umich closed 7 months ago

wilsonte-umich commented 7 months ago

Other issues have explored this, but nothing addressed my issue of performance/speed. I'm trying to follow these suggestions: https://github.com/nanoporetech/dorado?tab=readme-ov-file#improving-the-speed-of-duplex-basecalling to ensure that all reads from a channel are present in the same POD5 file prior to running dorado duplex in batch mode.

The pod5 view step works well and quickly (perhaps a few minutes?). However, I tried a couple of times to run pod5 subset, and the jobs were going to take 2 days (or more?) to run on a dataset from one PromethION flow cell.
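For reference, the pattern I'm following from that guide looks roughly like this (directory and file names are mine/illustrative):

pod5 view ${INPUT_DIR} --include "read_id, channel" --output summary.tsv
pod5 subset ${INPUT_DIR} --summary summary.tsv --columns channel --output split_by_channel/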

I saw in another issue the suggestion to "use the simpler filter tool and repeat the process per-channel", given resource usage by subset. I set that up in my batch processing pipeline, but it is still frustratingly slow.
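Concretely, the per-channel-group filter call in my pipeline looks something like this (paths are placeholders; channelGroup.IDS holds the read ids for one group of channels, extracted from the pod5 view output):

pod5 filter ${INPUT_DIR}/*.pod5 --ids channelGroup.IDS --output channelGroup.pod5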

Here's a snippet of the progress - my code took all the read ids from 50 of the >2K channels as a filter group.

Parsed 92550 reads_ids from: channelGroup.IDS
Found 8669416 read_ids from 2168 inputs
Calculated 92550 transfers
Filtering:   6%|5         | 5414/92550 [14:37<3:55:16,  6.17Read/s]

If I read it correctly, the rate is in individual reads per second, and this one part of the total dataset is going to take multiple hours - again, too slow to be practical.

Is this expected behavior for subsetting/filtering, or due to some variable in my environment? I'm working on a shared compute cluster on a node with plenty of CPU and RAM, but the POD5 files are on a shared drive (/scratch). If the suggestion is to move all the files to a faster drive, that is tricky because I don't have many local/SSD options capable of holding the entire set of files at once - which is essential if the point is to extract all reads from the same channel from all of the input time-series POD5 files.

Input appreciated; with current performance I'm going to have to skip subsetting by channel, even though it obviously makes sense to do.

HalfPhoton commented 7 months ago

Hi @wilsonte-umich, the performance of pod5 subset/filter is severely affected by network file storage, as well as by the very large number of input files. The number of files causes issues because we need to be quite conservative about how many file handles we have open at any one time, due to the limits set by various operating systems. Also, because pod5 filter only writes one output, we get worse overall performance than subset (there is less parallelism), although it does use fewer resources.

You will get better performance from running subset on local file storage, as the pod5 tools are generally IO bound. If you cannot store all of the data locally at once, then my recommendation would be to run subset in batches of input files (~500 files?) and subset by channel within each batch. Using multiple nodes will also help if you have access to that option.

These per-channel files can then be merged. Alternatively, you can pass dorado a collection of a few channels' worth of data without merging, as this still improves the performance of the pairing algorithm.
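For example, once all batches have been processed, the per-channel outputs could be combined with something like the following (file names are illustrative and assume the batch-suffixed naming shown below):

pod5 merge 123.batch-*.pod5 --output channel-123.pod5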

You can use the --template option to add information to the output filenames (docs) to save you re-naming files between batches.

pod5 subset .... --template "{channel}.batch-1.pod5"
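As a rough sketch of the batching idea (the batch lists, staging directory and other paths are placeholders, not a prescribed layout):

BATCH=1
for FILE_LIST in batch_lists/*.txt; do                   # each list names ~500 pod5 files
    LOCAL=${TMPDIR:-/tmp}/pod5_batch_${BATCH}            # local (SSD) staging directory
    mkdir -p ${LOCAL}
    xargs -a ${FILE_LIST} cp -t ${LOCAL}                 # stage this batch off the network drive
    pod5 view ${LOCAL} --include "read_id, channel" --output ${LOCAL}.summary.tsv
    pod5 subset ${LOCAL} --summary ${LOCAL}.summary.tsv --columns channel \
        --output ${OUTPUT_DIR} --template "{channel}.batch-${BATCH}.pod5"
    rm -r ${LOCAL}
    BATCH=$((BATCH + 1))
done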


We are looking into the performance and resource consumption of filter and subset - which will improve in future releases.

Lastly - I believe MinKNOW is reducing the number of pod5 files written by moving away from the legacy fast5 style of writing 4k records per file, so things will improve over time.

Kind regards, Rich

wilsonte-umich commented 7 months ago

Thanks, yes - since the problem appears to be IO, I agree the solution will be to subset in batches, moving each batch to local SSD first, and to merge those products when done. I will report back.

Re: MinKNOW - I'll be interested in what "reducing" the file number means. I agree >2K files per run is excessive, but on the flip side, a single monolithic POD5 in excess of a terabyte substantially impedes handling - our solutions benefit from batching.

Of course, subsetting by channel would create >2K files... the code I will share later instead establishes channel groups (1-10, 11-20, etc.) to reduce the number of files generated by subset, since I've seen in a number of places that simply having more pod5 files creates a performance hit.

FWIW, I don't consider myself resource constrained; our university cluster is very well equipped, which is why I have no interest in building a compute solution dedicated to nanopore. End users like me need solutions that work well in shared-resource environments. Batching dorado runs has worked very well! Just need to solve this one now...

wilsonte-umich commented 7 months ago

Nearly done implementing what we discussed, and it's going well. But it raises a question I'd appreciate clarity on.

I'm creating the read-to-channel mapping by defining channel_group over a series of consecutive channels:

CHANNEL_GROUP_SIZE=50
# dump read_id and channel for every read, then append a channel_group column:
# keep a header label on the header line, otherwise map channel N to int(N / 50)
pod5 view ${INPUT_DIR} --include "read_id, channel" --output ${SUMMARY_FILE}.tmp
awk 'BEGIN{OFS = "\t"}{
    print $0, ($1 ~ /read_id/ ? "channel_group" : int($2 / '$CHANNEL_GROUP_SIZE'));
}' ${SUMMARY_FILE}.tmp > ${SUMMARY_FILE}

Then calling subset against channel_group as follows:

pod5 subset ${INPUT_DIR} --summary ${SUMMARY_FILE} --columns channel_group --output ${OUTPUT_DIR}

Thus, the output files have multiple channels, but all reads from a channel are in the same file. Is there any problem with that? Or does the downstream efficiency gain in dorado duplex depend on there being just one channel in one file?

HalfPhoton commented 7 months ago

Yes, you'll get a performance improvement with your solution of batching into collections of channels.

The smaller the batch size, the greater the improvement for the pairing algorithm, but I suspect there will be diminishing returns (e.g. very little difference between 1 and 2 channels per file). Conversely, there's a trade-off with the time it takes dorado to start up: with many tiny files, especially on a small dataset, you pay the cost of loading dorado, the model, the reference, auto batch-size selection, etc. every time, which makes it just not worth going that far.

In a similar pipeline that I wrote, I found that 100 channels per file was a good balance between the number of files and performance, so that is the default I chose. I've seen users go up to 200 successfully with smaller datasets, so exposing this parameter to your users might be worthwhile.
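As a minimal sketch of driving duplex over the grouped files afterwards (the model name and paths are just placeholders):

MODEL=dna_r10.4.1_e8.2_400bps_sup@v4.2.0      # placeholder model name
for POD5 in ${OUTPUT_DIR}/*.pod5; do
    # each file holds complete channels, so pairing works within a single dorado call
    dorado duplex ${MODEL} ${POD5} > $(basename ${POD5} .pod5).duplex.bam
done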

I hope this helps

Kind regards, Rich

wilsonte-umich commented 7 months ago

Pursuant to the issue above and this one on Dorado: https://github.com/nanoporetech/dorado/issues/223

For those looking to get pod5 subset and dorado duplex or dorado basecaller to run efficiently on an HPC cluster node, I've shared full documentation for standalone open-source scripts that execute optimized batched file transfer and analysis: