google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

Support FOFN original subread input, document CLI #12

Closed by maximilianpress 2 years ago

maximilianpress commented 2 years ago

Behavior I expected

Accept FOFN input for subreads (--input_subreads_unaligned=subreads.fofn).

PacBio subread data is very often spread across directories and files according to flow cell and run architecture. This is the standard layout in which service providers deliver PacBio reads to customers. The usual way to handle this is the FOFN format (see, for example, the PBMM2 documentation).
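For readers unfamiliar with the format, a FOFN is a plain-text file that lists one input path per line. The sketch below shows one way such a file could be expanded into a list of BAM paths. The function name `read_fofn` is hypothetical, and resolving relative entries against the FOFN's own directory is an assumption, not something specified in this thread.

```python
from pathlib import Path


def read_fofn(fofn_path):
    """Return the file paths listed in a FOFN, one per non-empty line."""
    fofn = Path(fofn_path)
    paths = []
    for line in fofn.read_text().splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        path = Path(entry)
        # Assumption: relative entries are interpreted relative to the
        # directory that contains the FOFN itself.
        if not path.is_absolute():
            path = fofn.parent / path
        paths.append(str(path))
    return paths
```

A tool that accepts FOFN input would typically do something like this before opening each listed BAM file in turn.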

Behavior I observed

Command-line arguments are not documented, so the expected input of --input_subreads_unaligned is unclear. When I passed a FOFN, I received an error that the file did not have a SAM header. The example workflow does show BAM input but does not otherwise describe the inputs.

Reprocessing, merging, and housekeeping for very large BAM files add notable overhead and make deepconsensus less useful.

Background

I am working with a rather large dataset (several TB) that combines multiple PacBio BAM files from different flow cells. I therefore used the common FOFN (file of file names) format as input to the PBMM2 step. Accepting FOFN is standard for PacBio tools.

Only when I reached the deepconsensus step itself did I notice that BAM appears to be the only supported format for the unaligned input subreads.

I am currently using a workaround of pbmerge from the PB toolkit to prepare a single unmapped BAM file from my subreads. This single BAM can then presumably be passed to deepconsensus.
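The workaround above can be sketched as a small helper that assembles a pbmerge command line from a list of subread BAM paths. This is only an illustration: the helper name `build_pbmerge_command` and the file names are hypothetical, and the `pbmerge -o <out.bam> <in1.bam> <in2.bam> ...` invocation is assumed from the pbbam tooling, so check `pbmerge --help` for your installed version before relying on it.

```python
import shlex


def build_pbmerge_command(bam_paths, output_bam):
    """Build an argv list that merges several BAMs into one with pbmerge."""
    if not bam_paths:
        raise ValueError("at least one input BAM is required")
    # Assumption: pbmerge takes the output path via -o, followed by the inputs.
    return ["pbmerge", "-o", output_bam, *bam_paths]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_pbmerge_command(
        ["run1/cell1.subreads.bam", "run2/cell2.subreads.bam"],
        "merged.subreads.bam",
    )
    # Print the command for review instead of running it.
    print(shlex.join(cmd))
```

The merged file would then presumably be passed as --input_subreads_unaligned=merged.subreads.bam. For multi-terabyte inputs, note that the merge writes a full second copy of the data, which is the overhead this issue is about.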

What would help

I suggest some options for addressing this issue, at various levels of effort:

MariaNattestad commented 2 years ago

Thank you for this feature request.

DeepConsensus is still at a proof-of-concept stage, but we are working on making it more scalable and easy to use outside of Google's internal infrastructure. I'll log this as a feature request in the meantime. Thanks!

pichuan commented 2 years ago

Hi @maximilianpress, we made a release in January that cleaned up the code quite a bit, and we also no longer use pbmm2. I'm going to close this issue now, but feel free to open another one if you encounter any issues with the latest release.