magicDGS / ReadTools

A Universal Toolkit for Handling Sequence Data from Different Sequencing Platforms
https://magicdgs.github.io/ReadTools/
MIT License

ReadsToDistmap: dump sam header somewhere (feature request) #510

Closed robmaz closed 5 years ago

robmaz commented 5 years ago

Following up on this idea of working with a bam pipeline and uploading the demultiplexed bams, it would be super useful if you could also dump the SAM header in a user-specified location.

magicDGS commented 5 years ago

Thanks for the proposal. That's a really easy feature to include: it just requires a new optional boolean argument (default: false) and, before processing starts, writing the formatted header to the output path. Do you want to give it a try as your first contribution, @robmaz?
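To make the idea concrete, here is a minimal sketch of the header dump in plain Python. It is not ReadTools code and does not use its API: the function name `dump_sam_header` and the example records are hypothetical, and it assumes text SAM input in which all `@` header lines come before the first alignment record.

```python
import io


def dump_sam_header(sam_lines, header_out):
    """Write the leading '@' header lines to header_out and return how many were written."""
    count = 0
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first in SAM; stop at the first record
        header_out.write(line if line.endswith("\n") else line + "\n")
        count += 1
    return count


# Hypothetical input: a two-line header followed by one unmapped record.
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted\n",
    "@RG\tID:rg1\tSM:sample1\n",
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
buffer = io.StringIO()
n_header_lines = dump_sam_header(example_sam, buffer)
```

In a real tool the `io.StringIO` buffer would be a file opened at the user-specified location, and the dump would only run when the new boolean argument is set.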

By the way, what do you mean by the bam-pipeline?

robmaz commented 5 years ago

What I have in mind specifically is that people would demultiplex the raw data with ReadTools, using the various options to assign maximally informative read groups, and then directly submit the generated bams to distmap (I believe this was also a long-standing plan of yours). The current problem is that the conversion to the tabbed pseudo-fastq in between loses the carefully constructed header and RG info. Saving the header and then merging it back in on download would solve that problem. (The GATK pipeline suggests a similar procedure, as you probably know.)
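As a rough illustration of the "merge it back in" step, the sketch below combines two headers in plain Python. The merge policy is an assumption on my part, not something distmap or ReadTools does: it keeps every non-`@RG` line from the mapped output (so the mapper's `@SQ` lines survive) and takes the `@RG` lines from the saved header. The function name `merge_headers` and all header values are hypothetical.

```python
def merge_headers(mapped_header, saved_header):
    """Keep the mapper's non-@RG lines and append the @RG lines from the saved header."""
    kept = [line for line in mapped_header if not line.startswith("@RG\t")]
    read_groups = [line for line in saved_header if line.startswith("@RG\t")]
    return kept + read_groups


# Hypothetical header of the mapped output.
mapped_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr2L\tLN:1000",
]
# Hypothetical header saved before the conversion to pseudo-fastq.
saved_header = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@RG\tID:rg1\tSM:sample1\tBC:ACGT",
    "@RG\tID:rg2\tSM:sample2\tBC:TTGA",
]
merged = merge_headers(mapped_header, saved_header)
```

This only restores the header lines. Each read would still need an `RG` tag pointing at one of the restored IDs, which this sketch does not attempt.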

magicDGS commented 5 years ago

I can see some issues arising from this:

  1. If a multi-RG SAM/BAM file is converted to a pseudo-fastq, assigning the @RG again requires assigning the barcode again.
  2. The distmap output is a SAM/BAM file with its own header, and thus downloading something that might already have an @RG header could generate conflicts (see my concerns in https://github.com/magicDGS/ReadTools/issues/511#issuecomment-415497816).
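The kind of conflict meant in the second point can be checked mechanically. The following is a minimal sketch in plain Python, assuming a conflict means "the same @RG ID appears in both headers with different tags"; the helper names and the header values are hypothetical and not part of ReadTools or distmap.

```python
def read_groups(header_lines):
    """Map each @RG ID to a dict of its remaining tag/value pairs."""
    groups = {}
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@RG":
            continue
        tags = dict(field.split(":", 1) for field in fields[1:])
        groups[tags.pop("ID")] = tags  # ID is mandatory on @RG lines
    return groups


def conflicting_ids(header_a, header_b):
    """Return the sorted @RG IDs defined in both headers with different tags."""
    a, b = read_groups(header_a), read_groups(header_b)
    return sorted(rg_id for rg_id in a.keys() & b.keys() if a[rg_id] != b[rg_id])


# Hypothetical saved header and hypothetical header of the mapped output.
saved = ["@RG\tID:rg1\tSM:sample1", "@RG\tID:rg2\tSM:sample2"]
mapped = ["@SQ\tSN:chr2L\tLN:1000", "@RG\tID:rg1\tSM:other", "@RG\tID:rg2\tSM:sample2"]
conflicts = conflicting_ids(saved, mapped)
```

Here `rg1` is reported because its sample differs between the two headers, while `rg2` is identical in both and therefore not a conflict.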

I think a way to make this easier, and to make it happen in Distmap, would be to discuss a new format (e.g., the tfq from #404) where we can construct a distmap-specific read name containing the required information (e.g., @{{read_name}}#1:{{barcode_seq}}, which will indicate the read group index from the header), or to have two options to avoid losing the barcode/read group: @{{read_name}}#{{barcode_seq}} if no read group is present, as in a FASTQ file, and @{{read_name}}#{{rg_id}} for cases where a read group is present (although this will lose the BC tags).

We should probably have a different issue for discussing this distmap-pipeline integration, as several unconnected open issues will make it impossible to discuss in an organized way.
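For discussion purposes, a read name of the form @{{read_name}}#{{rg_id}} (or the barcode variant) could be built and parsed roughly as follows. This is a sketch in plain Python under my own assumptions: the function names are hypothetical, the suffix is taken to be everything after the last `#`, and no such format exists in ReadTools or distmap yet.

```python
def encode_name(read_name, suffix):
    """Build '@<read_name>#<suffix>', where suffix is a read group ID or a barcode."""
    if "#" in suffix:
        raise ValueError("suffix must not contain '#'")
    return f"@{read_name}#{suffix}"


def decode_name(encoded):
    """Split an encoded name back into (read_name, suffix) at the last '#'."""
    if not encoded.startswith("@") or "#" not in encoded:
        raise ValueError(f"not an encoded read name: {encoded!r}")
    read_name, suffix = encoded[1:].rsplit("#", 1)
    return read_name, suffix


# Hypothetical read name with a read group ID and with a barcode sequence.
with_rg = encode_name("read1", "rg1")
with_barcode = encode_name("read1", "ACGT")
decoded = decode_name(with_rg)
```

Splitting at the last `#` keeps the sketch working when the original read name itself contains a `#`. Note that the name alone does not say whether the suffix is a read group ID or a barcode, so a real format would need some way to distinguish the two variants.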

magicDGS commented 5 years ago

Closing in favor of #518