magicDGS / ReadTools

A Universal Toolkit for Handling Sequence Data from Different Sequencing Platforms
https://magicdgs.github.io/ReadTools/
MIT License

Better support for read groups in distmap integration #518

Open magicDGS opened 6 years ago

magicDGS commented 6 years ago

@robmaz would like to better integrate ReadTools' support for all of its input formats (FASTQ/SAM/BAM/CRAM so far) into the distmap pipeline. The current pipeline is the following (all called within the distmap software):

To make it possible to round-trip reads->distmap->reads and keep the read group information from the original reads, there are several proposals under discussion:

  1. Only allow one read group on download (suggested here: https://github.com/magicDGS/ReadTools/issues/511#issuecomment-415712396) and fail otherwise. This is somewhat inconsistent, because we allow uploading/transforming reads from multiple @RG but not downloading them when the read group information has to be retrieved. This is the option that requires the least effort, as it will just fail for multiple @RG and assign the single one otherwise. Still, it will need some rules for merging the rest of the header fields (unless @RG lines are the only header lines allowed, apart from the version one).
  2. Integrate a new distmap format which supports adding barcodes to the read name if no @RG is present (@{{read_name}}#{{barcode_seq}}) or the read-group id/index (@{{read_name}}#{{rg_id}} or @{{read_name}}#{{rg_idx}}), which can be parsed afterwards. Some complications might arise from this: 1) the same version of ReadTools is always required for upload/download; 2) @RG handling is unsupported for the legacy distmap format; 3) a header is required while downloading if the ID/idx was used; 4) the raw barcode information is lost if only the RG is handled. Nevertheless, this was just a first draft; it can be modified to address these issues and discussed with @robmaz
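
To make the read-name scheme in option 2 concrete, here is a minimal Python sketch of the encode/parse round trip. It is only an illustration: `encode_name`, `decode_name` and `resolve_rg_index` are hypothetical names, not part of ReadTools or distmap, and the sketch assumes the tag is always the text after the last `#`.

```python
from typing import List, Optional, Tuple


def encode_name(read_name: str, tag: str) -> str:
    """Append a barcode sequence or a read-group id/index after a '#'.

    Hypothetical helper: the tag itself must not contain '#', otherwise
    it could not be recovered unambiguously on download.
    """
    if "#" in tag:
        raise ValueError("tag must not contain '#'")
    return f"{read_name}#{tag}"


def decode_name(encoded: str) -> Tuple[str, Optional[str]]:
    """Split an encoded name into (read_name, tag).

    The split happens at the LAST '#', so a read name that already
    contains '#' survives the round trip. A name without any '#' is
    returned with tag None.
    """
    name, sep, tag = encoded.rpartition("#")
    if not sep:
        return encoded, None
    return name, tag


def resolve_rg_index(tag: str, header_rg_ids: List[str]) -> str:
    """Map a read-group index tag back to an @RG ID.

    This needs the list of @RG IDs from the original header, which is
    complication 3) above.
    """
    return header_rg_ids[int(tag)]
```

Note that a name which already contains `#` but was never tagged cannot be told apart from a tagged one by the parser alone, so the downloader has to know which format version produced the file (complication 1).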

I think that a quick implementation of option 1 is good for having this support to some extent, with a warning on upload and an error on download for more than one RG in the header file (saying that this limitation might be removed in the future), and then evolve the new format for distmap (#404) to contain the read group information and maybe some arbitrary information. Another option is to change distmap to use the map-reduce code from Hadoop-BAM to split the input file, and completely remove the need for the custom distmap format.
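
As a rough sketch of the option 1 behaviour (warn on upload, fail on download), the check could look like the Python below. The function names are hypothetical and the header is handled as plain SAM header text; the only assumption taken from the SAM format is that `@RG` lines are tab-separated and carry an `ID:` field. A real implementation would use the header classes ReadTools already works with.

```python
import warnings
from typing import List, Optional


def read_group_ids(header_text: str) -> List[str]:
    """Collect the ID of every @RG line in a SAM-style text header."""
    ids = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@RG":
            continue
        for field in fields[1:]:
            if field.startswith("ID:"):
                ids.append(field[3:])
    return ids


def check_upload(header_text: str) -> List[str]:
    """Upload side: only warn when more than one read group is present."""
    ids = read_group_ids(header_text)
    if len(ids) > 1:
        warnings.warn(
            f"{len(ids)} read groups found; read group information "
            "cannot be recovered on download (this limitation might "
            "be removed in the future)"
        )
    return ids


def single_read_group_for_download(header_text: str) -> Optional[str]:
    """Download side: return the single @RG ID, None if there is no
    read group, and fail if the header has more than one."""
    ids = read_group_ids(header_text)
    if len(ids) > 1:
        raise ValueError(
            f"expected at most one read group, found {len(ids)}: {ids}"
        )
    return ids[0] if ids else None
```

The sketch leaves open how the remaining header fields would be merged, which still needs its own rules.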

magicDGS commented 6 years ago

@robmaz - let's discuss the requirements for this integration with distmap here instead of in independent issues. We can re-open or create new issues for the required simple components later, once we decide on the design.