Open magicDGS opened 6 years ago
@robmaz - let's discuss the requirements for this integration with distmap here instead of in independent issues. We can re-open or create new issues for the required simple components later, once we decide on the design.
@robmaz would like to integrate the capabilities of ReadTools for any supported format (FASTQ/SAM/BAM/CRAM for now) into the distmap pipeline in a better way. The current pipeline is the following (all called within the distmap software):
1. `ReadTools ReadsToDistmap`: uploads the reads (with optional trimming) to HDFS in the compact and splittable distmap format. The current implementation only keeps the barcodes in the read name, but if barcode de-multiplexing has already been performed, keeping the read groups (`@RG`) is desirable to mark the reads properly. One suggestion is to dump the header with the `@RG` to use later on download (see below and #510), but this will cause problems if multiple read groups are present, as reads cannot be re-assigned without the full de-multiplexing run.
2. [...] `@RG` header lines) - this is one of the limiting factors out of our control.
3. `ReadTools DownloadDistmapResult`: downloads the part files (SAM/BAM) from HDFS and merges them into a combined file on the local path. It would be nice to provide a SAM header with read groups (or a master SAM header with more information) to be merged with the ones downloaded from the distmap run (requested in #511), but this is not trivial, as it needs specific rules and requires re-assigning the read group of each read (as in the first step).

To make it possible to round-trip reads->distmap->reads and keep the read group information from the original reads, there are several proposals under discussion:
1. [...] `@RG` but not download them if we want to retrieve the information. This is the option that requires the least effort, as it will just fail for multiple `@RG` and assign the single one otherwise. Still, it will need some rules to merge the rest of the header fields (unless the `@RG` lines are the only header lines allowed, apart from the version one).
2. [...] `@RG` is present (`@{{read_name}}#{{barcode_seq}}`) or read-group id/index (`@{{read_name}}#{{rg_id}}` or `@{{read_name}}#{{rg_idx}}`), which can be parsed afterwards. Some complications might arise from this: 1) the same version of ReadTools is always required for upload and download; 2) unsupported `@RG` handling for the legacy distmap format; 3) a header is required while downloading if the ID/idx was used; 4) loss of raw-barcode information if only the RG is handled. Nevertheless, this was just a first draft; it can be modified to address these issues and discussed with @robmaz.

I think that a quick implementation of option 1 is good to have this support to some extent, with a warning on upload and an error on download for more than 1 RG in the header file (saying that this limitation might be removed in the future), and then evolve the new format for distmap (#404) to contain information for the read group and maybe some arbitrary information. Another option is to change distmap to use the map-reduce code from Hadoop-BAM to split the input file, and completely remove the need for the distmap custom format.
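
To make the option 1 rule concrete (warning on upload, error on download when the header has more than one `@RG`), here is a rough sketch in Python working on a plain-text SAM header. It is only an illustration, not ReadTools code: the function names and the `on_upload` flag are hypothetical, and returning no read group after the upload warning is an assumption about what "warning on upload" would mean in practice.

```python
import warnings


def read_group_lines(header_text):
    """Return the @RG lines of a SAM text header, in their original order."""
    return [line for line in header_text.splitlines() if line.startswith("@RG\t")]


def single_read_group(header_text, on_upload):
    """Return the single @RG line of the header, or None if there is none.

    With more than one @RG: warn and return None on upload (assumption),
    raise ValueError on download.
    """
    groups = read_group_lines(header_text)
    if len(groups) > 1:
        message = (
            "%d read groups found; only one is supported for now "
            "(this limitation might be removed in the future)" % len(groups)
        )
        if on_upload:
            warnings.warn(message)
            return None
        raise ValueError(message)
    return groups[0] if groups else None
```

The remaining header fields would still need merge rules; the sketch only covers the read-group count check.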
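
For the read-name encoding discussed above (`#` followed by the barcode sequence or the read-group id/index), a minimal round-trip sketch could look like the following. The `#` separator comes from the templates in this issue; everything else is an assumption for illustration: the function names are hypothetical, splitting at the last `#` is a guess at how ambiguous names would be handled, and the leading `@` of the FASTQ record is left out.

```python
def tag_read_name(read_name, tag):
    """Append a barcode sequence or read-group id/index after a '#' separator."""
    return "%s#%s" % (read_name, tag)


def split_read_name(tagged_name):
    """Split a tagged name at the last '#'; the tag is None if there is no '#'."""
    name, sep, tag = tagged_name.rpartition("#")
    if not sep:
        return tagged_name, None
    return name, tag


def read_group_from_index(tag, read_group_ids):
    """Resolve an index tag to a read-group id; needs the ids from the header."""
    return read_group_ids[int(tag)]
```

The last function shows why complication 3 arises: an id or index stored in the read name can only be turned back into a read group if the header is available at download time.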