jdidion / atropos

An NGS read trimming tool that is specific, sensitive, and speedy. (production)

--output-format sam discards all SAM tags !? #103

Closed plijnzaad closed 4 years ago

plijnzaad commented 4 years ago

When running atropos on (unaligned) SAM input that contains SAM tags and writing the output in SAM format, all SAM tags disappear! This is very unfortunate, because the only real reason to use unaligned SAM instead of FASTQ is precisely the ability to keep extra information per read (which is impossible with FASTQ).

The specific use case here is single-cell RNA sequencing, where we use the CR:Z, CY:Z, CB:Z, UR:Z and UY:Z tags to store cell-of-origin and UMI information.
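To make the layout concrete, here is a minimal sketch of where such tags sit in an unaligned SAM record. The read name, sequence, and all tag values are made up for illustration; only the tag names come from the use case above.

```shell
# Hypothetical unaligned SAM record; the read name, sequence, and tag values are made up.
record=$(printf 'read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF\tCR:Z:AAAC\tCY:Z:FFFF\tCB:Z:AAAC-1\tUR:Z:GGTT\tUY:Z:FFFF')

# Fields 1-11 are mandatory; each field from 12 onward is an optional TAG:TYPE:VALUE triple.
tags=$(printf '%s\n' "$record" | awk -F '\t' '{
  for (i = 12; i <= NF; i++) {
    split($i, part, ":")   # part[1] = tag, part[2] = type, part[3] = value (colon-free here)
    printf "%s=%s\n", part[1], part[3]
  }
}')
echo "$tags"
```

Everything from column 12 onward is what gets lost when the tags are dropped; the eleven mandatory columns carry no cell or UMI information.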

Can the tags please be preserved? Many thanks!

Philip

plijnzaad commented 4 years ago

A workaround would be as follows:

# extract selected tags (could be fewer or more) from the bamfile:
tagsfile=${bamfile/.bam/.tags}
samtools view "$bamfile" \
  | awk -F "\t" -v OFS="\t" '{ print $12,$13,$14,$15,$16,$17,$18 }' > "$tagsfile"

# run atropos with bam in- and output, adding back the tags from $tagsfile
# (paste pairs lines by position, so this relies on atropos emitting every read, in input order)
atropos trim --adapter "$adapter" "$bamfile" \
        --input-format bam --output-format sam OTHERARGS \
  | grep -v '^@' \
  | paste -d "\t" - "$tagsfile" \
  | awk -F "\t" -v OFS="\t" '$10 != "" && $10 != "*"' \
  | samtools view -b - > "${bamfile/.bam/-trimmed.bam}"

The grep -v line works around issue #101; the second awk command removes reads that have been trimmed to length 0.
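Because paste joins the two streams purely by line position, a dropped or reordered read would silently attach tags to the wrong record. One way to guard against that, sketched here on synthetic data, is to keep the read name as the first column of the tags file and compare it against the SAM read name while merging. The file names, read names, and tag values below are hypothetical, and the sketch assumes the trimmed records carry exactly the 11 mandatory fields (that is, all tags were lost).

```shell
# Synthetic stand-ins for the real files; all names and values are made up.
workdir=$(mktemp -d)
# tags file with the read name kept as its first column
printf 'read1\tCB:Z:AAAC-1\tUR:Z:GGTT\nread2\tCB:Z:CCGT-1\tUR:Z:TTAA\n' > "$workdir/tags.tsv"
# trimmed, header-free SAM records with 11 fields each (tags lost)
printf 'read1\t4\t*\t0\t0\t*\t*\t0\t0\tACG\tFFF\nread2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tFFFF\n' > "$workdir/trimmed.sam"

# Pair the files line by line; field 12 is then the read name from the tags file.
# Abort on a missing or mismatching name, otherwise print the record without field 12.
merged=$(paste "$workdir/trimmed.sam" "$workdir/tags.tsv" | awk -F '\t' -v OFS='\t' '
  $1 == "" || $1 != $12 { print "read name mismatch at line " NR > "/dev/stderr"; exit 1 }
  { line = $1; for (i = 2; i <= NF; i++) if (i != 12) line = line OFS $i; print line }')
echo "$merged"
rm -r "$workdir"
```

In the real pipeline the tags file would be produced by printing $1 in front of the tag columns, at the cost of one extra column on disk.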

jdidion commented 4 years ago

Fixed in develop. Will be released in alpha6.

There are also new --remove-sam-tags and --keep-sam-tag options for filtering out all/specific SAM tags.