simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0
122 stars 77 forks source link

new SAM field 'I' #1

Closed simon-anders closed 7 years ago

simon-anders commented 8 years ago

Guillermo Parada from Sanger reported on 1 Feb 2016:

[...] I writing to you to kindly inform that dexseq_count.py is currently incompatible with the SAM output of STAR (probably the most widely used mapping tool nowadays). This is the error that I get:

Traceback (most recent call last):
    File “$/dexseq_count.py", line 187, in <module>
      for a in reader( sam_file ):
    File “$/miniconda/lib/python2.7/site-packages/HTSeq-0.6.1-py2.7-linux-x86_64.egg/HTSeq/__init__.py", line 536, in __iter__
      algnt = SAM_Alignment.from_SAM_line( line )
    File "_HTSeq.pyx", line 1311, in HTSeq._HTSeq.SAM_Alignment.from_SAM_line (src/_HTSeq.c:25196)
    File "_HTSeq.pyx", line 1180, in HTSeq._HTSeq._parse_SAM_optional_field_value (src/_HTSeq.c:23178)
 ValueError: ("SAM optional field with illegal type letter ':'", 'line 12 of file Index1/STAR.worm.results/Index1.Aligned.out.sam.C_SJ’)

I figured out that this error is due to two optional field of the STAR SAM output that are not currently supported by the HTSeq library. This is how STAR SAM format looks like:

HISEQ:98:HA4PLADXX:1:1101:1160:84026    163     chrI    15064313        255     101M    =       15064403        125     GCATCGAGTATCGATGAAGAACGCAGCTTGCTGCGTTACTTACCACGAATTGCAGACGCTTAGAGTGGTGAAATTTCGAACGCATAGCACCAACTGGGCCT   BBBFFFFFFFFFFIIFIFIFFFFFFFIBFFBFFFFIIIFFFIIIIIIFFFFIFIIIFFFFFFFFFBBB7<<BBFBBFFBFFFFBFFBBFFBBFBBFFBFBB   NH:i:1  HI:i:1  AS:i:132
          nM:i:0  NM:i:0  MD:Z:101        jM:B:c,-1       jI:B:i,-1

The two last optional fields are the responsible of this incompatibility issue. Fortunately, I can continue my work by manually removing those flags (jM:B:c,-1 and jI:B:i,-1) But I thought that was nice to report this to your group as probably other users are experiencing similar difficulties. In my experience the usage of pysam library instead of HTSeq can make your script more robust to this odd optional fields.

iosonofabio commented 7 years ago

started working on this in my branch samflag_B, it should be easy at least to ignore the flag, maybe a bit more serious to treat it properly (because it introduces a whole array of numbers so some logic might change).

iosonofabio commented 7 years ago

it should work now as implemented by commits 73e3b50 (python2) and 6bb42ff84ca5520c05e30a5fd093b309aab2a20a (python3)