simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0
122 stars 77 forks source link

Bug in samouts when `-f bam` #65

Closed adomingues closed 4 years ago

adomingues commented 6 years ago

Hi all,

I have have been trying to use htseq-count with the option --samout in my pipelines, and while testing it came upon what looks luke a bug: when usinf both -samout and f bam the output sam consists of only the tags, but not the alignments:

module load htseq/0.9.0
module load samtools/1.5
ESSENTIAL_GENES="/fsimb/groups/imb-kettinggr/genomes/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/genes_and_transposons.WBcel235.38.gtf"
aln="N2_wt_rep1_untreated.bam"
exp=$(echo $(basename $aln) | sed 's/.bam//')
echo $exp

htseq-count --samout=${exp}.test2.sam -s yes -f bam -m intersection-nonempty ${aln} ${ESSENTIAL_GENES} > ${exp}.counts

head ${exp}.test2.sam
        XF:Z:__too_low_aQual
        XF:Z:__no_feature
        XF:Z:__no_feature
        XF:Z:WBGene00023193
        XF:Z:WBGene00023193
        XF:Z:__no_feature
        XF:Z:WBGene00023193
        XF:Z:WBGene00023193
        XF:Z:WBGene00023193
        XF:Z:WBGene00023193

However, piping the alignments to htseq-count returns the expected output:

samtools view -h ${aln} | htseq-count --samout=${exp}.test1.sam -s yes -f sam -m intersection-nonempty - ${ESSENTIAL_GENES} > ${exp}.counts

head ${exp}.test1.sam
NB501946:201:H3HMGAFXY:1:21112:24049:15895      16      I       3115    0       22M     *       0      0TGATGTTCTACGCTTAAATTTT  EEEEEEEEEEEEEEEEEEEEEA  XA:i:0  MD:Z:22 NM:i:0  XM:i:2  XF:Z:__too_low_aQual
NB501946:201:H3HMGAFXY:2:11202:12920:18180      16      I       3685    255     21M     *       0      0TATCTACTAGGAATAACTCGA   EEEEEEEEEEEEEEEEEEEEA   XA:i:0  MD:Z:21 NM:i:0  XF:Z:__no_feature
NB501946:201:H3HMGAFXY:2:21105:7328:11769       0       I       3738    255     21M     *       0      0TGTAAAATAGAGGATCAGACC   AAEEEEEEEAAEE6EEEEEEE   XA:i:0  MD:Z:21 NM:i:0  XF:Z:__no_feature
NB501946:201:H3HMGAFXY:2:11112:5811:6494        16      I       3746    255     21M     *       0      0AGAGGATCAGACCCAAAATTC   EEEEEEEEEEEEEEEEEEEEA   XA:i:0  MD:Z:21 NM:i:0  XF:Z:WBGene00023193
NB501946:201:H3HMGAFXY:2:21204:5691:10822       16      I       3746    255     22M     *       0      0AGAGGATCAGACCCAAAATTCA  EEEEEEEEEEEEEAEEEEEEEA  XA:i:0  MD:Z:22 NM:i:0  XF:Z:WBGene00023193
NB501946:201:H3HMGAFXY:1:21209:12787:3406       0       I       3747    255     21M     *       0      0GAGGATCAGACCCAAAATTCA   AEEEEEEEEEEEEEEEEEEEE   XA:i:0  MD:Z:21 NM:i:0  XF:Z:__no_feature
NB501946:201:H3HMGAFXY:4:11512:6158:17507       16      I       3747    255     21M     *       0      0GAGGATCAGACCCAAAATTCA   EEEEEEEEEEEEEEEEEEEEA   XA:i:0  MD:Z:21 NM:i:0  XF:Z:WBGene00023193
NB501946:201:H3HMGAFXY:3:11503:17322:16736      16      I       3747    255     20M     *       0      0GAGGATCAGACCCAAAATTC    EEEEEEEEEEEEEEEEEEEA    XA:i:0  MD:Z:20 NM:i:0  XF:Z:WBGene00023193
NB501946:201:H3HMGAFXY:3:21403:11554:1616       16      I       3747    255     24M     *       0      0GAGGATCAGACCCAAAATTCAGCC        EEEAAEAEEEEE6EEEEEEEEEEA        XA:i:0  MD:Z:24 NM:i:0  XF:Z:WBGene00023193
NB501946:201:H3HMGAFXY:3:11607:5533:12910       16      I       3747    255     31M     *       0      0GAGGATCAGACCCAAAATTCAGCCCGCGAAG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA XA:i:0  MD:Z:31 NM:i:0  XF:Z:WBGene00023193

Cheers, António

adomingues commented 6 years ago

Addendum: the bug is also present in version 0.10.0

coolmak32 commented 5 years ago

Hi Domingues, I am having the same output as you mentioned in the first screenshot. I am running a BAM file. htseq-count -m union -f bam -s no -r name ALZT22-2Cunsorted.bam geneassembly.gff3 -o countread.text (BAM file is name sorted )

The output it generates is the following: XF:Z:ambiguous[ENSMUSG00000098178.1+ENSMUSG00000106106.2] XF:Z:ambiguous[ENSMUSG00000098178.1+ENSMUSG00000106106.2] XF:Z:alignment_not_unique XF:Z:alignment_not_unique XF:Z:alignment_not_unique XF:Z:alignment_not_unique XF:Z:no_feature XF:Z:__no_feature XF:Z:alignment_not_unique XF:Z:__alignment_not_unique

When I ran the SAM file, I received the following output:

chr1 3206084 255 1S139M = 3206084 -139 NTACAGTTAACCAACTTATACAGTTAACCAACTCCTACACTAGGTTCCTGAGCATTTCCTTAAACTTGCTAGTTCTGGTTTCCTGGCATGTGAGAGTAAGTCACATGGTAGGAGGCTGCCTTTCTATCJJJFJJFJJJAJFJJJFJFFJFJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJFJJJJJA<<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFFJJJJJFJFJJJJJJJJFJJJJJJJJJJJJJJAFAAA NH:i:1 HI:i:1 AS:i:276 nM:i:0 XF:Z:ENSMUSG00000051951.5 GWNJ-0965:181:GW180227920:7:2124:9881:65265 163 chr1 3206084 255 139M1S = 3206084 139 TACAGTTAACCAACTTATACAGTTAACCAACTCCTACACTAGGTTCCTGAGCATTTCCTTAAACTTGCTAGTTCTGGTTTCCTGGCATGTGAGAGTAAGTCACATGGTAGGAGGCTGCCTTTCTATCATTCAATTTTAGN

Is there something somewhere I am wrong at?

Thanks

adomingues commented 5 years ago

Hi @coolmak32 ,

I think your command is missing this bit --samout=${exp}.test1.sam that saves the annotated alignments in the file ${exp}.test1.sam. Because this is not set that output is being redirected to the standard out.

@iosonofabio any news on when this issue could get fixed?

Cheers, António

simon-anders commented 5 years ago

HI Fabio

If I remember correctly, I used "original_sam_line" for the "sam_out" function, and that works only for sam. We should check. For bam, we should use 'get_sam_line".

Simon

iosonofabio commented 5 years ago

@adomingues are the count files the same? Also I am quite busy these days, if you could make a PR that'd speed this up significantly.

adomingues commented 5 years ago

@iosonofabio sorry for the late reply, but I am also quite overwhelmed these days.

are the count files the same?

As far as I can tell yes, but to be honest I barely looked at them. My goal was not to count reads, but rather annotate them for downstream processing.

if you could make a PR that'd speed this up significantly.

The priority for me now changed and I am no longer pursuing this annotation strategy in my pipeline. That said, I might have sometime to look into this in about a month.

iosonofabio commented 4 years ago

Stale for a long time, "-f bam" is deprecated now since the underlying library has autodetection anyway so it should not be a problem as stated in this issue anyway.