simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0
122 stars 77 forks source link

0 BAM alignments processed #97

Closed kscott94 closed 4 years ago

kscott94 commented 4 years ago

Running htseq-count, but won't process bam file. argument: htseq-count -f bam -r pos ./paired_end_sorted.bam ./annotations.gff

samtools view -c indicates over 8 million reads in bam file.

Here is the head and tail output from htseq-count 2358 GFF lines processed. 0 BAM alignments processed. . . . TK2302 0 TK2303 0 TK2304 0 TK2305 0 TK2306 0 no_feature 0 ambiguous 0 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0

head and tail of bam file produced by BSseeker2/bowtie2:

kscott94:~$ samtoools view -h paired_end_sorted.bam | head

@HD VN:1.0 SO:coordinate @SQ SN:TS559_Genomic_Sequence.seq LN:2087105 @PG PN:BS Seeker 2 ID:1 CL:/projects/kscott94@colostate.edu/tools/BSseeker2/bs_seeker2-align.py -i ./TK0008_exo_depl_cat_filtered.fastq -g /projects/kscott94@colostate.edu/genome/TS559_genome.fa --temp_dir=/scratch/summit/kscott94@colostate.edu/tmp -m 2 --XS=0.03,3 --bt2--mm --bt-p 8 --aligner=bowtie2 -p /projects/kscott94@colostate.edu/tools/bowtie2/ NS500355:NS500355:H5JJFAFXY:1:11203:24804:3180 0 TS559_Genomic_Sequence.seq 1 255 76M 0 0 ATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGAAAATGGTGAGT XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-----------x--x---- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGAAAACGGCGAGT_TT NS500355:NS500355:H5JJFAFXY:1:21208:8349:10740 0 TS559_Genomic_Sequence.seq 1 255 2S74M 0 0 AGATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGAAAATGGTGA XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-----------x--x-- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGAAAACGGCGA_GT NS500355:NS500355:H5JJFAFXY:2:11104:14810:18979 0 TS559_Genomic_Sequence.seq 1 255 11S65M 0 0 GGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGA XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-------- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGA_AA . . . NS500355:NS500355:H5JJFAFXY:2:21206:18165:9780 0 TS559_Genomic_Sequence.seq 2087060 255 46M30S 0 0 TTAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAGTTGAG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----------zy---------z--z-------------------- XG:Z:GT_TTAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN NS500355:NS500355:H5JJFAFXY:3:21402:17601:18156 0 TS559_Genomic_Sequence.seq 2087060 255 46M30S 0 0 TTAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----------zy---------z--z-------------------- XG:Z:GT_TTAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN NS500355:NS500355:H5JJFAFXY:4:21405:22021:1321 0 TS559_Genomic_Sequence.seq 2087061 255 45M31S 0 0 TAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAGG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:----------zy---------z--z-------------------- XG:Z:TT_TAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN

head of gff file

gff-version 3

TS559_Genomic_Sequence.seq manual region 1 2088737 . + . gene_id=id0;Dbxref=taxon:69014;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;note=synonym: Thermococcus kodakaraensis TS559;strain=TS559 TS559_Genomic_Sequence.seq BLATSn exon 1 5016 . + 0 gene_id=TK0001;gene_biotype=protein_coding;locus_tag=TK0001 TS559_Genomic_Sequence.seq BLATSn exon 5086 5733 . + 0 gene_id=TK0002;gene_biotype=protein_coding;locus_tag=TK0002 TS559_Genomic_Sequence.seq BLATSn exon 5730 6119 . + 0 gene_id=TK0003;gene_biotype=protein_coding;locus_tag=TK0003 TS559_Genomic_Sequence.seq BLATSn exon 6079 6528 . - 0 gene_id=TK0004;gene_biotype=protein_coding;locus_tag=TK0004 TS559_Genomic_Sequence.seq BLATSn exon 6586 7014 . + 0 gene_id=TK0005;gene_biotype=protein_coding;locus_tag=TK0005 TS559_Genomic_Sequence.seq BLATSn exon 7152 7427 . - 0 gene_id=TK0006;gene_biotype=protein_coding;locus_tag=TK0006

iosonofabio commented 4 years ago

Please try the name suited file and let me know

On Wed, Mar 4, 2020, at 04:41, Kristin Scott wrote:

Running htseq-count, but won't process bam file. argument: htseq-count -f bam -r pos ./paired_end_sorted.bam ./annotations.gff

samtools view -c indicates over 8 million reads in bam file.

Here is the head and tail output from htseq-count 2358 GFF lines processed. 0 BAM alignments processed. . . . TK2302 0 TK2303 0 TK2304 0 TK2305 0 TK2306 0 no_feature 0 ambiguous 0 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0

head and tail of bam file produced by BSseeker2/bowtie2:

kscott94:~$ samtoools view -h paired_end_sorted.bam | head

@hd https://github.com/hd VN:1.0 SO:coordinate @sq https://github.com/sq SN:TS559_Genomic_Sequence.seq LN:2087105 @pg https://github.com/pg PN:BS Seeker 2 ID:1 CL:/projects/kscott94@colostate.edu/tools/BSseeker2/bs_seeker2-align.py -i ./TK0008_exo_depl_cat_filtered.fastq -g /projects/kscott94@colostate.edu/genome/TS559_genome.fa --temp_dir=/scratch/summit/kscott94@colostate.edu/tmp -m 2 --XS=0.03,3 --bt2--mm --bt-p 8 --aligner=bowtie2 -p /projects/kscott94@colostate.edu/tools/bowtie2/ NS500355:NS500355:H5JJFAFXY:1:11203:24804:3180 0 TS559_Genomic_Sequence.seq 1 255 76M 0 0 ATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGAAAATGGTGAGT XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-----------x--x---- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGAAAACGGCGAGT_TT NS500355:NS500355:H5JJFAFXY:1:21208:8349:10740 0 TS559_Genomic_Sequence.seq 1 255 2S74M 0 0 AGATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGAAAATGGTGA XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-----------x--x-- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGAAAACGGCGA_GT NS500355:NS500355:H5JJFAFXY:2:11104:14810:18979 0 TS559_Genomic_Sequence.seq 1 255 11S65M 0 0 GGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGA XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-------- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGA_AA . . . NS500355:NS500355:H5JJFAFXY:2:21206:18165:9780 0 TS559_Genomic_Sequence.seq 2087060 255 46M30S 0 0 TTAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAGTTGAG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----------zy---------z--z-------------------- XG:Z:GT_TTAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN NS500355:NS500355:H5JJFAFXY:3:21402:17601:18156 0 TS559_Genomic_Sequence.seq 2087060 255 46M30S 0 0 TTAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----------zy---------z--z-------------------- XG:Z:GT_TTAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN NS500355:NS500355:H5JJFAFXY:4:21405:22021:1321 0 TS559_Genomic_Sequence.seq 2087061 255 45M31S 0 0 TAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAGG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:----------zy---------z--z-------------------- XG:Z:TT_TAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN

head of gff file

gff-version 3

TS559_Genomic_Sequence.seq manual region 1 2088737 . + . gene_id=id0;Dbxref=taxon:69014;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;note=synonym: Thermococcus kodakaraensis TS559;strain=TS559 TS559_Genomic_Sequence.seq BLATSn exon 1 5016 . + 0 gene_id=TK0001;gene_biotype=protein_coding;locus_tag=TK0001 TS559_Genomic_Sequence.seq BLATSn exon 5086 5733 . + 0 gene_id=TK0002;gene_biotype=protein_coding;locus_tag=TK0002 TS559_Genomic_Sequence.seq BLATSn exon 5730 6119 . + 0 gene_id=TK0003;gene_biotype=protein_coding;locus_tag=TK0003 TS559_Genomic_Sequence.seq BLATSn exon 6079 6528 . - 0 gene_id=TK0004;gene_biotype=protein_coding;locus_tag=TK0004 TS559_Genomic_Sequence.seq BLATSn exon 6586 7014 . + 0 gene_id=TK0005;gene_biotype=protein_coding;locus_tag=TK0005 TS559_Genomic_Sequence.seq BLATSn exon 7152 7427 . - 0 gene_id=TK0006;gene_biotype=protein_coding;locus_tag=TK0006

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97?email_source=notifications&email_token=AAJFEAEAGSMVYTPZNNSYWLDRFU6MPA5CNFSM4LAPRJF2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4ISDIAMA, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEAAJW54XGSWGSBML4W3RFU6MPANCNFSM4LAPRJFQ.

kscott94 commented 4 years ago

Sorted by name. Same result. 0 BAM alignments processed.

iosonofabio commented 4 years ago

Thanks for trying. I see you are using a GFF instead of a more canonical GTF. Or is it a GTF with the wrong name?

It'd be great if you could try with a small subsample (say 10,000 reads). If that does not work, happy to have a look at the subsample BAM and the GTF and try running it on my machine, just put them on google drive and share the link.

kscott94 commented 4 years ago

Thanks for helping me trouble shoot. Here is link to google drive: https://drive.google.com/open?id=1P6Dw8mKBIZVpzeqlPGZzIDcMlp9DJgQP

This should have the gff file and subsetted bam file.

iosonofabio commented 4 years ago

It works from master branch on my laptop, here the end of the file:

tRNA-Leu-6      0
tRNA-Lys-1      1
tRNA-Lys-2      0
tRNA-Met-1      2
tRNA-Met-2      0
tRNA-Met-3      0
tRNA-Phe        0
tRNA-Pro-1      0
tRNA-Pro-2      0
tRNA-Pro-3      0
tRNA-Ser-1      0
tRNA-Ser-2      0
tRNA-Ser-3      0
tRNA-Ser-4      0
tRNA-Thr-1      0
tRNA-Thr-2      0
tRNA-Thr-3      0
tRNA-Trp        5
tRNA-Tyr        0
tRNA-Val-1      0
tRNA-Val-2      0
tRNA-Val-3      0
tRNA-Val-4      0
__no_feature    290
__ambiguous     270
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  0

You can try cloning the master branch from git and see. I build it with:

python setup.py build

and then run it with:

PYTHONPATH=build/lib.linux-x86_64-3.8:$PYTHONPATH python ./build/lib.linux-x86_64-3.8/HTSeq/scripts/count.py example_data/name_sorted_subsample_paired.bam example_data/TS559_manual_annotation.gff
kscott94 commented 4 years ago

That did the trick! Thank you!

iosonofabio commented 4 years ago

Great! Do you mind if I anonymyze the data and keep it in the repo for testing?

Thank you Fabio

On Sun, Mar 8, 2020, at 16:08, Kristin Scott wrote:

That did the trick! Thank you!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97?email_source=notifications&email_token=AAJFEAEHTJMRPPFUNUOYIJTRGMR45A5CNFSM4LAPRJF2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEOEMWQA#issuecomment-596167488, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEAD345AASNCMKVHX2TDRGMR45ANCNFSM4LAPRJFQ.

kscott94 commented 4 years ago

Sure. No problem!

iosonofabio commented 4 years ago

Btw just discovered the problem. Your file is a SAM, not a BAM. Also, your annotation file is GTF, not GFF3, as far as I can tell. So that's that, make sure you know your file formats.

Still, htseq should give an error, so I'll look into that.

kscott94 commented 4 years ago

When I Subsetted, I converted it to a Sam. Sorry I forgot the change the extension.

Also GTF and gff3 are the same file format. What I sent you were the latest formats that I tried.

kscott94 commented 4 years ago

The problem was solved when I called htseq-count from the count.py script instead of htseq-count as a stand alone argument.

iosonofabio commented 4 years ago

Thank you. That's basically just a link, so it can't be the problem.

Anyway I contacted upstream at pysam and the issue is clear, I'll adapt the docs tonight.

On Mon, Mar 9, 2020, at 00:54, Kristin Scott wrote:

The problem was solved when I called htseq-count from the count.py script instead of htseq-count

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97?email_source=notifications&email_token=AAJFEACJWWKZMC7BVSE63ZDRGOWS5A5CNFSM4LAPRJF2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEOEWVEA#issuecomment-596208272, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEABVAN5O2M24Z54CMMLRGOWS5ANCNFSM4LAPRJFQ.

rached-97 commented 4 years ago

I am experiencing a similar issue with htseq-count: "0 BAM alignments processed". My file is definitely in BAM format, coordinate sorted, single end, non-stranded library. I acquired my GTF file from Ensembl (Homo_sapiens.GRCh37.87.gtf.gz). Chromosome names have the same format in both BAM and GTF.

Here is my htseq-count command:

htseq-count -f bam -r pos -s no -a 30 -t exon -i gene_id -m intersection-nonempty BAM GTF

Numerous reads in the bam pass mapq 30 threshold.

Software versions:

python 3.8.0 htseq 0.11.3 pysam 0.15.4 numpy 1.17.4 samtools 1.10

First few reads from BAM (I replaced bases with Ns for privacy reasons. Notice that base qualities have been removed to conserve disk space. BAM passes samtools quickcheck):

@HD VN:1.0 SO:coordinate @PG ID:STAR PN:STAR VN:STAR_2.3.0e_r291 cl:STAR --chimSegmentMin 15 --chimJunctionOverhangMin 15 --outSAMstrandField intronMotif --genomeDir /sc/orga/projects/PBG/REFERENCES/hg19/star/2.4.0d/Homo_sapiens.GRCh37.ensembl70.overhang75bp --sjdbGTFfile /sc/orga/projects/PBG/REFERENCES/hg19/ensembl/Homo_sapiens.GRCh37.70.processed.gtf --runThreadN 10 --outReadsUnmapped fastx --outStd SAM --outSAMmode Full --outFileNamePrefix /sc/orga/scratch/shahh06/GCF/outgoing/ProductionQC/QC_S111.B394_Brain_RZero_Venu_2014_Nov.SE.RNASeqRibozero.RAPiD.Human/BM_10_548/Processed/RAPiD.2_0_0/star/accepted_hits --readFilesCommand zcat --readFilesIn /sc/orga/scratch/shahh06/GCF/outgoing/ProductionQC/QC_S111.B394_Brain_RZero_Venu_2014_Nov.SE.RNASeqRibozero.RAPiD.Human/BM_10_548/Raw/RNA.IlluminaHiSeq2500.RiboZero/BM_10_548_ATCACG_L007_R1_001.C6A5BANXX.fastq.gz CL:STAR --runThreadN 10 --genomeDir /sc/orga/projects/PBG/REFERENCES/hg19/star/2.4.0d/Homo_sapiens.GRCh37.ensembl70.overhang75bp --readFilesIn /sc/orga/scratch/shahh06/GCF/outgoing/ProductionQC/QC_S111.B394_Brain_RZero_Venu_2014_Nov.SE.RNASeqRibozero.RAPiD.Human/BM_10_548/Raw/RNA.IlluminaHiSeq2500.RiboZero/BM_10_548_ATCACG_L007_R1_001.C6A5BANXX.fastq.gz --readFilesCommand zcat --outFileNamePrefix /sc/orga/scratch/shahh06/GCF/outgoing/ProductionQC/QC_S111.B394_Brain_RZero_Venu_2014_Nov.SE.RNASeqRibozero.RAPiD.Human/BM_10_548/Processed/RAPiD.2_0_0/star/accepted_hits --outStd SAM --outSAMmode Full --outSAMstrandField intronMotif --outReadsUnmapped fastx --chimSegmentMin 15 --chimJunctionOverhangMin 15 --sjdbGTFfile /sc/orga/projects/PBG/REFERENCES/hg19/ensembl/Homo_sapiens.GRCh37.70.processed.gtf @PG ID:samtools PN:samtools PP:STAR VN:1.10 CL:samtools view -h -T /scratch/USER/GRCh37.p13.genome.fa -b /project/0000000/USER/MSBB_synapse/bam_to_cram/BM_10_548.accepted_hits.sort.coord.bam.cram @PG ID:samtools.1 PN:samtools PP:samtools VN:1.10 CL:samtools view -h BM_10_548.accepted_hits.sort.coord.bam @SQ SN:chr1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr3 LN:198022430 M5:641e4338fa8d52a5b781bd2a2c08d3c3 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr9 LN:141213431 M5:3e273117f15e0a400f01055d9f393768 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr10 LN:135534747 M5:988c28e000e84c26d552359af1ea2e1d UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr11 LN:135006516 M5:98c59049a2df285c76ffb1c6db8f8b96 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr12 LN:133851895 M5:51851ac0e1a115847ad36449b0015864 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr13 LN:115169878 M5:283f8d7892baa81b510a015719ca7b0b UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr14 LN:107349540 M5:98f3cae32b2a2e9524bc19813927542e UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr15 LN:102531392 M5:e5645a794a8238215b2cd77acb95a078 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr16 LN:90354753 M5:fc9b1a7b42b97a864f56b348b06095e6 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr17 LN:81195210 M5:351f64d4f4f9ddd45b35336ad97aa6de UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr18 LN:78077248 M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr19 LN:59128983 M5:1aacd71f30db8e561810913e0b72636d UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr20 LN:63025520 M5:0dec9660ec1efaaf33281c0d5ea2560f UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr21 LN:48129895 M5:2979a6085bfe28e3ad6f552f361ed74d UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr22 LN:51304566 M5:a718acaa6135fdca8357d5bfe94211dd UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chrY LN:59373566 M5:1e86411d73e6f00a10590f976be01623 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chrM LN:16569 M5:c68f52674c9fb33aef52dcf399755519 UR:/scratch/USER/GRCh37.p13.genome.fa @RG ID:BM_10_548 PL:ILLUMINA PU:HiSeq2500 LB:BM_10_548 DS:rnaseq SM:BM_10_548 CN:MSSM HISEQ:147:C6A5BANXX:7:1206:13415:49816 16 chr1 10004 255 100M 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NH:i:1 HI:i:1 nM:i:1 AS:i:96 MD:Z:21T78 NM:i:1 RG:Z:BM_10_548 HISEQ:147:C6A5BANXX:7:1107:16113:24283 0 chr1 10006 255 100M 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NH:i:1 HI:i:1 nM:i:2 AS:i:94 MD:Z:79T9C10 NM:i:2 RG:Z:BM_10_548 HISEQ:147:C6A5BANXX:7:1209:7477:64467 16 chr1 10044 1 100M 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NH:i:4 HI:i:4 nM:i:1 AS:i:96 MD:Z:94T5 NM:i:1 RG:Z:BM_10_548 HISEQ:147:C6A5BANXX:7:2116:16620:69070 0 chr1 10148 255 27S73M 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NH:i:1 HI:i:1 nM:i:1 AS:i:69 MD:Z:69A3 NM:i:1 RG:Z:BM_10_548 HISEQ:147:C6A5BANXX:7:1209:17895:49628 16 chr1 10541 0 98M2S 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NH:i:5 HI:i:2 nM:i:4 AS:i:88 MD:Z:19C62T10G0C3 NM:i:4 RG:Z:BM_10_548

First few lines from GTF:

gff-version 2

source-version rtracklayer 1.46.0

date 2020-03-16

chr1 ensembl_havana gene 11869 14412 . + . gene_id "ENSG00000223972"; gene_version "4"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; chr1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "4"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_id "ENST00000456328"; transcript_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; chr1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "4"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_id "ENST00000456328"; transcript_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; exon_number "1"; exon_id "ENSE00002234944"; exon_version "1"; chr1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "4"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_id "ENST00000456328"; transcript_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; exon_number "2"; exon_id "ENSE00003582793"; exon_version "1";

I do not get counts at all:

no_feature 0 ambiguous 0 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0

iosonofabio commented 4 years ago

please try sorting the BAM file by name

rached-97 commented 4 years ago

Thank you for the prompt response, I appreciate it! I will attempt to run htseq-count on a name sorted bam. Although, I do not understand why a coordinate sorted bam would be a problem given the library is single end.

iosonofabio commented 4 years ago

I don't know either... Please let me know if you can actually track down the problem, if love to fix htseq with your help!

On Tue, Mar 17, 2020, at 15:00, rached-97 wrote:

Thank you for the prompt response, I appreciate it! I will attempt to run htseq-count on a name sorted bam. Although, I do not understand why a coordinate sorted bam would be a problem given the library is single end.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97#issuecomment-599865804, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEAG2L62NSC7PSIWFDR3RH3YULANCNFSM4LAPRJFQ.

rached-97 commented 4 years ago

I can confirm experiencing the same error using a name sorted bam.

kscott94 commented 4 years ago

I did a pip install and used htseq-count, which produced the issue. I did a pip uninstall, then re installed with setup.py. I then called the actual count.py script and it worked.

iosonofabio commented 4 years ago

rached please confirm if that fixes it, then it's a bug with the pypi version and can fix that

On Wed, Mar 18, 2020, at 07:19, Kristin Scott wrote:

I did a pip install and used htseq-count, which produced the issue. I did a pip uninstall, then re installed with setup.py. I then called the actual count.py script and it worked.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97#issuecomment-600277519, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEABNFMLZBDCRZFRTVKDRH7LNXANCNFSM4LAPRJFQ.

rached-97 commented 4 years ago

Due to file permission issues that I was unable to resolve, I could not install HTSeq using setup.py. I did successfully install version 0.11.2 using pip and that has worked fine with my coordinate sorted BAM.

iosonofabio commented 4 years ago

Thanks for the effort. I'll fix the pip version.

On 3/19/20 3:43 PM, rached-97 wrote:

Due to file permission issues that I was unable to resolve, I could not install HTSeq using setup.py. I did successfully install version 0.11.2 using pip and that has worked fine with my coordinate sorted BAM.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97#issuecomment-600985703, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEAAPJMXVT577XPNAG3LRIGPIBANCNFSM4LAPRJFQ.

kscott94 commented 4 years ago

Something new to report.. I was unable to get htseq to count reads in a bam file, It had to be a sam file. I specified -f bam, but the bam file would not be processed. It works fine with both name and position sorted sam files. I don't think the pip install was bugged.

htseq-count doesn't like something about bam files. Only likes sam files.

iosonofabio commented 4 years ago

Hmm, this contradicts with your previous report: you previously wrote that htseq-count worked when installed via setup.py.

What does it mean that the BAM file "would not be processed"? Can you send some screenshots showing the errors? Otherwise it's hard for me to understand what's going on since the automatic tests (including processing SAM and BAM files) are all passed successfully.

Also notice: since the latest version, htseq acknowledges the fact that the underlying library to parse SAM and BAM files has automatic filetype detection, so it would be helpful if you tested htseq-count on a SAM file and then convert it into BAM and test htseq-count on the converted file.

Thank you,

kscott94 commented 4 years ago

Hi

So when i reported that it worked, i was using a sam file (when i thought i was using a bam file). I subsetted the bam file to 10,000 reads, and when doing that, I converted to sam and forgot. So when i tested it with the subsetted sam, it worked, I thought it was because I re installed using set-up.py.

But now I am going back to get the count data, and i have not subsetted, so the files are still bam files. Now I have a bunch of bam files that i am trying to count. there is no error message. It is just that 0 bam alignments processed. The same output as originally reported when i opened this issue.

Despite over 7 milion reads, the output reads:

0 BAM alignments processed. . . no_feature 0 ambiguous 0 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0

BTW the way i subsetted or "converted" to sam file is: samtools view -h reads.bam > reads.sam

Works like a charm. This could be an issue with just my bam files. But since another person experienced the same issue, it might be something more complicated. My bam files were produced from BS-seeker2, which builds off bowtie2 in my case.

I will do some tests as you suggest and will report again.

Hope this helps.

iosonofabio commented 4 years ago

Thank you Kristin, this clarifies the contradiction.

I would suggest you:

  1. take a SAM file that works
  2. double check htseq-count works correctly on this file
  3. convert it to BAM via samtools - samtools view -b -h -o testfile.bam testfile.sam
  4. test htseq-count on this bam file
  5. report the results/errors/screenshots

Thank you

kscott94 commented 4 years ago

So i just converted the sam file to bam file and the issue persisted. 0 BAM alignments processed. The sam file worked previously.

SAM output:

$ htseq-count testfile.sam /projects/kscott94\@colostate.edu/genome/TS559_Blastn_annotations.gff

2305 GFF lines processed. 9997 SAM alignments processed. TK0001 8 TK0002 0 TK0003 0

. . . TK2303 2 TK2304 7 TK2305 0 TK2306 0 no_feature 311 ambiguous 270 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0


BAM output: $ htseq-count -f bam testfile.bam /projects/kscott94\@colostate.edu/genome/TS559_Blastn_annotations.gff

2305 GFF lines processed. 0 BAM alignments processed. TK0001 0 TK0002 0 TK0003 0 TK0004 0 TK0005 0 . . . TK2304 0 TK2305 0 TK2306 0 no_feature 0 ambiguous 0 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0

iosonofabio commented 4 years ago

I redownloaded the files from your google drive folder. Notice you called it BAM there but it's a SAM. I ran htseq-count on that SAM and gff file (which is actually a GTF, please rename) and it gives zero counts.

So I'm confused because not even the SAM file works here now. I'll look into it

kscott94 commented 4 years ago

I haven't touch the files in the drive. If it isn't working for you now, I had nothing to do with it. And yes, I have already said three times I converted to sam while subsetting, but forgot to rename it.

The name of the annotation file is inconsequential right now.

iosonofabio commented 4 years ago

When I find a minute I'll have a look. Doesn't seem to affect too many other people except you, which is good, hopefully a very specific issue with your files.

It'd be fantastic if you could also have a look at the code and see if you can debug a bit.

iosonofabio commented 4 years ago

I tracked this down to an inconsistency plaguing pysam for a long time. I now pushed a new commit to master and will test it. If everything works, I'll make a release.

iosonofabio commented 4 years ago

Fixed in new release, 0.11.4. Closing, reopen if that version is still failing.

koellingh commented 3 years ago

Hey @iosonofabio, I ran into this issue as well with the most recent release of htseq.

Here is the command I was using: htseq-count --order=pos --stranded=no --type=gene --idattr=gene_id --additional-attr=gene_name --mode=union --nonunique=none myfile.sortedByCoord.out.bam ~/human_reference_genome.gtf > htseq_output.txt

This command processed my gtf file but processed 0 BAM alignments.

After reading through the Usage documentation, I tried using python -m HTSeq.scripts.count rather than htseq-count and it worked perfectly. The documentation says that python -m HTSeq.scripts.count can be used if htseq-count is not in the path, but this does not seem to be the case for me- if htseq-count weren't in my path, the program would not be able to read the gtf file. Am I correct with this assumption?

Do you have any idea why this might be the case? My hunch is that it has to do with the way I set up htseq on my machine. I followed the installation documentation for Ubuntu. Perhaps I wouldn't run into this issue if I installed from Git? Or from Pip? I can look into this.

Although I was able to successfully use htseq, I figured it might be helpful for future users who may run into this problem, and I am curious why this occurred.