Closed kscott94 closed 4 years ago
Please try the name suited file and let me know
On Wed, Mar 4, 2020, at 04:41, Kristin Scott wrote:
Running htseq-count, but won't process bam file. argument: htseq-count -f bam -r pos ./paired_end_sorted.bam ./annotations.gff
samtools view -c indicates over 8 million reads in bam file.
Here is the head and tail output from htseq-count 2358 GFF lines processed. 0 BAM alignments processed. . . . TK2302 0 TK2303 0 TK2304 0 TK2305 0 TK2306 0 no_feature 0 ambiguous 0 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0
head and tail of bam file produced by BSseeker2/bowtie2:
kscott94:~$ samtoools view -h paired_end_sorted.bam | head
@hd https://github.com/hd VN:1.0 SO:coordinate @sq https://github.com/sq SN:TS559_Genomic_Sequence.seq LN:2087105 @pg https://github.com/pg PN:BS Seeker 2 ID:1 CL:/projects/kscott94@colostate.edu/tools/BSseeker2/bs_seeker2-align.py -i ./TK0008_exo_depl_cat_filtered.fastq -g /projects/kscott94@colostate.edu/genome/TS559_genome.fa --temp_dir=/scratch/summit/kscott94@colostate.edu/tmp -m 2 --XS=0.03,3 --bt2--mm --bt-p 8 --aligner=bowtie2 -p /projects/kscott94@colostate.edu/tools/bowtie2/ NS500355:NS500355:H5JJFAFXY:1:11203:24804:3180 0 TS559_Genomic_Sequence.seq 1 255 76M 0 0 ATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGAAAATGGTGAGT XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-----------x--x---- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGAAAACGGCGAGT_TT NS500355:NS500355:H5JJFAFXY:1:21208:8349:10740 0 TS559_Genomic_Sequence.seq 1 255 2S74M 0 0 AGATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGAAAATGGTGA XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-----------x--x-- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGAAAACGGCGA_GT NS500355:NS500355:H5JJFAFXY:2:11104:14810:18979 0 TS559_Genomic_Sequence.seq 1 255 11S65M 0 0 GGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGA XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-------- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGA_AA . . . NS500355:NS500355:H5JJFAFXY:2:21206:18165:9780 0 TS559_Genomic_Sequence.seq 2087060 255 46M30S 0 0 TTAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAGTTGAG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----------zy---------z--z-------------------- XG:Z:GT_TTAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN NS500355:NS500355:H5JJFAFXY:3:21402:17601:18156 0 TS559_Genomic_Sequence.seq 2087060 255 46M30S 0 0 TTAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----------zy---------z--z-------------------- XG:Z:GT_TTAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN NS500355:NS500355:H5JJFAFXY:4:21405:22021:1321 0 TS559_Genomic_Sequence.seq 2087061 255 45M31S 0 0 TAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAGG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:----------zy---------z--z-------------------- XG:Z:TT_TAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN
head of gff file
gff-version 3
TS559_Genomic_Sequence.seq manual region 1 2088737 . + . gene_id=id0;Dbxref=taxon:69014;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;note=synonym: Thermococcus kodakaraensis TS559;strain=TS559 TS559_Genomic_Sequence.seq BLATSn exon 1 5016 . + 0 gene_id=TK0001;gene_biotype=protein_coding;locus_tag=TK0001 TS559_Genomic_Sequence.seq BLATSn exon 5086 5733 . + 0 gene_id=TK0002;gene_biotype=protein_coding;locus_tag=TK0002 TS559_Genomic_Sequence.seq BLATSn exon 5730 6119 . + 0 gene_id=TK0003;gene_biotype=protein_coding;locus_tag=TK0003 TS559_Genomic_Sequence.seq BLATSn exon 6079 6528 . - 0 gene_id=TK0004;gene_biotype=protein_coding;locus_tag=TK0004 TS559_Genomic_Sequence.seq BLATSn exon 6586 7014 . + 0 gene_id=TK0005;gene_biotype=protein_coding;locus_tag=TK0005 TS559_Genomic_Sequence.seq BLATSn exon 7152 7427 . - 0 gene_id=TK0006;gene_biotype=protein_coding;locus_tag=TK0006
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97?email_source=notifications&email_token=AAJFEAEAGSMVYTPZNNSYWLDRFU6MPA5CNFSM4LAPRJF2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4ISDIAMA, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEAAJW54XGSWGSBML4W3RFU6MPANCNFSM4LAPRJFQ.
Sorted by name. Same result. 0 BAM alignments processed.
Thanks for trying. I see you are using a GFF instead of a more canonical GTF. Or is it a GTF with the wrong name?
It'd be great if you could try with a small subsample (say 10,000 reads). If that does not work, happy to have a look at the subsample BAM and the GTF and try running it on my machine, just put them on google drive and share the link.
Thanks for helping me trouble shoot. Here is link to google drive: https://drive.google.com/open?id=1P6Dw8mKBIZVpzeqlPGZzIDcMlp9DJgQP
This should have the gff file and subsetted bam file.
It works from master branch on my laptop, here the end of the file:
tRNA-Leu-6 0
tRNA-Lys-1 1
tRNA-Lys-2 0
tRNA-Met-1 2
tRNA-Met-2 0
tRNA-Met-3 0
tRNA-Phe 0
tRNA-Pro-1 0
tRNA-Pro-2 0
tRNA-Pro-3 0
tRNA-Ser-1 0
tRNA-Ser-2 0
tRNA-Ser-3 0
tRNA-Ser-4 0
tRNA-Thr-1 0
tRNA-Thr-2 0
tRNA-Thr-3 0
tRNA-Trp 5
tRNA-Tyr 0
tRNA-Val-1 0
tRNA-Val-2 0
tRNA-Val-3 0
tRNA-Val-4 0
__no_feature 290
__ambiguous 270
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 0
You can try cloning the master branch from git and see. I build it with:
python setup.py build
and then run it with:
PYTHONPATH=build/lib.linux-x86_64-3.8:$PYTHONPATH python ./build/lib.linux-x86_64-3.8/HTSeq/scripts/count.py example_data/name_sorted_subsample_paired.bam example_data/TS559_manual_annotation.gff
That did the trick! Thank you!
Great! Do you mind if I anonymyze the data and keep it in the repo for testing?
Thank you Fabio
On Sun, Mar 8, 2020, at 16:08, Kristin Scott wrote:
That did the trick! Thank you!
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97?email_source=notifications&email_token=AAJFEAEHTJMRPPFUNUOYIJTRGMR45A5CNFSM4LAPRJF2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEOEMWQA#issuecomment-596167488, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEAD345AASNCMKVHX2TDRGMR45ANCNFSM4LAPRJFQ.
Sure. No problem!
Btw just discovered the problem. Your file is a SAM, not a BAM. Also, your annotation file is GTF, not GFF3, as far as I can tell. So that's that, make sure you know your file formats.
Still, htseq should give an error, so I'll look into that.
When I Subsetted, I converted it to a Sam. Sorry I forgot the change the extension.
Also GTF and gff3 are the same file format. What I sent you were the latest formats that I tried.
The problem was solved when I called htseq-count from the count.py script instead of htseq-count as a stand alone argument.
Thank you. That's basically just a link, so it can't be the problem.
Anyway I contacted upstream at pysam and the issue is clear, I'll adapt the docs tonight.
On Mon, Mar 9, 2020, at 00:54, Kristin Scott wrote:
The problem was solved when I called htseq-count from the count.py script instead of htseq-count
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97?email_source=notifications&email_token=AAJFEACJWWKZMC7BVSE63ZDRGOWS5A5CNFSM4LAPRJF2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEOEWVEA#issuecomment-596208272, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEABVAN5O2M24Z54CMMLRGOWS5ANCNFSM4LAPRJFQ.
I am experiencing a similar issue with htseq-count: "0 BAM alignments processed". My file is definitely in BAM format, coordinate sorted, single end, non-stranded library. I acquired my GTF file from Ensembl (Homo_sapiens.GRCh37.87.gtf.gz). Chromosome names have the same format in both BAM and GTF.
Here is my htseq-count command:
htseq-count -f bam -r pos -s no -a 30 -t exon -i gene_id -m intersection-nonempty BAM GTF
Numerous reads in the bam pass mapq 30 threshold.
Software versions:
python 3.8.0 htseq 0.11.3 pysam 0.15.4 numpy 1.17.4 samtools 1.10
First few reads from BAM (I replaced bases with Ns for privacy reasons. Notice that base qualities have been removed to conserve disk space. BAM passes samtools quickcheck):
@HD VN:1.0 SO:coordinate @PG ID:STAR PN:STAR VN:STAR_2.3.0e_r291 cl:STAR --chimSegmentMin 15 --chimJunctionOverhangMin 15 --outSAMstrandField intronMotif --genomeDir /sc/orga/projects/PBG/REFERENCES/hg19/star/2.4.0d/Homo_sapiens.GRCh37.ensembl70.overhang75bp --sjdbGTFfile /sc/orga/projects/PBG/REFERENCES/hg19/ensembl/Homo_sapiens.GRCh37.70.processed.gtf --runThreadN 10 --outReadsUnmapped fastx --outStd SAM --outSAMmode Full --outFileNamePrefix /sc/orga/scratch/shahh06/GCF/outgoing/ProductionQC/QC_S111.B394_Brain_RZero_Venu_2014_Nov.SE.RNASeqRibozero.RAPiD.Human/BM_10_548/Processed/RAPiD.2_0_0/star/accepted_hits --readFilesCommand zcat --readFilesIn /sc/orga/scratch/shahh06/GCF/outgoing/ProductionQC/QC_S111.B394_Brain_RZero_Venu_2014_Nov.SE.RNASeqRibozero.RAPiD.Human/BM_10_548/Raw/RNA.IlluminaHiSeq2500.RiboZero/BM_10_548_ATCACG_L007_R1_001.C6A5BANXX.fastq.gz CL:STAR --runThreadN 10 --genomeDir /sc/orga/projects/PBG/REFERENCES/hg19/star/2.4.0d/Homo_sapiens.GRCh37.ensembl70.overhang75bp --readFilesIn /sc/orga/scratch/shahh06/GCF/outgoing/ProductionQC/QC_S111.B394_Brain_RZero_Venu_2014_Nov.SE.RNASeqRibozero.RAPiD.Human/BM_10_548/Raw/RNA.IlluminaHiSeq2500.RiboZero/BM_10_548_ATCACG_L007_R1_001.C6A5BANXX.fastq.gz --readFilesCommand zcat --outFileNamePrefix /sc/orga/scratch/shahh06/GCF/outgoing/ProductionQC/QC_S111.B394_Brain_RZero_Venu_2014_Nov.SE.RNASeqRibozero.RAPiD.Human/BM_10_548/Processed/RAPiD.2_0_0/star/accepted_hits --outStd SAM --outSAMmode Full --outSAMstrandField intronMotif --outReadsUnmapped fastx --chimSegmentMin 15 --chimJunctionOverhangMin 15 --sjdbGTFfile /sc/orga/projects/PBG/REFERENCES/hg19/ensembl/Homo_sapiens.GRCh37.70.processed.gtf @PG ID:samtools PN:samtools PP:STAR VN:1.10 CL:samtools view -h -T /scratch/USER/GRCh37.p13.genome.fa -b /project/0000000/USER/MSBB_synapse/bam_to_cram/BM_10_548.accepted_hits.sort.coord.bam.cram @PG ID:samtools.1 PN:samtools PP:samtools VN:1.10 CL:samtools view -h BM_10_548.accepted_hits.sort.coord.bam @SQ SN:chr1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr3 LN:198022430 M5:641e4338fa8d52a5b781bd2a2c08d3c3 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr9 LN:141213431 M5:3e273117f15e0a400f01055d9f393768 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr10 LN:135534747 M5:988c28e000e84c26d552359af1ea2e1d UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr11 LN:135006516 M5:98c59049a2df285c76ffb1c6db8f8b96 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr12 LN:133851895 M5:51851ac0e1a115847ad36449b0015864 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr13 LN:115169878 M5:283f8d7892baa81b510a015719ca7b0b UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr14 LN:107349540 M5:98f3cae32b2a2e9524bc19813927542e UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr15 LN:102531392 M5:e5645a794a8238215b2cd77acb95a078 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr16 LN:90354753 M5:fc9b1a7b42b97a864f56b348b06095e6 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr17 LN:81195210 M5:351f64d4f4f9ddd45b35336ad97aa6de UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr18 LN:78077248 M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr19 LN:59128983 M5:1aacd71f30db8e561810913e0b72636d UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr20 LN:63025520 M5:0dec9660ec1efaaf33281c0d5ea2560f UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr21 LN:48129895 M5:2979a6085bfe28e3ad6f552f361ed74d UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chr22 LN:51304566 M5:a718acaa6135fdca8357d5bfe94211dd UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chrY LN:59373566 M5:1e86411d73e6f00a10590f976be01623 UR:/scratch/USER/GRCh37.p13.genome.fa @SQ SN:chrM LN:16569 M5:c68f52674c9fb33aef52dcf399755519 UR:/scratch/USER/GRCh37.p13.genome.fa @RG ID:BM_10_548 PL:ILLUMINA PU:HiSeq2500 LB:BM_10_548 DS:rnaseq SM:BM_10_548 CN:MSSM HISEQ:147:C6A5BANXX:7:1206:13415:49816 16 chr1 10004 255 100M 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NH:i:1 HI:i:1 nM:i:1 AS:i:96 MD:Z:21T78 NM:i:1 RG:Z:BM_10_548 HISEQ:147:C6A5BANXX:7:1107:16113:24283 0 chr1 10006 255 100M 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NH:i:1 HI:i:1 nM:i:2 AS:i:94 MD:Z:79T9C10 NM:i:2 RG:Z:BM_10_548 HISEQ:147:C6A5BANXX:7:1209:7477:64467 16 chr1 10044 1 100M 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NH:i:4 HI:i:4 nM:i:1 AS:i:96 MD:Z:94T5 NM:i:1 RG:Z:BM_10_548 HISEQ:147:C6A5BANXX:7:2116:16620:69070 0 chr1 10148 255 27S73M 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NH:i:1 HI:i:1 nM:i:1 AS:i:69 MD:Z:69A3 NM:i:1 RG:Z:BM_10_548 HISEQ:147:C6A5BANXX:7:1209:17895:49628 16 chr1 10541 0 98M2S 0 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NH:i:5 HI:i:2 nM:i:4 AS:i:88 MD:Z:19C62T10G0C3 NM:i:4 RG:Z:BM_10_548
First few lines from GTF:
chr1 ensembl_havana gene 11869 14412 . + . gene_id "ENSG00000223972"; gene_version "4"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; chr1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "4"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_id "ENST00000456328"; transcript_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; chr1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "4"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_id "ENST00000456328"; transcript_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; exon_number "1"; exon_id "ENSE00002234944"; exon_version "1"; chr1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "4"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_id "ENST00000456328"; transcript_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; exon_number "2"; exon_id "ENSE00003582793"; exon_version "1";
I do not get counts at all:
no_feature 0 ambiguous 0 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0
please try sorting the BAM file by name
Thank you for the prompt response, I appreciate it! I will attempt to run htseq-count on a name sorted bam. Although, I do not understand why a coordinate sorted bam would be a problem given the library is single end.
I don't know either... Please let me know if you can actually track down the problem, if love to fix htseq with your help!
On Tue, Mar 17, 2020, at 15:00, rached-97 wrote:
Thank you for the prompt response, I appreciate it! I will attempt to run htseq-count on a name sorted bam. Although, I do not understand why a coordinate sorted bam would be a problem given the library is single end.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97#issuecomment-599865804, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEAG2L62NSC7PSIWFDR3RH3YULANCNFSM4LAPRJFQ.
I can confirm experiencing the same error using a name sorted bam.
I did a pip install and used htseq-count, which produced the issue. I did a pip uninstall, then re installed with setup.py. I then called the actual count.py script and it worked.
rached please confirm if that fixes it, then it's a bug with the pypi version and can fix that
On Wed, Mar 18, 2020, at 07:19, Kristin Scott wrote:
I did a pip install and used htseq-count, which produced the issue. I did a pip uninstall, then re installed with setup.py. I then called the actual count.py script and it worked.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97#issuecomment-600277519, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEABNFMLZBDCRZFRTVKDRH7LNXANCNFSM4LAPRJFQ.
Due to file permission issues that I was unable to resolve, I could not install HTSeq using setup.py. I did successfully install version 0.11.2 using pip and that has worked fine with my coordinate sorted BAM.
Thanks for the effort. I'll fix the pip version.
On 3/19/20 3:43 PM, rached-97 wrote:
Due to file permission issues that I was unable to resolve, I could not install HTSeq using setup.py. I did successfully install version 0.11.2 using pip and that has worked fine with my coordinate sorted BAM.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simon-anders/htseq/issues/97#issuecomment-600985703, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFEAAPJMXVT577XPNAG3LRIGPIBANCNFSM4LAPRJFQ.
Something new to report.. I was unable to get htseq to count reads in a bam file, It had to be a sam file. I specified -f bam, but the bam file would not be processed. It works fine with both name and position sorted sam files. I don't think the pip install was bugged.
htseq-count doesn't like something about bam files. Only likes sam files.
Hmm, this contradicts with your previous report: you previously wrote that htseq-count worked when installed via setup.py.
What does it mean that the BAM file "would not be processed"? Can you send some screenshots showing the errors? Otherwise it's hard for me to understand what's going on since the automatic tests (including processing SAM and BAM files) are all passed successfully.
Also notice: since the latest version, htseq acknowledges the fact that the underlying library to parse SAM and BAM files has automatic filetype detection, so it would be helpful if you tested htseq-count on a SAM file and then convert it into BAM and test htseq-count on the converted file.
Thank you,
Hi
So when i reported that it worked, i was using a sam file (when i thought i was using a bam file). I subsetted the bam file to 10,000 reads, and when doing that, I converted to sam and forgot. So when i tested it with the subsetted sam, it worked, I thought it was because I re installed using set-up.py.
But now I am going back to get the count data, and i have not subsetted, so the files are still bam files. Now I have a bunch of bam files that i am trying to count. there is no error message. It is just that 0 bam alignments processed. The same output as originally reported when i opened this issue.
Despite over 7 milion reads, the output reads:
0 BAM alignments processed. . . no_feature 0 ambiguous 0 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0
BTW the way i subsetted or "converted" to sam file is: samtools view -h reads.bam > reads.sam
Works like a charm. This could be an issue with just my bam files. But since another person experienced the same issue, it might be something more complicated. My bam files were produced from BS-seeker2, which builds off bowtie2 in my case.
I will do some tests as you suggest and will report again.
Hope this helps.
Thank you Kristin, this clarifies the contradiction.
I would suggest you:
samtools view -b -h -o testfile.bam testfile.sam
Thank you
So i just converted the sam file to bam file and the issue persisted. 0 BAM alignments processed. The sam file worked previously.
SAM output:
$ htseq-count testfile.sam /projects/kscott94\@colostate.edu/genome/TS559_Blastn_annotations.gff
2305 GFF lines processed. 9997 SAM alignments processed. TK0001 8 TK0002 0 TK0003 0
. . . TK2303 2 TK2304 7 TK2305 0 TK2306 0 no_feature 311 ambiguous 270 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0
BAM output: $ htseq-count -f bam testfile.bam /projects/kscott94\@colostate.edu/genome/TS559_Blastn_annotations.gff
2305 GFF lines processed. 0 BAM alignments processed. TK0001 0 TK0002 0 TK0003 0 TK0004 0 TK0005 0 . . . TK2304 0 TK2305 0 TK2306 0 no_feature 0 ambiguous 0 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0
I redownloaded the files from your google drive folder. Notice you called it BAM there but it's a SAM. I ran htseq-count on that SAM and gff file (which is actually a GTF, please rename) and it gives zero counts.
So I'm confused because not even the SAM file works here now. I'll look into it
I haven't touch the files in the drive. If it isn't working for you now, I had nothing to do with it. And yes, I have already said three times I converted to sam while subsetting, but forgot to rename it.
The name of the annotation file is inconsequential right now.
When I find a minute I'll have a look. Doesn't seem to affect too many other people except you, which is good, hopefully a very specific issue with your files.
It'd be fantastic if you could also have a look at the code and see if you can debug a bit.
I tracked this down to an inconsistency plaguing pysam for a long time. I now pushed a new commit to master and will test it. If everything works, I'll make a release.
Fixed in new release, 0.11.4. Closing, reopen if that version is still failing.
Hey @iosonofabio, I ran into this issue as well with the most recent release of htseq.
Here is the command I was using:
htseq-count --order=pos --stranded=no --type=gene --idattr=gene_id --additional-attr=gene_name --mode=union --nonunique=none myfile.sortedByCoord.out.bam ~/human_reference_genome.gtf > htseq_output.txt
This command processed my gtf file but processed 0 BAM alignments.
After reading through the Usage documentation, I tried using python -m HTSeq.scripts.count
rather than htseq-count
and it worked perfectly. The documentation says that python -m HTSeq.scripts.count
can be used if htseq-count
is not in the path, but this does not seem to be the case for me- if htseq-count
weren't in my path, the program would not be able to read the gtf file. Am I correct with this assumption?
Do you have any idea why this might be the case? My hunch is that it has to do with the way I set up htseq on my machine. I followed the installation documentation for Ubuntu. Perhaps I wouldn't run into this issue if I installed from Git? Or from Pip? I can look into this.
Although I was able to successfully use htseq, I figured it might be helpful for future users who may run into this problem, and I am curious why this occurred.
Running htseq-count, but won't process bam file. argument: htseq-count -f bam -r pos ./paired_end_sorted.bam ./annotations.gff
samtools view -c indicates over 8 million reads in bam file.
Here is the head and tail output from htseq-count 2358 GFF lines processed. 0 BAM alignments processed. . . . TK2302 0 TK2303 0 TK2304 0 TK2305 0 TK2306 0 no_feature 0 ambiguous 0 too_low_aQual 0 __not_aligned 0 alignment_not_unique 0
head and tail of bam file produced by BSseeker2/bowtie2:
kscott94:~$ samtoools view -h paired_end_sorted.bam | head
@HD VN:1.0 SO:coordinate @SQ SN:TS559_Genomic_Sequence.seq LN:2087105 @PG PN:BS Seeker 2 ID:1 CL:/projects/kscott94@colostate.edu/tools/BSseeker2/bs_seeker2-align.py -i ./TK0008_exo_depl_cat_filtered.fastq -g /projects/kscott94@colostate.edu/genome/TS559_genome.fa --temp_dir=/scratch/summit/kscott94@colostate.edu/tmp -m 2 --XS=0.03,3 --bt2--mm --bt-p 8 --aligner=bowtie2 -p /projects/kscott94@colostate.edu/tools/bowtie2/ NS500355:NS500355:H5JJFAFXY:1:11203:24804:3180 0 TS559_Genomic_Sequence.seq 1 255 76M 0 0 ATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGAAAATGGTGAGT XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-----------x--x---- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGAAAACGGCGAGT_TT NS500355:NS500355:H5JJFAFXY:1:21208:8349:10740 0 TS559_Genomic_Sequence.seq 1 255 2S74M 0 0 AGATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGAAAATGGTGA XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-----------x--x-- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGAAAACGGCGA_GT NS500355:NS500355:H5JJFAFXY:2:11104:14810:18979 0 TS559_Genomic_Sequence.seq 1 255 11S65M 0 0 GGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAGGATGGAAAGTTTGTTATAAGAATTTTTAAGAAGGA XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----zz-x--z-y---z--z----yx------------zy---z-----------z-------- XG:Z:NN_ATGATCCTCGACACTGACTACATAACCGAGGATGGAAAGCCTGTCATAAGAATTTTCAAGAAGGA_AA . . . NS500355:NS500355:H5JJFAFXY:2:21206:18165:9780 0 TS559_Genomic_Sequence.seq 2087060 255 46M30S 0 0 TTAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAGTTGAG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----------zy---------z--z-------------------- XG:Z:GT_TTAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN NS500355:NS500355:H5JJFAFXY:3:21402:17601:18156 0 TS559_Genomic_Sequence.seq 2087060 255 46M30S 0 0 TTAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:-----------zy---------z--z-------------------- XG:Z:GT_TTAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN NS500355:NS500355:H5JJFAFXY:4:21405:22021:1321 0 TS559_Genomic_Sequence.seq 2087061 255 45M31S 0 0 TAGAATAAAGTTTGGATTGTTTTATAAGATTATGGGGGATGAAAGATGATTTTTGATATTGATTATATAATTGAGG XO:Z:+FW XS:i:0 NM:i:0 XM:Z:----------zy---------z--z-------------------- XG:Z:TT_TAGAATAAAGCCTGGATTGTTCTACAAGATTATGGGGGATGAAAG_NN
head of gff file
gff-version 3
TS559_Genomic_Sequence.seq manual region 1 2088737 . + . gene_id=id0;Dbxref=taxon:69014;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;note=synonym: Thermococcus kodakaraensis TS559;strain=TS559 TS559_Genomic_Sequence.seq BLATSn exon 1 5016 . + 0 gene_id=TK0001;gene_biotype=protein_coding;locus_tag=TK0001 TS559_Genomic_Sequence.seq BLATSn exon 5086 5733 . + 0 gene_id=TK0002;gene_biotype=protein_coding;locus_tag=TK0002 TS559_Genomic_Sequence.seq BLATSn exon 5730 6119 . + 0 gene_id=TK0003;gene_biotype=protein_coding;locus_tag=TK0003 TS559_Genomic_Sequence.seq BLATSn exon 6079 6528 . - 0 gene_id=TK0004;gene_biotype=protein_coding;locus_tag=TK0004 TS559_Genomic_Sequence.seq BLATSn exon 6586 7014 . + 0 gene_id=TK0005;gene_biotype=protein_coding;locus_tag=TK0005 TS559_Genomic_Sequence.seq BLATSn exon 7152 7427 . - 0 gene_id=TK0006;gene_biotype=protein_coding;locus_tag=TK0006