Closed mmilevskiy closed 4 years ago
Hi, Which genome build are you using? What is the command line that you're using? Would you be able to provide an excerpt of your BAM file? Including the headers would be great. Thanks.
Hi @olivertam
Thanks for the swift reply! I ran the test data to check that everything is working and there are definitely counts in the test analysis.
I am using mm10, running TEcount through anaconda2 as follows:
STAR --genomeDir /REF/Mouse-mm10-release81/STAR/ --runThreadN 6 --readFilesIn Read1.fastq Read2.fastq --outFileNamePrefix STAR/Bam_file -Star --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100
TEcount --format BAM --mode multi -b STAR/Bam_fileStarAligned.sortedByCoord.out.bam --project TE_Counts --GTF mm10_ensemble86_Basic.gtf --TE mm10_rmsk_TE.gtf --sortByPos
I haven't truncated a bam file before but did this: samtools view -h STAR/Bam_fileStarAligned.sortedByCoord.out.bam | head -n 500000 | samtools view -bS - > little.bam
Your help is much appreciated!
Hi @mmilevskiy,
It appears that your BAM file was mapped to a genome that used "1" instead of "chr1". You can see it in the header (SQ lines):
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:195471971
@SQ SN:10 LN:130694993
@SQ SN:11 LN:122082543
@SQ SN:12 LN:120129022
@SQ SN:13 LN:120421639
@SQ SN:14 LN:124902244
@SQ SN:15 LN:104043685
...
and also in the alignment (3rd column)
NB500916:591:H3V5NBGXC:4:11508:8170:12391 419 1 3031476 0 80M = 3031575 180 GAAGCAAGAGAGAGAGAAAACGAAACCCCGTCCCTATTAAGGAGAATTTTCCTTCGCCTAGGACGTGTCACTCCCTGATT AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEE/EE NH:i:67 HI:i:64 AS:i:159 nM:i:0
NB500916:591:H3V5NBGXC:4:13406:20130:4217 419 1 3031476 0 80M = 3031575 180 GAAGCAAGAGAGAGAGAAAACGAAACCCCGTCCCTATTAAGGAGAATTTTCCTTCGCCTAGGACGTGTCACTCCCTGATT AAAAAEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEE NH:i:67 HI:i:64 AS:i:159 nM:i:0
NB500916:591:H3V5NBGXC:1:21204:2926:4168 419 1 3031489 0 80M1S = 3031554 146 GAGAAAACGAAACCCCGTCCCTATTAAGGAGAATTTTCCTTCGCCTAGGACGTGTCACTCCCTGATTGGCTGCAGCCCATN AAAAAEEEEAE6EEEE/EE/AEE//EE6AEEEE/EE/AAE/EEEEEEEEAAEE<AEEEE/EEE/EEEEAE6EAEAE<<EA# NH:i:80 HI:i:77 AS:i:159 nM:i:0
In this case, the mm10 TE GTF won't be able to annotate correctly. Instead, you can use the GRCm38 version, which would hopefully work. I would also double-check your gene GTF file (which looks like it's from ENSEMBL), and make sure that it is using "1" instead of "chr1" for the chromosome. Otherwise, you might not get any gene annotations.
Thanks.
hi @olivertam
Good spotting!
I ran that on the server overnight and it failed for a different reason now. After removing the the "chr" from both GTF files there was an error:
INFO @ Thu, 01 Aug 2019 18:17:54: ARGUMENTS LIST: name = Ba1_RNA-TE BAM file = STAR/Ba1_RNA-StarAligned.sortedByCoord.out.bam GTF file = mm10_ensemble86_Basic_Fix.gtf TE file = mm10_rmsk_TE_Fix.gtf multi-mapper mode = multi stranded = yes number of iteration = 100 Alignments grouped by read ID = False
100000 GTF lines processed. 200000 GTF lines processed. 300000 GTF lines processed. 400000 GTF lines processed.
INFO @ Thu, 01 Aug 2019 18:25:20: Done building gene index ......
INFO @ Thu, 01 Aug 2019 18:26:18: Building TE index .......
Error in building gene/TE index
Old GFT: chr1 mm10_rmsk exon 100000467 100001474 7279 - . gene_id "RLTR13D5"; transcript_id "RLTR13D5_dup44"; family_id "ERVK"; class_id "LTR"; chr1 mm10_rmsk exon 100001476 100001763 902 - . gene_id "Lx9"; transcript_id "Lx9_dup2175"; family_id "L1"; class_id "LINE"; chr1 mm10_rmsk exon 100001800 100001980 1100 + . gene_id "B2_Mm2"; transcript_id "B2_Mm2_dup2770"; family_id "B2"; class_id "SINE"; chr1 mm10_rmsk exon 100002005 100002142 413 + . gene_id "ID_B1"; transcript_id "ID_B1_dup2911"; family_id "B4"; class_id "SINE"; chr1 mm10_rmsk exon 100002229 100002346 439 + . gene_id "B4A"; transcript_id "B4A_dup3449"; family_id "B4"; class_id "SINE"; chr1 mm10_rmsk exon 100002377 100003451 8074 - . gene_id "Lx3_Mus"; transcript_id "Lx3_Mus_dup403"; family_id "L1"; class_id "LINE"; chr1 mm10_rmsk exon 100003484 100003677 8074 - . gene_id "Lx3_Mus"; transcript_id "Lx3_Mus_dup404"; family_id "L1"; class_id "LINE"; chr1 mm10_rmsk exon 100003678 100003907 1004 - . gene_id "ORR1B2"; transcript_id "ORR1B2_dup365"; family_id "ERVL-MaLR"; class_id "LTR"; chr1 mm10_rmsk exon 100004201 100004342 975 - . gene_id "B1_Mus1"; transcript_id "B1_Mus1_dup2926"; family_id "Alu"; class_id "SINE"; chr1 mm10_rmsk exon 100004604 100004869 673 - . gene_id "B4"; transcript_id "B4_dup2051"; family_id "B4"; class_id "SINE";
New GTF: 1 mm10_rmsk exon 100000467 100001474 7279 - . gene_id "RLTR13D5"; transcript_id "RLTR13D5_dup44"; family_id "ERVK"; class_id "LTR"; 1 mm10_rmsk exon 100001476 100001763 902 - . gene_id "Lx9"; transcript_id "Lx9_dup2175"; family_id "L1"; class_id "LINE"; 1 mm10_rmsk exon 100001800 100001980 1100 + . gene_id "B2_Mm2"; transcript_id "B2_Mm2_dup2770"; family_id "B2"; class_id "SINE"; 1 mm10_rmsk exon 100002005 100002142 413 + . gene_id "ID_B1"; transcript_id "ID_B1_dup2911"; family_id "B4"; class_id "SINE"; 1 mm10_rmsk exon 100002229 100002346 439 + . gene_id "B4A"; transcript_id "B4A_dup3449"; family_id "B4"; class_id "SINE"; 1 mm10_rmsk exon 100002377 100003451 8074 - . gene_id "Lx3_Mus"; transcript_id "Lx3_Mus_dup403"; family_id "L1"; class_id "LINE"; 1 mm10_rmsk exon 100003484 100003677 8074 - . gene_id "Lx3_Mus"; transcript_id "Lx3_Mus_dup404"; family_id "L1"; class_id "LINE"; 1 mm10_rmsk exon 100003678 100003907 1004 - . gene_id "ORR1B2"; transcript_id "ORR1B2_dup365"; family_id "ERVL-MaLR"; class_id "LTR"; 1 mm10_rmsk exon 100004201 100004342 975 - . gene_id "B1_Mus1"; transcript_id "B1_Mus1_dup2926"; family_id "Alu"; class_id "SINE"; 1 mm10_rmsk exon 100004604 100004869 673 - . gene_id "B4"; transcript_id "B4_dup2051"; family_id "B4"; class_id "SINE";
Hi @mmilevskiy
Might I recommend using the GRCm38 version of the GTF? This is the version that is designed to handle the mouse genome (mm10) without the "chr" in the chromosome name. If this version does not work, I'd be happy to take a closer at your modified GTF.
Thanks.
Hi @olivertam, great will give that a try and let you know what happens.
Thanks.
Hi @olivertam, got it working! Thanks so much for your help.
Hi @olivertam
I am having the same issues as a previously solved error "Zero TE Counts" however my GTF files look fine, all of the Chr entries are "Chr1" and not "1" for both the gene and TE gtf files. Do you know what else might be causing this?
Any help is much appreciated.