mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

TEcounts Zero Values Not Chr in GTF Error #43

Closed mmilevskiy closed 4 years ago

mmilevskiy commented 4 years ago

Hi @olivertam

I am having the same issues as a previously solved error "Zero TE Counts" however my GTF files look fine, all of the Chr entries are "Chr1" and not "1" for both the gene and TE gtf files. Do you know what else might be causing this?

Any help is much appreciated.

olivertam commented 4 years ago

Hi, Which genome build are you using? What is the command line that you're using? Would you be able to provide an excerpt of your BAM file? Including the headers would be great. Thanks.

mmilevskiy commented 4 years ago

Hi @olivertam

Thanks for the swift reply! I ran the test data to check that everything is working and there are definitely counts in the test analysis.

I am using mm10, running TEcount through anaconda2 as follows:

STAR --genomeDir /REF/Mouse-mm10-release81/STAR/ --runThreadN 6 --readFilesIn Read1.fastq Read2.fastq --outFileNamePrefix STAR/Bam_file -Star --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100

TEcount --format BAM --mode multi -b STAR/Bam_fileStarAligned.sortedByCoord.out.bam --project TE_Counts --GTF mm10_ensemble86_Basic.gtf --TE mm10_rmsk_TE.gtf --sortByPos

I haven't truncated a bam file before but did this: samtools view -h STAR/Bam_fileStarAligned.sortedByCoord.out.bam | head -n 500000 | samtools view -bS - > little.bam

little.bam.gz

Your help is much appreciated!

olivertam commented 4 years ago

Hi @mmilevskiy,

It appears that your BAM file was mapped to a genome that used "1" instead of "chr1". You can see it in the header (SQ lines):

@HD     VN:1.4  SO:coordinate
@SQ     SN:1    LN:195471971
@SQ     SN:10   LN:130694993
@SQ     SN:11   LN:122082543
@SQ     SN:12   LN:120129022
@SQ     SN:13   LN:120421639
@SQ     SN:14   LN:124902244
@SQ     SN:15   LN:104043685
...

and also in the alignment (3rd column)

NB500916:591:H3V5NBGXC:4:11508:8170:12391       419     1       3031476 0       80M     =       3031575 180     GAAGCAAGAGAGAGAGAAAACGAAACCCCGTCCCTATTAAGGAGAATTTTCCTTCGCCTAGGACGTGTCACTCCCTGATT        AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEE/EE        NH:i:67 HI:i:64 AS:i:159        nM:i:0
NB500916:591:H3V5NBGXC:4:13406:20130:4217       419     1       3031476 0       80M     =       3031575 180     GAAGCAAGAGAGAGAGAAAACGAAACCCCGTCCCTATTAAGGAGAATTTTCCTTCGCCTAGGACGTGTCACTCCCTGATT        AAAAAEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEE        NH:i:67 HI:i:64 AS:i:159        nM:i:0
NB500916:591:H3V5NBGXC:1:21204:2926:4168        419     1       3031489 0       80M1S   =       3031554 146     GAGAAAACGAAACCCCGTCCCTATTAAGGAGAATTTTCCTTCGCCTAGGACGTGTCACTCCCTGATTGGCTGCAGCCCATN       AAAAAEEEEAE6EEEE/EE/AEE//EE6AEEEE/EE/AAE/EEEEEEEEAAEE<AEEEE/EEE/EEEEAE6EAEAE<<EA#       NH:i:80 HI:i:77 AS:i:159        nM:i:0

In this case, the mm10 TE GTF won't be able to annotate correctly. Instead, you can use the GRCm38 version, which would hopefully work. I would also double-check your gene GTF file (which looks like it's from ENSEMBL), and make sure that it is using "1" instead of "chr1" for the chromosome. Otherwise, you might not get any gene annotations.

Thanks.

mmilevskiy commented 4 years ago

hi @olivertam

Good spotting!

I ran that on the server overnight and it failed for a different reason now. After removing the the "chr" from both GTF files there was an error:

INFO @ Thu, 01 Aug 2019 18:17:54: ARGUMENTS LIST: name = Ba1_RNA-TE BAM file = STAR/Ba1_RNA-StarAligned.sortedByCoord.out.bam GTF file = mm10_ensemble86_Basic_Fix.gtf TE file = mm10_rmsk_TE_Fix.gtf multi-mapper mode = multi stranded = yes number of iteration = 100 Alignments grouped by read ID = False

100000 GTF lines processed. 200000 GTF lines processed. 300000 GTF lines processed. 400000 GTF lines processed.

INFO @ Thu, 01 Aug 2019 18:25:20: Done building gene index ......

INFO @ Thu, 01 Aug 2019 18:26:18: Building TE index .......

Error in building gene/TE index

Old GFT: chr1 mm10_rmsk exon 100000467 100001474 7279 - . gene_id "RLTR13D5"; transcript_id "RLTR13D5_dup44"; family_id "ERVK"; class_id "LTR"; chr1 mm10_rmsk exon 100001476 100001763 902 - . gene_id "Lx9"; transcript_id "Lx9_dup2175"; family_id "L1"; class_id "LINE"; chr1 mm10_rmsk exon 100001800 100001980 1100 + . gene_id "B2_Mm2"; transcript_id "B2_Mm2_dup2770"; family_id "B2"; class_id "SINE"; chr1 mm10_rmsk exon 100002005 100002142 413 + . gene_id "ID_B1"; transcript_id "ID_B1_dup2911"; family_id "B4"; class_id "SINE"; chr1 mm10_rmsk exon 100002229 100002346 439 + . gene_id "B4A"; transcript_id "B4A_dup3449"; family_id "B4"; class_id "SINE"; chr1 mm10_rmsk exon 100002377 100003451 8074 - . gene_id "Lx3_Mus"; transcript_id "Lx3_Mus_dup403"; family_id "L1"; class_id "LINE"; chr1 mm10_rmsk exon 100003484 100003677 8074 - . gene_id "Lx3_Mus"; transcript_id "Lx3_Mus_dup404"; family_id "L1"; class_id "LINE"; chr1 mm10_rmsk exon 100003678 100003907 1004 - . gene_id "ORR1B2"; transcript_id "ORR1B2_dup365"; family_id "ERVL-MaLR"; class_id "LTR"; chr1 mm10_rmsk exon 100004201 100004342 975 - . gene_id "B1_Mus1"; transcript_id "B1_Mus1_dup2926"; family_id "Alu"; class_id "SINE"; chr1 mm10_rmsk exon 100004604 100004869 673 - . gene_id "B4"; transcript_id "B4_dup2051"; family_id "B4"; class_id "SINE";

New GTF: 1 mm10_rmsk exon 100000467 100001474 7279 - . gene_id "RLTR13D5"; transcript_id "RLTR13D5_dup44"; family_id "ERVK"; class_id "LTR"; 1 mm10_rmsk exon 100001476 100001763 902 - . gene_id "Lx9"; transcript_id "Lx9_dup2175"; family_id "L1"; class_id "LINE"; 1 mm10_rmsk exon 100001800 100001980 1100 + . gene_id "B2_Mm2"; transcript_id "B2_Mm2_dup2770"; family_id "B2"; class_id "SINE"; 1 mm10_rmsk exon 100002005 100002142 413 + . gene_id "ID_B1"; transcript_id "ID_B1_dup2911"; family_id "B4"; class_id "SINE"; 1 mm10_rmsk exon 100002229 100002346 439 + . gene_id "B4A"; transcript_id "B4A_dup3449"; family_id "B4"; class_id "SINE"; 1 mm10_rmsk exon 100002377 100003451 8074 - . gene_id "Lx3_Mus"; transcript_id "Lx3_Mus_dup403"; family_id "L1"; class_id "LINE"; 1 mm10_rmsk exon 100003484 100003677 8074 - . gene_id "Lx3_Mus"; transcript_id "Lx3_Mus_dup404"; family_id "L1"; class_id "LINE"; 1 mm10_rmsk exon 100003678 100003907 1004 - . gene_id "ORR1B2"; transcript_id "ORR1B2_dup365"; family_id "ERVL-MaLR"; class_id "LTR"; 1 mm10_rmsk exon 100004201 100004342 975 - . gene_id "B1_Mus1"; transcript_id "B1_Mus1_dup2926"; family_id "Alu"; class_id "SINE"; 1 mm10_rmsk exon 100004604 100004869 673 - . gene_id "B4"; transcript_id "B4_dup2051"; family_id "B4"; class_id "SINE";

olivertam commented 4 years ago

Hi @mmilevskiy

Might I recommend using the GRCm38 version of the GTF? This is the version that is designed to handle the mouse genome (mm10) without the "chr" in the chromosome name. If this version does not work, I'd be happy to take a closer at your modified GTF.

Thanks.

mmilevskiy commented 4 years ago

Hi @olivertam, great will give that a try and let you know what happens.

Thanks.

mmilevskiy commented 4 years ago

Hi @olivertam, got it working! Thanks so much for your help.