mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
217 stars 29 forks source link

Exception type: RuntimeError, raised in TEtranscripts:448 #98

Closed ZeyanZhang closed 3 years ago

ZeyanZhang commented 3 years ago

Hi Oliver,

Thanks so much for your kind help to solve the previous problem! I am running TEtranscripts v2.2.1 (Python v3.7.2) for analysis of TE expression from RNA-seq. The parameter for bam files generating and the logs of TEtranscripts were as below. Can you please advise what's wrong? Thank you so much for your time!

align using STAR

STAR \
--runThreadN ${threads} \
--genomeDir hg38/STAR.gencode.v34 \
--genomeLoad NoSharedMemory \
--winAnchorMultimapNmax 100 \
--outFilterMultimapNmax 100 \
--readFilesCommand zcat \
--readFilesIn ${fastq_R1} ${fastq_R2} \
--outFileNamePrefix ${star_prefix} \
--outSAMtype BAM Unsorted \
--outStd BAM_Unsorted | \
samtools sort -m 32G -T ${sample}.samtools -o ${bam} -

The logs INFO of TEtranscripts

INFO  @ Tue, 24 Aug 2021 21:25:09: 
# ARGUMENTS LIST:
# name = TE
# treatment files = ['BAM-STAR/G20_shL1_rep1.bam', 'BAM-STAR/G20_shL1_rep2.bam', 'BAM-STAR/G20_shL2_rep1.bam', 'BAM-STAR/G20_shL2_rep2.bam', 'BAM-STAR/G20_shL3_rep1.bam', 'BAM-STAR/G20_shL3_rep2.bam']
# control files = ['BAM-STAR/G20_CTL_rep1.bam', 'BAM-STAR/G20_CTL_rep2.bam']
# GTF file = hg38.refGene.gtf 
# TE file = hg38_rmsk_TE.gtf 
# multi-mapper mode = multi 
# stranded = reverse
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 10
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Tue, 24 Aug 2021 21:25:09: Processing GTF files ...

INFO  @ Tue, 24 Aug 2021 21:25:09: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
INFO  @ Tue, 24 Aug 2021 21:28:32: Done building gene index ...... 

INFO  @ Tue, 24 Aug 2021 21:28:58: Building TE index ....... 

INFO  @ Tue, 24 Aug 2021 21:34:34: Done building TE index ...... 

INFO  @ Tue, 24 Aug 2021 21:34:34: 
Reading sample files ... 

uniq te counts = 0 
.......start iterative optimization .......... 
There are not enough reads to estimate fragment length. 
Error in optimization 
Error occurred when assigning read gene/TE annotations in file  BAM-STAR/G20_shL1_rep1.bam. 
Error: No active exception to reraise 
[Exception type: RuntimeError, raised in TEtranscripts:448] 

Best, Zeyan

olivertam commented 3 years ago

Hi Zeyan,

Looking at the log output, it appears that there were no TE reads in your library. One thing that I would check/confirm is that the chromosome names across all three files (BAM, Gene GTF and TE GTF) match up. I noticed that you're using Gencode hg38, but I just wanted to make sure. Also, I noticed that there were not enough reads to estimate fragment length. That could indicate that there were no alignments in the BAM file. It might be worth checking that too. If that doesn't explain the error, please feel free to share an excerpt of your BAM and GTF files, and I can take a closer look.

Thanks.

ZeyanZhang commented 3 years ago

@olivertam Hi Oliver, Thank you so much! The chromosome names are matched. Below are excerpts of the files. I can not figure out what's wrong, can you please have a look? Thanks again!

BAM file

samtools view -h BAM-STAR/G20_CTL_rep1.bam |head

@HD VN:1.4  SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
samtools view -h BAM-STAR/G20_CTL_rep1.bam | head -300 | tail

K00291:228:H3735BBXY:1:1201:29559:40332 355 chr1    14451   3   50M1S   =   14738   338 CTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGAAGT AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NH:i:2  HI:i:2  AS:i:97 nM:i:1
K00291:228:H3735BBXY:1:1201:29559:40332 355 chr1    14451   3   50M1S   =   14738   338 CTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGAAGT AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NH:i:2  HI:i:2  AS:i:97 nM:i:1
K00291:228:H3735BBXY:2:1128:14702:48386 355 chr1    14452   1   49M2S   =   14551   144 TCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGAAGTA AAFFFJFFFFJJJJJFFJJJJ<JJFJJJFFJFFJJJ<JJFFJJFFJ<AJFJ NH:i:4  HI:i:3  AS:i:90 nM:i:1
K00291:228:H3735BBXY:2:1128:14702:48386 355 chr1    14452   1   49M2S   =   14551   144 TCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGAAGTA AAFFFJFFFFJJJJJFFJJJJ<JJFJJJFFJFFJJJ<JJFFJJFFJ<AJFJ NH:i:4  HI:i:3  AS:i:90 nM:i:1
K00291:228:H3735BBXY:2:2120:2595:20340  355 chr1    14458   1   51M =   14615   208 GAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCA AAFFFJFJJJJJJJJFJJJJJJFJJFJJJJJJJJJJJJJJJJJJJFJFAJJ NH:i:4  HI:i:2  AS:i:94 nM:i:3
K00291:228:H3735BBXY:2:2120:2595:20340  355 chr1    14458   1   51M =   14615   208 GAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCA AAFFFJFJJJJJJJJFJJJJJJFJJFJJJJJJJJJJJJJJJJJJJFJFAJJ NH:i:4  HI:i:2  AS:i:94 nM:i:3
K00291:228:H3735BBXY:2:1116:19360:7679  355 chr1    14460   0   51M =   14553   144 ACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGA AAFFFJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJF NH:i:6  HI:i:2  AS:i:96 nM:i:2
K00291:228:H3735BBXY:2:1116:19360:7679  355 chr1    14460   0   51M =   14553   144 ACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGA AAFFFJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJF NH:i:6  HI:i:2  AS:i:96 nM:i:2
K00291:228:H3735BBXY:2:1124:30371:28024 403 chr1    14461   0   51M =   14400   -112    CACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGAC 7FAJFFAJF<FFAJJJJJJJ7<FFJJJJFJJFAJFJJJFA-F<AA7-<7-< NH:i:6  HI:i:4  AS:i:88 nM:i:1
K00291:228:H3735BBXY:2:1124:30371:28024 403 chr1    14461   0   51M =   14400   -112    CACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGAC 7FAJFFAJF<FFAJJJJJJJ7<FFJJJJFJJFAJFJJJFA-F<AA7-<7-< NH:i:6  HI:i:4  AS:i:88 nM:i:1
samtools view -H BAM-STAR/G20_CTL_rep1.bam  | tail

@SQ SN:chrUn_KI270755v1 LN:36723
@SQ SN:chrUn_KI270756v1 LN:79590
@SQ SN:chrUn_KI270757v1 LN:71251
@SQ SN:chrUn_GL000214v1 LN:137718
@SQ SN:chrUn_KI270742v1 LN:186739
@SQ SN:chrUn_GL000216v2 LN:176608
@SQ SN:chrUn_GL000218v1 LN:161147
@SQ SN:chrEBV   LN:171823
@PG ID:STAR PN:STAR VN:2.7.3a   CL:STAR   --runThreadN 8   --genomeDir ref/hg38/STAR.gencode.v34   --genomeLoad NoSharedMemory   --readFilesIn FASTQ-TRIMMED/G20_CTL_rep1_R1.trim.fastq.gz   FASTQ-TRIMMED/G20_CTL_rep1_R2.trim.fastq.gz      --readFilesCommand zcat      --outFileNamePrefix logs-align-star/G20_CTL_rep1_   --outStd BAM_Unsorted   --outSAMtype BAM   Unsorted      --outFilterMultimapNmax 100   --winAnchorMultimapNmax 100
@CO user command line: STAR --runThreadN 8 --genomeDir ref/hg38/STAR.gencode.v34 --genomeLoad NoSharedMemory --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --readFilesCommand zcat --readFilesIn FASTQ-TRIMMED/G20_CTL_rep1_R1.trim.fastq.gz FASTQ-TRIMMED/G20_CTL_rep1_R2.trim.fastq.gz --outFileNamePrefix logs-align-star/G20_CTL_rep1_ --outSAMtype BAM Unsorted --outStd BAM_Unsorted

gene GTF

head hg38.refGene.gtf 

chr1    refGene transcript  11874   14409   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018";  gene_name "DDX11L1";
chr1    refGene exon    11874   12227   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018"; exon_number "1"; exon_id "NR_046018.1"; gene_name "DDX11L1";
chr1    refGene exon    12613   12721   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018"; exon_number "2"; exon_id "NR_046018.2"; gene_name "DDX11L1";
chr1    refGene exon    13221   14409   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018"; exon_number "3"; exon_id "NR_046018.3"; gene_name "DDX11L1";
chr1    refGene transcript  14362   29370   .   -   .   gene_id "WASH7P"; transcript_id "NR_024540";  gene_name "WASH7P";
chr1    refGene exon    14362   14829   .   -   .   gene_id "WASH7P"; transcript_id "NR_024540"; exon_number "1"; exon_id "NR_024540.1"; gene_name "WASH7P";
chr1    refGene exon    14970   15038   .   -   .   gene_id "WASH7P"; transcript_id "NR_024540"; exon_number "2"; exon_id "NR_024540.2"; gene_name "WASH7P";
chr1    refGene exon    15796   15947   .   -   .   gene_id "WASH7P"; transcript_id "NR_024540"; exon_number "3"; exon_id "NR_024540.3"; gene_name "WASH7P";
chr1    refGene exon    16607   16765   .   -   .   gene_id "WASH7P"; transcript_id "NR_024540"; exon_number "4"; exon_id "NR_024540.4"; gene_name "WASH7P";
chr1    refGene exon    16858   17055   .   -   .   gene_id "WASH7P"; transcript_id "NR_024540"; exon_number "5"; exon_id "NR_024540.5"; gene_name "WASH7P";

TE GTF

head  hg38_rmsk_TE.gtf 

chr1    hg38_rmsk   exon    100000001   100000637   11325   -   .   gene_id "L1M2"; transcript_id "L1M2_dup229"; family_id "L1"; class_id "LINE";
chr1    hg38_rmsk   exon    10000002    10000239    1760    +   .   gene_id "AluSx3"; transcript_id "AluSx3_dup157"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100000744   100002612   11325   -   .   gene_id "L1M2"; transcript_id "L1M2_dup230"; family_id "L1"; class_id "LINE";
chr1    hg38_rmsk   exon    10000251    10000566    2225    +   .   gene_id "AluSx"; transcript_id "AluSx_dup700"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100002613   100002913   1799    -   .   gene_id "AluJr"; transcript_id "AluJr_dup3513"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100002914   100003133   11325   -   .   gene_id "L1M2"; transcript_id "L1M2_dup231"; family_id "L1"; class_id "LINE";
chr1    hg38_rmsk   exon    100003134   100003444   2452    +   .   gene_id "AluSq2"; transcript_id "AluSq2_dup2641"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100003445   100004305   6882    -   .   gene_id "L1M2"; transcript_id "L1M2_dup232"; family_id "L1"; class_id "LINE";
chr1    hg38_rmsk   exon    100004306   100004578   1643    -   .   gene_id "AluJb"; transcript_id "AluJb_dup6336"; family_id "Alu"; class_id "SINE";
chr1    hg38_rmsk   exon    100004579   100004715   6794    -   .   gene_id "L1M2"; transcript_id "L1M2_dup233"; family_id "L1"; class_id "LINE";

Best, Zeyan

olivertam commented 3 years ago

Hi Zayan,

Would you mind sharing the BAM file with me (or at least all alignments from a chromosome)? I can't tell from the excerpt what might be wrong.

Thanks.

ZeyanZhang commented 3 years ago

Thanks, Oliver. Please see the link below for the BAM file. Please let me know if you can not open it.

https://www.dropbox.com/s/it17x5rgvo3lh7m/G20_CTL_rep1.bam?dl=0

Thanks again for your kind help, Zeyan

On Wed, Aug 25, 2021 at 11:44 AM Oliver Tam @.***> wrote:

Hi Zayan,

Would you mind sharing the BAM file with me (or at least all alignments from a chromosome)? I can't tell from the excerpt what might be wrong.

Thanks.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mhammell-laboratory/TEtranscripts/issues/98#issuecomment-905630387, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALJ6EK45U5LDZJCHIFGDQ7TT6UFULANCNFSM5CZEU64Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

-- ωǒ京ㄤ是め娲ネト天剩下DEζ疯狂de石头∞→◎

olivertam commented 3 years ago

I am able to reproduce your error. I'll start digging into the BAM file, and let you know what I found.

Thanks.

olivertam commented 3 years ago

Hi Zeyan,

There is something very weird with your BAM files. I extracted a bunch of uniquely aligned reads from chromosome 7, and sorted them by sequence name. I normally would expect them to be orientated in the F/R direction (e.g. SAM flag pairs of 82 and 163, or 99 and 147). However, what I see is that the both alignments with the same read name have the same orientation:

K00291:228:H3735BBXY:1:1101:1519:16471  83      chr7    128748792       255     13M5457N37M1S   =       128748731       -5568   AGGAAAGGCTTGGAAAGATTGTAAGTAAAATAGATGGCGACAAGGACGGGC     JJJJJJJJJJFJJJJJJJFJJJJJJJJAJ<AJJAJJJFA7FJJJJJFFFAA     NH:i:1  HI:i:1  AS:i:99 nM:i:0
K00291:228:H3735BBXY:1:1101:1519:16471  83      chr7    128748792       255     13M5457N37M1S   =       128748731       -5568   AGGAAAGGCTTGGAAAGATTGTAAGTAAAATAGATGGCGACAAGGACGGGC     JJJJJJJJJJFJJJJJJJFJJJJJJJJAJ<AJJAJJJFA7FJJJJJFFFAA     NH:i:1  HI:i:1  AS:i:99 nM:i:0
K00291:228:H3735BBXY:1:1101:1519:16471  163     chr7    128748731       255     50M     =       128748792       5568    GATGCCTTCTTGGGTGCTGAAGAAGCAAAGACCTTTGATCAGCTGACACC      AA<<AJJJFFJJJJJFJAAJJJFJJJJJ<FJJJJJJFJJJJJF7AJJFJF      NH:i:1  HI:i:1  AS:i:99 nM:i:0
K00291:228:H3735BBXY:1:1101:1519:16471  163     chr7    128748731       255     50M     =       128748792       5568    GATGCCTTCTTGGGTGCTGAAGAAGCAAAGACCTTTGATCAGCTGACACC      AA<<AJJJFFJJJJJFJAAJJJFJJJJJ<FJJJJJJFJJJJJF7AJJFJF      NH:i:1  HI:i:1  AS:i:99 nM:i:0
K00291:228:H3735BBXY:1:1101:1519:21676  99      chr7    5528495 255     51M     =       5529164 714     GCGCTCGGTGAGGATCTTCATGAGGTAGTCAGTCAGGTCCCGGCCAGCCAG     AAAFAFJJJFJJJFAJFJJJJJAFAJFJFJJJJJJJJJJFJJJJFJJJJAJ     NH:i:1  HI:i:1  AS:i:94 nM:i:0
K00291:228:H3735BBXY:1:1101:1519:21676  99      chr7    5528495 255     51M     =       5529164 714     GCGCTCGGTGAGGATCTTCATGAGGTAGTCAGTCAGGTCCCGGCCAGCCAG     AAAFAFJJJFJJJFAJFJJJJJAFAJFJFJJJJJJJJJJFJJJJFJJJJAJ     NH:i:1  HI:i:1  AS:i:94 nM:i:0

This has me wondering if the two FASTQ inputs are the same. You could check the sequences in the trimmed FASTQ file to ensure that they are not the same (and they shouldn't be, since you would only get this situation if your fragment is an inverted repeat). This observation would explain why TEcount failed, as it was trying to find "properly paired" reads (F/R orientation), and since it could not find any, failed to calculate an insert size and exited with error.

Let me know if this does not make sense.

Thanks.

ZeyanZhang commented 3 years ago

@olivertam Hi Oliver, Thank you so much!!! The R1 and R2 FASTQ are not the same as shown below. Yes, it's so wired. I will re-do alignment to see what's wrong. I will let you know how it goes.

R1

zcat FASTQ-TRIMMED/G20_CTL_rep1_R1.trim.fastq.gz |head

@K00291:228:H3735BBXY:1:1101:3599:1156 1:N:0:ATTCAGAA+TAAGATTA
TGTAAAAGCAGCACCTTTGTGACGTTTTAACTTTAGTATTCCTCTCCTTCT
+
AAFFFJJJFJJJJJJFJJJJJJJJFFJJJJJJJJJJAJJFJ7AJJJJJJAF
@K00291:228:H3735BBXY:1:1101:9019:1156 1:N:0:ATTCAGAN+TAAGATTA
GGTACCTCCAGCGCCCGAGCCGTCCAGGCGGCCAGCAGGAGCAGTGCCNAA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ#JJ
@K00291:228:H3735BBXY:1:1101:9080:1156 1:N:0:ATTCAGAN+TAAGATTA
GCTTGGTACTTATGCTGGGAGATATGCAAAATGTGTTTTTCAATGTTTNCT

R2

zcat FASTQ-TRIMMED/G20_CTL_rep1_R2.trim.fastq.gz |head

@K00291:228:H3735BBXY:1:1101:3599:1156 2:N:0:ATTCAGAA+TAAGATTA
GCCAAACGTGCCATTGTTCACTAGTACGATGGGGAGGAGAAGGAGGAAG
+
AA<AFF7JAFJF-77F-7FAF7-<-7A7<-7A<FJFA-7<-<FF<FJF<
@K00291:228:H3735BBXY:1:1101:9019:1156 2:N:0:ATTCAGAN+TAAGATTA
GCGCAGGGTCGCGATGCTGCCCGGTTTNGCACTNCTCCTGCTGGCCGNCT
+
AAFFFJJFJJJJJJJJJFAFJJJJFJJ#JJJJJ#J<FJFJJFFFAFF#JA
@K00291:228:H3735BBXY:1:1101:9080:1156 2:N:0:ATTCAGAN+TAAGATTA
GGCATTGATCCTAACAGCTTGCATTATNTGGTGNATCTGCTCAATCANGT

Thanks again. Best, Zeyan

ZeyanZhang commented 3 years ago

@olivertam

Hi Oliver, Thank you so much for the kind help! Yes, the problem is the BAM files. The problem was solved after doing re-alignment from my raw FASTQ files. Thanks again! Best, Zeyan