mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

There are not enough reads to estimate fragment length. #139

Closed ghost closed 1 year ago

ghost commented 1 year ago

Hello, thank you very much for this amazing tool. I get this error when running TEcounts in a dataset, which I haven't seen in other datasets.

TEcount --stranded no -b /data_lu/users/shinminkyu1_ct7/GSE165252//SRR13521096_T/1pass_SRR13521096_T/Aligned.out.bam --GTF /data_lu/users/shinminkyu1_ct7/hg38_gencode.v34.primary_assembly.annotation.gtf --TE /data_lu/users/shinminkyu1_ct7/GRCh38_GENCODE_rmsk_TE.gtf --outdir /data_lu/users/shinminkyu1_ct7/GSE165252//SRR13521096_T/1pass_SRR13521096_T
INFO  @ Sun, 21 May 2023 01:02:49:
ARGUMENTS LIST:
name = TEtranscripts_out
BAM file = /data_lu/users/shinminkyu1_ct7/GSE165252//SRR13521096_T/1pass_SRR13521096_T/Aligned.out.bam
GTF file = /data_lu/users/shinminkyu1_ct7/hg38_gencode.v34.primary_assembly.annotation.gtf
TE file = /data_lu/users/shinminkyu1_ct7/GRCh38_GENCODE_rmsk_TE.gtf
multi-mapper mode = multi
stranded = no
number of iteration = 100
Alignments grouped by read ID = True
INFO  @ Sun, 21 May 2023 01:02:49: Processing GTF files ...
INFO  @ Sun, 21 May 2023 01:02:49: Building gene index .......
100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
INFO  @ Sun, 21 May 2023 01:12:55: Done building gene index ......
INFO  @ Sun, 21 May 2023 01:13:14: Building TE index .......
INFO  @ Sun, 21 May 2023 01:16:56: Done building TE index ......
INFO  @ Sun, 21 May 2023 01:16:56:
Reading sample file ...
[E::idx_find_and_load] Could not retrieve index file for '/data_lu/users/shinminkyu1_ct7/GSE165252//SRR13521096_T/1pass_SRR13521096_T/Aligned.out.bam'
1000000 alignments  processed.
3000000 alignments  processed.
4000000 alignments  processed.
5000000 alignments  processed.
6000000 alignments  processed.
7000000 alignments  processed.
8000000 alignments  processed.
10000000 alignments  processed.
12000000 alignments  processed.
17000000 alignments  processed.
21000000 alignments  processed.
22000000 alignments  processed.
23000000 alignments  processed.
24000000 alignments  processed.
25000000 alignments  processed.
26000000 alignments  processed.
30000000 alignments  processed.
35000000 alignments  processed.
36000000 alignments  processed.
37000000 alignments  processed.
38000000 alignments  processed.
39000000 alignments  processed.
40000000 alignments  processed.
41000000 alignments  processed.
46000000 alignments  processed.
50000000 alignments  processed.
uniq te counts = 3004318
.......start iterative optimization ..........
There are not enough reads to estimate fragment length.
Error in optimization
Error occurred when assigning read gene/TE annotations in file /data_lu/users/shinminkyu1_ct7/GSE165252//SRR13521096_T/1pass_SRR13521096_T/Aligned.out.bam.
Error: No active exception to reraise

But the input BAM file, which was aligned using STAR as I did in other datasets, seems to have an enough number of mapped reads.

                         Number of input reads |        32390596
                     Average input read length |        58
                                   UNIQUE READS:
                  Uniquely mapped reads number |        25693002
                       Uniquely mapped reads % |        79.32%
                         Average mapped length |        50.81
                      Number of splices: Total |        3281505
           Number of splices: Annotated (sjdb) |        3255041
                      Number of splices: GT/AG |        3252671
                      Number of splices: GC/AG |        21430
                      Number of splices: AT/AC |        2582
              Number of splices: Non-canonical |        4822
                     Mismatch rate per base, % |        0.21%
                        Deletion rate per base |        0.01%
                       Deletion average length |        1.35
                       Insertion rate per base |        0.00%
                      Insertion average length |        1.19
                            MULTI-MAPPING READS:
       Number of reads mapped to multiple loci |        6170388
            % of reads mapped to multiple loci |        19.05%
       Number of reads mapped to too many loci |        259
            % of reads mapped to too many loci |        0.00%
                                 UNMAPPED READS:
      % of reads unmapped: too many mismatches |        0.00%
                % of reads unmapped: too short |        1.37%
                    % of reads unmapped: other |        0.25%
                                 CHIMERIC READS:
                      Number of chimeric reads |        0
                           % of chimeric reads |        0.00%

Could you please help me find out a solution for this problem?

Here is the header of BAM file:

@HD     VN:1.4
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717
@SQ     SN:chr10        LN:133797422
@SQ     SN:chr11        LN:135086622
@SQ     SN:chr12        LN:133275309
@SQ     SN:chr13        LN:114364328
@SQ     SN:chr14        LN:107043718
@SQ     SN:chr15        LN:101991189
@SQ     SN:chr16        LN:90338345
@SQ     SN:chr17        LN:83257441
@SQ     SN:chr18        LN:80373285
@SQ     SN:chr19        LN:58617616
@SQ     SN:chr20        LN:64444167
@SQ     SN:chr21        LN:46709983
@SQ     SN:chr22        LN:50818468
@SQ     SN:chrX LN:156040895
@SQ     SN:chrY LN:57227415
@SQ     SN:chrM LN:16569
@SQ     SN:GL000008.2   LN:209709
@SQ     SN:GL000009.2   LN:201709
@SQ     SN:GL000194.1   LN:191469
@SQ     SN:GL000195.1   LN:182896
@SQ     SN:GL000205.2   LN:185591
@SQ     SN:GL000208.1   LN:92689
@SQ     SN:GL000213.1   LN:164239
@SQ     SN:GL000214.1   LN:137718
@SQ     SN:GL000216.2   LN:176608
@SQ     SN:GL000218.1   LN:161147
@SQ     SN:GL000219.1   LN:179198
@SQ     SN:GL000220.1   LN:161802
@SQ     SN:GL000221.1   LN:155397
@SQ     SN:GL000224.1   LN:179693
@SQ     SN:GL000225.1   LN:211173
@SQ     SN:GL000226.1   LN:15008
@SQ     SN:KI270302.1   LN:2274
@SQ     SN:KI270303.1   LN:1942
@SQ     SN:KI270304.1   LN:2165
@SQ     SN:KI270305.1   LN:1472
@SQ     SN:KI270310.1   LN:1201
@SQ     SN:KI270311.1   LN:12399
@SQ     SN:KI270312.1   LN:998
@SQ     SN:KI270315.1   LN:2276
@SQ     SN:KI270316.1   LN:1444
@SQ     SN:KI270317.1   LN:37690
@SQ     SN:KI270320.1   LN:4416
@SQ     SN:KI270322.1   LN:21476
@SQ     SN:KI270329.1   LN:1040
@SQ     SN:KI270330.1   LN:1652
@SQ     SN:KI270333.1   LN:2699
@SQ     SN:KI270334.1   LN:1368
@SQ     SN:KI270335.1   LN:1048
@SQ     SN:KI270336.1   LN:1026
@SQ     SN:KI270337.1   LN:1121
@SQ     SN:KI270338.1   LN:1428
@SQ     SN:KI270340.1   LN:1428
@SQ     SN:KI270362.1   LN:3530
@SQ     SN:KI270363.1   LN:1803
@SQ     SN:KI270364.1   LN:2855
@SQ     SN:KI270366.1   LN:8320
@SQ     SN:KI270371.1   LN:2805
@SQ     SN:KI270372.1   LN:1650
@SQ     SN:KI270373.1   LN:1451
@SQ     SN:KI270374.1   LN:2656
@SQ     SN:KI270375.1   LN:2378
@SQ     SN:KI270376.1   LN:1136
@SQ     SN:KI270378.1   LN:1048
@SQ     SN:KI270379.1   LN:1045
@SQ     SN:KI270381.1   LN:1930
@SQ     SN:KI270382.1   LN:4215
@SQ     SN:KI270383.1   LN:1750
@SQ     SN:KI270384.1   LN:1658
@SQ     SN:KI270385.1   LN:990
@SQ     SN:KI270386.1   LN:1788
@SQ     SN:KI270387.1   LN:1537
@SQ     SN:KI270388.1   LN:1216
@SQ     SN:KI270389.1   LN:1298
@SQ     SN:KI270390.1   LN:2387
@SQ     SN:KI270391.1   LN:1484
@SQ     SN:KI270392.1   LN:971
@SQ     SN:KI270393.1   LN:1308
@SQ     SN:KI270394.1   LN:970
@SQ     SN:KI270395.1   LN:1143
@SQ     SN:KI270396.1   LN:1880
@SQ     SN:KI270411.1   LN:2646
@SQ     SN:KI270412.1   LN:1179
@SQ     SN:KI270414.1   LN:2489
@SQ     SN:KI270417.1   LN:2043
@SQ     SN:KI270418.1   LN:2145
@SQ     SN:KI270419.1   LN:1029
@SQ     SN:KI270420.1   LN:2321
@SQ     SN:KI270422.1   LN:1445
@SQ     SN:KI270423.1   LN:981
@SQ     SN:KI270424.1   LN:2140
@SQ     SN:KI270425.1   LN:1884
@SQ     SN:KI270429.1   LN:1361
@SQ     SN:KI270435.1   LN:92983
@SQ     SN:KI270438.1   LN:112505
@SQ     SN:KI270442.1   LN:392061
@SQ     SN:KI270448.1   LN:7992
@SQ     SN:KI270465.1   LN:1774
@SQ     SN:KI270466.1   LN:1233
@SQ     SN:KI270467.1   LN:3920
@SQ     SN:KI270468.1   LN:4055
@SQ     SN:KI270507.1   LN:5353
@SQ     SN:KI270508.1   LN:1951
@SQ     SN:KI270509.1   LN:2318
@SQ     SN:KI270510.1   LN:2415
@SQ     SN:KI270511.1   LN:8127
@SQ     SN:KI270512.1   LN:22689
@SQ     SN:KI270515.1   LN:6361
@SQ     SN:KI270516.1   LN:1300
@SQ     SN:KI270517.1   LN:3253
@SQ     SN:KI270518.1   LN:2186
@SQ     SN:KI270519.1   LN:138126
@SQ     SN:KI270521.1   LN:7642
@SQ     SN:KI270522.1   LN:5674
@SQ     SN:KI270528.1   LN:2983
@SQ     SN:KI270529.1   LN:1899
@SQ     SN:KI270530.1   LN:2168
@SQ     SN:KI270538.1   LN:91309
@SQ     SN:KI270539.1   LN:993
@SQ     SN:KI270544.1   LN:1202
@SQ     SN:KI270548.1   LN:1599
@SQ     SN:KI270579.1   LN:31033
@SQ     SN:KI270580.1   LN:1553
@SQ     SN:KI270581.1   LN:7046
@SQ     SN:KI270582.1   LN:6504
@SQ     SN:KI270583.1   LN:1400
@SQ     SN:KI270584.1   LN:4513
@SQ     SN:KI270587.1   LN:2969
@SQ     SN:KI270588.1   LN:6158
@SQ     SN:KI270589.1   LN:44474
@SQ     SN:KI270590.1   LN:4685
@SQ     SN:KI270591.1   LN:5796
@SQ     SN:KI270593.1   LN:3041
@SQ     SN:KI270706.1   LN:175055
@SQ     SN:KI270707.1   LN:32032
@SQ     SN:KI270708.1   LN:127682
@SQ     SN:KI270709.1   LN:66860
@SQ     SN:KI270710.1   LN:40176
@SQ     SN:KI270711.1   LN:42210
@SQ     SN:KI270712.1   LN:176043
@SQ     SN:KI270713.1   LN:40745
@SQ     SN:KI270714.1   LN:41717
@SQ     SN:KI270715.1   LN:161471
@SQ     SN:KI270716.1   LN:153799
@SQ     SN:KI270717.1   LN:40062
@SQ     SN:KI270718.1   LN:38054
@SQ     SN:KI270719.1   LN:176845
@SQ     SN:KI270720.1   LN:39050
@SQ     SN:KI270721.1   LN:100316
@SQ     SN:KI270722.1   LN:194050
@SQ     SN:KI270723.1   LN:38115
@SQ     SN:KI270724.1   LN:39555
@SQ     SN:KI270725.1   LN:172810
@SQ     SN:KI270726.1   LN:43739
@SQ     SN:KI270727.1   LN:448248
@SQ     SN:KI270728.1   LN:1872759
@SQ     SN:KI270729.1   LN:280839
@SQ     SN:KI270730.1   LN:112551
@SQ     SN:KI270731.1   LN:150754
@SQ     SN:KI270732.1   LN:41543
@SQ     SN:KI270733.1   LN:179772
@SQ     SN:KI270734.1   LN:165050
@SQ     SN:KI270735.1   LN:42811
@SQ     SN:KI270736.1   LN:181920
@SQ     SN:KI270737.1   LN:103838
@SQ     SN:KI270738.1   LN:99375
@SQ     SN:KI270739.1   LN:73985
@SQ     SN:KI270740.1   LN:37240
@SQ     SN:KI270741.1   LN:157432
@SQ     SN:KI270742.1   LN:186739
@SQ     SN:KI270743.1   LN:210658
@SQ     SN:KI270744.1   LN:168472
@SQ     SN:KI270745.1   LN:41891
@SQ     SN:KI270746.1   LN:66486
@SQ     SN:KI270747.1   LN:198735
@SQ     SN:KI270748.1   LN:93321
@SQ     SN:KI270749.1   LN:158759
@SQ     SN:KI270750.1   LN:148850
@SQ     SN:KI270751.1   LN:150742
@SQ     SN:KI270752.1   LN:27745
@SQ     SN:KI270753.1   LN:62944
@SQ     SN:KI270754.1   LN:40191
@SQ     SN:KI270755.1   LN:36723
@SQ     SN:KI270756.1   LN:79590
@SQ     SN:KI270757.1   LN:71251
olivertam commented 1 year ago

Hi,

Thank you for your interest in the software. Could you provide the first 1,000,000 lines of your BAM file?

Thanks.

ghost commented 1 year ago

Thank you for the quick response. Due to the upload limit of file size, here is the compressed file of the first 900,000 lines of the BAM file.

SRR13521096_BAM_900k.txt.gz

olivertam commented 1 year ago

Hi,

I noticed that the alignments you provided are all paired-end reads with unmapped mates. This is not typically what we see from STAR's paired end reads. This would explain the error, as TEcount is trying to estimate fragment length assuming that it's a paired-end alignment, but the other end is unmapped (and thus fragment length is undeterminable).

Could you either double-check your STAR command line, or provide that here?

If you run the following, do you get any output?

samtools view -F 4 -F 8 [BAM] > [output file]

Thanks

ghost commented 1 year ago

Appreciate your kind help. There is no output from the code you asked, and here are the STAR command line I runned and the log from it - there was no problem when running it and TEcounts in other datasets. Actually, there was one strange point I noticed from this dataset - the size of read1 files are approximately twice the size of read2 files.

/data/tools/STAR-2.5.2b/bin/Linux_x86_64/STAR \ --outSAMtype BAM Unsorted \ --runThreadN 4 \ --genomeDir /data_lu/users/shinminkyu1_ct7/STAR_hg38 \ --genomeLoad LoadAndRemove \ --outFilterMultimapNmax 100 \ --winAnchorMultimapNmax 100 \ --readFilesIn /data_lu/users/shinminkyu1_ct7/GSE165252_BAM//SRR13521096_T/SRR13521096_T_R1.fastq /data_lu/users/shinminkyu1_ct7/GSE165252_BAM//SRR13521096_T/SRR13521096_T_R2.fastq 2>&1| [Wed May 10 21:46:29 KST 2023] May 10 21:46:29 ..... started STAR run [Wed May 10 21:46:29 KST 2023] May 10 21:46:29 ..... loading genome [Wed May 10 21:46:48 KST 2023] May 10 21:46:48 ..... started mapping [Wed May 10 21:50:21 KST 2023] May 10 21:50:21 ..... finished successfully [Wed May 10 21:50:22 KST 2023] No other jobs are attached to the shared memory segment, removing it. [Wed May 10 21:50:24 KST 2023] Elapsed: 00:03:55

Log.txt

olivertam commented 1 year ago

Hi,

That is very strange, as it would mean that there were more reads in one direction than the other. I just quickly downloaded SRR13521096, and noticed that while Read 1 is 58nt, Read 2 is only 7nt in length (and all Ns). I wonder if they uploaded the wrong file. You can see the read length distribution here.

In fact, I noticed that this is the case for all the files in this project, SRX9932092.

These might be best treated as single-end libraries.

Sorry that I couldn't be of more help.

Thanks.

ghost commented 1 year ago

Okay, I understand the problem. I should have first checked the fastq files.. Sorry to bother you and thank you so much for your great help and suggestion.

Really appreciate your efforts in this area. Wish you a good day.