count matrix contain >60000 genes, but only <1000 TE

qiangfan2022 commented 2 years ago

This is my code: TEtranscripts --sortByPos --format BAM --mode multi -t SRR13797132Aligned.sortedByCoord.out.bam SRR13797133Aligned.sortedByCoord.out.bam -c SRR13797134Aligned.sortedByCoord.out.bam SRR13797135Aligned.sortedByCoord.out.bam --GTF /home/dell/database/hg19/gencode.v39lift37.annotation.gtf --TE /home/dell/database/tetranscripts/GRCh37_GENCODE_rmsk_TE.gtf --project sample_sorted_test

I mapped the reads to hg19 with STAR, downloaded the gene GTF from GENCODE, and downloaded the TE GTF from https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/GRCh37_GENCODE_rmsk_TE.gtf.gz.

When I run the command, the output looks like this:

INFO  @ Thu, 03 Feb 2022 01:08:59: 
# ARGUMENTS LIST:
# name = sample_sorted_test
# treatment files = ['SRR13797132Aligned.sortedByCoord.out.bam', 'SRR13797133Aligned.sortedByCoord.out.bam']
# control files = ['SRR13797134Aligned.sortedByCoord.out.bam', 'SRR13797135Aligned.sortedByCoord.out.bam']
# GTF file = /home/dell/database/hg19/gencode.v39lift37.annotation.gtf 
# TE file = /home/dell/database/tetranscripts/GRCh37_GENCODE_rmsk_TE.gtf 
# multi-mapper mode = multi 
# stranded = no
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Thu, 03 Feb 2022 01:08:59: Processing GTF files ...

INFO  @ Thu, 03 Feb 2022 01:08:59: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
1300000 GTF lines processed.
1400000 GTF lines processed.
1500000 GTF lines processed.
INFO  @ Thu, 03 Feb 2022 01:34:03: Done building gene index ...... 

INFO  @ Thu, 03 Feb 2022 01:34:12: Building TE index ....... 

INFO  @ Thu, 03 Feb 2022 01:43:51: Done building TE index ...... 

INFO  @ Thu, 03 Feb 2022 01:43:51: 
Reading sample files ... 

**[E::idx_find_and_load] Could not retrieve index file for '.1643823831.8598983.bam'**
1000000 alignments processed. 
2000000 alignments processed. 
3000000 alignments processed. 
4000000 alignments processed. 
uniq te counts = 184526 
.......start iterative optimization .......... 
multi-reads = 23169 total means = 204
after normalization total means0 = 1.000000000000034
SQUAREM iteraton [1]
.....................

After the run finishes, the count matrix contains >60000 genes but <1000 TEs. How can I resolve this?

olivertam commented 2 years ago

Hi,

Are you saying that there are <1000 entries for TE? This is expected, as TEtranscripts aggregates TE counts by "subfamily" (e.g. L1HS), of which there are ~1000. Thus each TE entry may have counts coming from various copies of that TE throughout the genome. On an unrelated note, if you are aligning to hg19 from UCSC, you should be aware that scaffolds, alt haplotypes and patches are named differently from GENCODE. Thus, any feature (gene or TE) on those chromosomes will not be quantified if the chromosome names don't match.
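The chromosome-name point can be checked with a small set comparison. The sketch below is not part of TEtranscripts; `unmatched_seqnames` is a hypothetical helper, and the two name lists are hand-written stand-ins for the reference names in a BAM header and the first column of a GTF.

```python
def unmatched_seqnames(bam_refs, gtf_seqnames):
    """Return annotation sequence names that never appear among the BAM references."""
    return sorted(set(gtf_seqnames) - set(bam_refs))


# Illustrative values only (assumption): a UCSC-style alt haplotype name in the
# BAM versus a GENCODE-style accession in the GTF.
bam_refs = ["chr1", "chr2", "chr6_apd_hap1"]
gtf_seqnames = ["chr1", "chr2", "GL000250.1"]

missing = unmatched_seqnames(bam_refs, gtf_seqnames)
print(missing)  # features on these sequences would not be quantified
```

In practice the two lists would come from the `@SQ` lines of the BAM header and the unique values of the GTF's first column.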

Thanks.

qiangfan2022 commented 2 years ago

It also reported this error: /home/dell/miniconda3/envs/tetranscripts/lib/R/bin/exec/R: error while loading shared libraries: libreadline.so.6: cannot open shared object file: No such file or directory

olivertam commented 2 years ago

Hi,

This would be an error stemming from your R installation, rather than TEtranscripts. You might be able to find the solution here.

Thanks.

qiangfan2022 commented 2 years ago

Thanks very much. Here is part of the count matrix:

"ENSG00000289641.1_1" 0 0 0 0
"ENSG00000289642.1_1" 0 0 0 1
"ENSG00000289643.1_1" 0 0 0 0
"ENSG00000289644.1_1" 0 0 0 0
(CATTC)n:Satellite:Satellite 191 54 136 25
(GAATG)n:Satellite:Satellite 143 63 92 28
7SK:RNA:RNA 35 24 34 19
ACRO1:acro:Satellite 2 1 1 3
ALINE:RTE:LINE 5 3 3 0
ALR/Alpha:centr:Satellite 795 180 202 113
AluJb:Alu:SINE 6512 5683 2840 6973
AluJo:Alu:SINE 3909 3387 1558 3910
AluJr4:Alu:SINE 698 576 257 740
AluJr:Alu:SINE 3776 3288 1509 4188
AluSc5:Alu:SINE 315 221 139 313
AluSc8:Alu:SINE 1071 1064 515 1181
AluSc:Alu:SINE 1676 1535 770 1729
AluSg4:Alu:SINE 360 372 204 399
AluSg7:Alu:SINE 330 302 166 393
AluSg:Alu:SINE 2114 2113 1049 2300
AluSp:Alu:SINE 3143 3042 1515 3306
AluSq10:Alu:SINE 81 80 26 79
AluSq2:Alu:SINE 3237 2767 1484 3225
AluSq4:Alu:SINE 53 39 27 43
AluSq:Alu:SINE 1269 1153 608 1191
AluSx1:Alu:SINE 5818 5235 2668 5914
AluSx3:Alu:SINE 1375 1264 601 1402
AluSx4:Alu:SINE 286 280 123 270
AluSx:Alu:SINE 7041 6401 3333 7242
AluSz6:Alu:SINE 2288 1966 965 2375
AluSz:Alu:SINE 4956 4598 2328 5287
AluY:Alu:SINE 5121 4805 2606 5155
AluYa5:Alu:SINE 323 242 187 222
AluYa8:Alu:SINE 3 4 1 4
AluYb8:Alu:SINE 98 65 33 81
AluYb9:Alu:SINE 3 15 5 13
AluYc3:Alu:SINE 22 22 10 21
AluYc5:Alu:SINE 0 0 0 1
AluYc:Alu:SINE 190 208 120 208
AluYd8:Alu:SINE 1 2 0 3
AluYf4:Alu:SINE 48 37 23 61
AluYf5:Alu:SINE 2 1 1 0
AluYg6:Alu:SINE 21 22 13 33
AluYh9:Alu:SINE 4 6 1 2
AluYk11:Alu:SINE 9 6 4 8
AluYk12:Alu:SINE 81 75 43 52
AluYk4:Alu:SINE 100 81 49 75
AmnSINE1:Deu:SINE 14 10 1 9
AmnSINE2:Deu:SINE 4 0 1 3
Arthur1:hAT-Tip100:DNA 40 38 14 35
Arthur1A:hAT-Tip100:DNA 29 13 5 24
Arthur1B:hAT-Tip100:DNA 40 37 15 55
Arthur1C:hAT-Tip100:DNA 8 6 5 13
BLACKJACK:hAT-Blackjack:DNA 74 73 28 70
BSR/Beta:Satellite:Satellite 50 14 14 45
CER:Satellite:Satellite 8 6 1 3
CR1_Mam:CR1:LINE 156 138 69 142
Charlie10:hAT-Charlie:DNA 45 57 22 35
Charlie10a:hAT-Charlie:DNA 6 4 2 7
Charlie10b:hAT-Charlie:DNA 5 8 3 13
Charlie11:hAT-Charlie:DNA 4 3 0 1
Charlie12:hAT-Charlie:DNA 0 1 0 1
Charlie13a:hAT-Charlie:DNA 21 13 7 15
Charlie13b:hAT-Charlie:DNA 19 9 13 18
Charlie14a:hAT-Charlie:DNA 5 4 0 9
Charlie15a:hAT-Charlie:DNA 117 129 42 126
Charlie16a:hAT-Charlie:DNA 82 53 25 87
Charlie17a:hAT-Charlie:DNA 77 100 20 111
Charlie18a:hAT-Charlie:DNA 94 73 31 82
Charlie19a:hAT-Charlie:DNA 68 65 25 100
Charlie1:hAT-Charlie:DNA 205 188 86 181
Charlie1a:hAT-Charlie:DNA 350 261 140 323
Charlie1b:hAT-Charlie:DNA 123 125 50 121
Charlie1b_Mars:hAT-Charlie:DNA 0 0 0 0
Charlie20a:hAT-Charlie:DNA 22 22 7 32
Charlie21a:hAT-Charlie:DNA 38 26 9 59
Charlie22a:hAT-Charlie:DNA 26 29 12 25
Charlie23a:hAT-Charlie:DNA 36 28 10 34
Charlie24:hAT-Charlie:DNA 40 41 11 45
Charlie25:hAT-Charlie:DNA 20 18 11 17
Charlie26a:hAT-Charlie:DNA 3 6 2 11
Charlie2a:hAT-Charlie:DNA 200 212 108 213
Charlie2b:hAT-Charlie:DNA 205 187 68 179
Charlie3:hAT-Charlie:DNA 13 11 3 13
Charlie4:hAT-Charlie:DNA 8 5 2 5
Charlie4a:hAT-Charlie:DNA 121 113 50 115
Charlie4z:hAT-Charlie:DNA 149 109 39 155
Charlie5:hAT-Charlie:DNA 159 121 55 136
Charlie6:hAT-Charlie:DNA 8 7 2 6
Charlie7:hAT-Charlie:DNA 83 65 33 104
Charlie7a:hAT-Charlie:DNA 20 21 10 28
Charlie8:hAT-Charlie:DNA 214 186 62 243
Charlie9:hAT-Charlie:DNA 27 29 7 46
CheshMITE:hAT-Charlie:DNA 0 0 0 0
Cheshire:hAT-Charlie:DNA 21 21 7 23
CheshireMars:hAT-Charlie:DNA 0 0 0 0
D20S16:Satellite:Satellite 2 4 0 8
DNA1_Mam:TcMar:DNA 3 4 2 5
ERV3-16A3_I-int:ERVL:LTR 281 306 78 376
ERV3-16A3_LTR:ERVL:LTR 23 16 9 28
ERVL-B4-int:ERVL:LTR 125 120 72 151
ERVL-E-int:ERVL:LTR 262 198 79 268
ERVL-int:ERVL:LTR 22 10 13 14
Eulor10:DNA:DNA? 0 0 0 0
Eulor11:DNA:DNA 0 0 1 0
Eulor12:DNA:DNA? 1 0 0 1
Eulor1:DNA:DNA 1 1 1 2

Judging by the TE GTF, the TE counts (by gene_id) seem to be aggregated by transcript rather than by subfamily:

chr1 hg19_rmsk exon 100004963 100005398 2754 - . gene_id "Tigger2a"; transcript_id "Tigger2a_dup99"; family_id "TcMar-Tigger"; class_id "DNA";
chr1 hg19_rmsk exon 10000540 10000674 1079 - . gene_id "AluY"; transcript_id "AluY_dup614"; family_id "Alu"; class_id "SINE";
chr1 hg19_rmsk exon 100005498 100005720 1007 + . gene_id "MLT1C"; transcript_id "MLT1C_dup643"; family_id "ERVL-MaLR"; class_id "LTR";
chr1 hg19_rmsk exon 100005720 100006238 3636 - . gene_id "MER1A"; transcript_id "MER1A_dup101"; family_id "hAT-Charlie"; class_id "DNA";
chr1 hg19_rmsk exon 100006239 100006294 341 + . gene_id "MLT1C"; transcript_id "MLT1C_dup644"; family_id "ERVL-MaLR"; class_id "LTR";
chr1 hg19_rmsk exon 100006551 100007134 767 - . gene_id "L1MEg"; transcript_id "L1MEg_dup563"; family_id "L1"; class_id "LINE";
chr1 hg19_rmsk exon 100007242 100007566 245 - . gene_id "L1MEg"; transcript_id "L1MEg_dup564"; family_id "L1"; class_id "LINE";
chr1 hg19_rmsk exon 100007589 100007893 1999 - . gene_id "AluJr"; transcript_id "AluJr_dup3116"; family_id "Alu"; class_id "SINE";
chr1 hg19_rmsk exon 100008691 100008998 2263 - . gene_id "AluSc5"; transcript_id "AluSc5_dup264"; family_id "Alu"; class_id "SINE";
chr1 hg19_rmsk exon 100009353 100009528 748 + . gene_id "MER5B"; transcript_id "MER5B_dup632"; family_id "hAT-Charlie"; class_id "DNA";
chr1 hg19_rmsk exon 10001028 10001188 277 + . gene_id "L2c"; transcript_id "L2c_dup255"; family_id "L2"; class_id "LINE";
chr1 hg19_rmsk exon 100010663 100010883 390 - . gene_id "MIR"; transcript_id "MIR_dup8810"; family_id "MIR"; class_id "SINE";
chr1 hg19_rmsk exon 100011584 100011894 2220 + . gene_id "AluSz"; transcript_id "AluSz_dup4266"; family_id "Alu"; class_id "SINE";
chr1 hg19_rmsk exon 10001248 10001552 2167 - . gene_id "AluSx1"; transcript_id "AluSx1_dup548"; family_id "Alu"; class_id "SINE";
chr1 hg19_rmsk exon 100013022 100013237 505 + . gene_id "MIRc"; transcript_id "MIRc_dup5614"; family_id "MIR"; class_id "SINE";
chr1 hg19_rmsk exon 100013916 100014050 354 - . gene_id "L1M5"; transcript_id "L1M5_dup2005"; family_id "L1"; class_id "LINE";

qiangfan2022 commented 2 years ago

Also, according to the TE GTF, there are more than 1000 gene_id values.

olivertam commented 2 years ago

Hi,

I'm not sure how you concluded that there are more than 1000 gene_id values in the GTF file. If you just extract the gene_id value and do a sort -u, you should get slightly fewer than 1000 unique gene_id values. Note that each line of the GTF refers to a particular copy (defined by a unique transcript_id), but it may share its gene_id with other lines. Your count table looks as expected to me.
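The same check can be done in Python. This is a minimal sketch rather than anything shipped with TEtranscripts: `count_ids` is a hypothetical helper, and the four sample lines are copied from the GTF excerpt earlier in this thread.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_ids(gtf_lines):
    """Count unique gene_id and transcript_id values in an iterable of GTF lines."""
    genes, transcripts = set(), set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        gene = GENE_ID.search(line)
        transcript = TRANSCRIPT_ID.search(line)
        if gene:
            genes.add(gene.group(1))
        if transcript:
            transcripts.add(transcript.group(1))
    return len(genes), len(transcripts)


sample = [
    'chr1 hg19_rmsk exon 100005498 100005720 1007 + . gene_id "MLT1C"; transcript_id "MLT1C_dup643"; family_id "ERVL-MaLR"; class_id "LTR";',
    'chr1 hg19_rmsk exon 100006239 100006294 341 + . gene_id "MLT1C"; transcript_id "MLT1C_dup644"; family_id "ERVL-MaLR"; class_id "LTR";',
    'chr1 hg19_rmsk exon 100006551 100007134 767 - . gene_id "L1MEg"; transcript_id "L1MEg_dup563"; family_id "L1"; class_id "LINE";',
    'chr1 hg19_rmsk exon 100007242 100007566 245 - . gene_id "L1MEg"; transcript_id "L1MEg_dup564"; family_id "L1"; class_id "LINE";',
]

n_genes, n_transcripts = count_ids(sample)
print(n_genes, n_transcripts)  # four copies (transcript_id) collapse to two subfamilies (gene_id)
```

For the full annotation, an open file handle can be passed in place of `sample`, since the function only iterates over lines.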

Thanks.

qiangfan2022 commented 2 years ago

Thanks so much!!! I have confirmed this using your method.

qiangfan2022 commented 2 years ago

I'm sorry to bother you again. I have hundreds of BAM files, and running TEtranscripts or TEcount on them is slow, so I am going to try using the TEtranscripts GTF files with other tools such as featureCounts to obtain counts directly. Does --mode multi in TEtranscripts refer to overlapping or to mapping?

Thanks

olivertam commented 2 years ago

Hi,

--mode multi refers to taking multimappers into account. If you're using featureCounts, you would probably want to use the following flags: -f, -O, -M, --fraction and -s (if you're using stranded libraries)
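As a rough illustration of those flags, the sketch below only assembles a featureCounts argument list; it does not run anything. The helper name `featurecounts_command`, the file names, and the extra options (`-a`, `-o`, `-F`, `-t`, `-g`, `-T`) are assumptions added for completeness, not part of the suggestion above.

```python
def featurecounts_command(bam_files, te_gtf, output, strandedness=0, threads=4):
    """Build a featureCounts argv list using the flags suggested in this thread."""
    cmd = [
        "featureCounts",
        "-a", te_gtf,             # annotation file (assumed: the TE GTF)
        "-o", output,             # output count table
        "-F", "GTF",              # annotation format
        "-t", "exon",             # feature type used in the TE GTF
        "-g", "gene_id",          # attribute naming the subfamily
        "-f",                     # count at the feature level
        "-O",                     # allow reads overlapping several features
        "-M",                     # count multi-mapping reads
        "--fraction",             # fractional counts for -M / -O reads
        "-s", str(strandedness),  # 0 unstranded, 1 stranded, 2 reversely stranded
        "-T", str(threads),
    ]
    cmd.extend(bam_files)
    return cmd


# Hypothetical file names for illustration.
cmd = featurecounts_command(["sample1.bam", "sample2.bam"], "TE.gtf", "te_counts.txt")
print(" ".join(cmd))
```

Because `-f` counts each GTF line separately, the output would probably still need to be summed per gene_id to resemble the subfamily-level table that TEtranscripts produces.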

I'm assuming that you can't easily run multiple TEcount at once, but if you could, you can speed things up by a) pre-sorting your BAM files by read name (and thus removing the --sortByPos parameter), and b) using a pre-built index available here (I think you're using GRCh37 GENCODE). You just have to decompress it (it's gzipped), and use that instead of the GTF (to skip building the TE index).
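If several TEcount jobs can run side by side, something like the following sketch could drive them. It is an assumption-laden outline: `tecount_command` and `run_all` are hypothetical helpers, the project name is derived from the BAM file name, and the number of workers is arbitrary and likely limited by available memory, since each job loads its own annotation index.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def tecount_command(bam, gene_gtf, te_annotation, sorted_by_pos=False):
    """Build a TEcount argv list for one BAM file."""
    cmd = [
        "TEcount",
        "-b", bam,
        "--GTF", gene_gtf,
        "--TE", te_annotation,  # TE GTF, or the decompressed pre-built index
        "--format", "BAM",
        "--mode", "multi",
        "--project", Path(bam).stem,
    ]
    if sorted_by_pos:
        cmd.append("--sortByPos")  # only needed for coordinate-sorted BAMs
    return cmd


def run_all(bams, gene_gtf, te_annotation, max_workers=4):
    """Run one TEcount process per BAM, a few at a time; return the exit codes."""
    cmds = [tecount_command(b, gene_gtf, te_annotation) for b in bams]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c).returncode, cmds))


# Hypothetical inputs; only the command is built here, nothing is executed.
example = tecount_command("sampleA.bam", "genes.gtf", "TE.gtf")
print(" ".join(example))
```

The default `sorted_by_pos=False` assumes the BAM files were already sorted by read name, as suggested in point a).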

Thanks.

mhammell-laboratory / TEtranscripts

count matrix contain >60000 genes, but only <1000 TE #108