Closed qiangfan2022 closed 2 years ago
Hi,
Are you saying that there are <1000 entries for TE? This is expected, as TEtranscripts aggregates TE counts by "subfamily" (e.g. L1HS), of which there are ~1000. Thus each TE entry may have counts coming from various copies of that TE throughout the genome.
On an unrelated note, if you are aligning to hg19 from UCSC, you should be aware that scaffolds, alt haplotypes and patches are named differently from GENCODE. Thus, any feature (gene or TE) on those chromosomes will not be quantified if the chromosome names don't match.
Thanks.
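The chromosome-name caveat above can be checked before running a long quantification. The following is only a minimal sketch, not part of TEtranscripts: it assumes the BAM header has been dumped to text (for example with `samtools view -H`) and compares the `@SQ` reference names against the first column of a GTF. The function names and the toy records are hypothetical and only illustrate the kind of mismatch described.

```python
def sam_header_names(header_lines):
    """Collect reference names from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def gtf_seqnames(gtf_lines):
    """Collect the sequence names (column 1) used in a GTF."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_seqnames(header_lines, gtf_lines):
    """GTF sequence names that never appear in the alignment header."""
    return sorted(gtf_seqnames(gtf_lines) - sam_header_names(header_lines))


# Toy data (assumption): a UCSC-style header and one GENCODE-style scaffold name.
header = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:249250621",
    "@SQ\tSN:chrUn_gl000220\tLN:161802",
]
gtf = [
    "\t".join(["chr1", "hg19_rmsk", "exon", "10000540", "10000674", "1079",
               "-", ".", 'gene_id "AluY"; transcript_id "AluY_dup614";']),
    "\t".join(["GL000220.1", "hg19_rmsk", "exon", "100", "200", "500",
               "+", ".", 'gene_id "AluY"; transcript_id "AluY_dupX";']),
]

print(unmatched_seqnames(header, gtf))
```

Any name reported here belongs to annotation records that cannot be matched to alignments, so features on those sequences would end up with zero counts.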
It also reported this error: /home/dell/miniconda3/envs/tetranscripts/lib/R/bin/exec/R: error while loading shared libraries: libreadline.so.6: cannot open shared object file: No such file or directory
Hi,
This would be an error stemming from your R installation, rather than TEtranscripts. You might be able to find the solution here.
Thanks.
Thanks very much. These are the details of the count matrix:
"ENSG00000289641.1_1" 0 0 0 0
"ENSG00000289642.1_1" 0 0 0 1
"ENSG00000289643.1_1" 0 0 0 0
"ENSG00000289644.1_1" 0 0 0 0
(CATTC)n:Satellite:Satellite 191 54 136 25
(GAATG)n:Satellite:Satellite 143 63 92 28
7SK:RNA:RNA 35 24 34 19
ACRO1:acro:Satellite 2 1 1 3
ALINE:RTE:LINE 5 3 3 0
ALR/Alpha:centr:Satellite 795 180 202 113
AluJb:Alu:SINE 6512 5683 2840 6973
AluJo:Alu:SINE 3909 3387 1558 3910
AluJr4:Alu:SINE 698 576 257 740
AluJr:Alu:SINE 3776 3288 1509 4188
AluSc5:Alu:SINE 315 221 139 313
AluSc8:Alu:SINE 1071 1064 515 1181
AluSc:Alu:SINE 1676 1535 770 1729
AluSg4:Alu:SINE 360 372 204 399
AluSg7:Alu:SINE 330 302 166 393
AluSg:Alu:SINE 2114 2113 1049 2300
AluSp:Alu:SINE 3143 3042 1515 3306
AluSq10:Alu:SINE 81 80 26 79
AluSq2:Alu:SINE 3237 2767 1484 3225
AluSq4:Alu:SINE 53 39 27 43
AluSq:Alu:SINE 1269 1153 608 1191
AluSx1:Alu:SINE 5818 5235 2668 5914
AluSx3:Alu:SINE 1375 1264 601 1402
AluSx4:Alu:SINE 286 280 123 270
AluSx:Alu:SINE 7041 6401 3333 7242
AluSz6:Alu:SINE 2288 1966 965 2375
AluSz:Alu:SINE 4956 4598 2328 5287
AluY:Alu:SINE 5121 4805 2606 5155
AluYa5:Alu:SINE 323 242 187 222
AluYa8:Alu:SINE 3 4 1 4
AluYb8:Alu:SINE 98 65 33 81
AluYb9:Alu:SINE 3 15 5 13
AluYc3:Alu:SINE 22 22 10 21
AluYc5:Alu:SINE 0 0 0 1
AluYc:Alu:SINE 190 208 120 208
AluYd8:Alu:SINE 1 2 0 3
AluYf4:Alu:SINE 48 37 23 61
AluYf5:Alu:SINE 2 1 1 0
AluYg6:Alu:SINE 21 22 13 33
AluYh9:Alu:SINE 4 6 1 2
AluYk11:Alu:SINE 9 6 4 8
AluYk12:Alu:SINE 81 75 43 52
AluYk4:Alu:SINE 100 81 49 75
AmnSINE1:Deu:SINE 14 10 1 9
AmnSINE2:Deu:SINE 4 0 1 3
Arthur1:hAT-Tip100:DNA 40 38 14 35
Arthur1A:hAT-Tip100:DNA 29 13 5 24
Arthur1B:hAT-Tip100:DNA 40 37 15 55
Arthur1C:hAT-Tip100:DNA 8 6 5 13
BLACKJACK:hAT-Blackjack:DNA 74 73 28 70
BSR/Beta:Satellite:Satellite 50 14 14 45
CER:Satellite:Satellite 8 6 1 3
CR1_Mam:CR1:LINE 156 138 69 142
Charlie10:hAT-Charlie:DNA 45 57 22 35
Charlie10a:hAT-Charlie:DNA 6 4 2 7
Charlie10b:hAT-Charlie:DNA 5 8 3 13
Charlie11:hAT-Charlie:DNA 4 3 0 1
Charlie12:hAT-Charlie:DNA 0 1 0 1
Charlie13a:hAT-Charlie:DNA 21 13 7 15
Charlie13b:hAT-Charlie:DNA 19 9 13 18
Charlie14a:hAT-Charlie:DNA 5 4 0 9
Charlie15a:hAT-Charlie:DNA 117 129 42 126
Charlie16a:hAT-Charlie:DNA 82 53 25 87
Charlie17a:hAT-Charlie:DNA 77 100 20 111
Charlie18a:hAT-Charlie:DNA 94 73 31 82
Charlie19a:hAT-Charlie:DNA 68 65 25 100
Charlie1:hAT-Charlie:DNA 205 188 86 181
Charlie1a:hAT-Charlie:DNA 350 261 140 323
Charlie1b:hAT-Charlie:DNA 123 125 50 121
Charlie1b_Mars:hAT-Charlie:DNA 0 0 0 0
Charlie20a:hAT-Charlie:DNA 22 22 7 32
Charlie21a:hAT-Charlie:DNA 38 26 9 59
Charlie22a:hAT-Charlie:DNA 26 29 12 25
Charlie23a:hAT-Charlie:DNA 36 28 10 34
Charlie24:hAT-Charlie:DNA 40 41 11 45
Charlie25:hAT-Charlie:DNA 20 18 11 17
Charlie26a:hAT-Charlie:DNA 3 6 2 11
Charlie2a:hAT-Charlie:DNA 200 212 108 213
Charlie2b:hAT-Charlie:DNA 205 187 68 179
Charlie3:hAT-Charlie:DNA 13 11 3 13
Charlie4:hAT-Charlie:DNA 8 5 2 5
Charlie4a:hAT-Charlie:DNA 121 113 50 115
Charlie4z:hAT-Charlie:DNA 149 109 39 155
Charlie5:hAT-Charlie:DNA 159 121 55 136
Charlie6:hAT-Charlie:DNA 8 7 2 6
Charlie7:hAT-Charlie:DNA 83 65 33 104
Charlie7a:hAT-Charlie:DNA 20 21 10 28
Charlie8:hAT-Charlie:DNA 214 186 62 243
Charlie9:hAT-Charlie:DNA 27 29 7 46
CheshMITE:hAT-Charlie:DNA 0 0 0 0
Cheshire:hAT-Charlie:DNA 21 21 7 23
CheshireMars:hAT-Charlie:DNA 0 0 0 0
D20S16:Satellite:Satellite 2 4 0 8
DNA1_Mam:TcMar:DNA 3 4 2 5
ERV3-16A3_I-int:ERVL:LTR 281 306 78 376
ERV3-16A3_LTR:ERVL:LTR 23 16 9 28
ERVL-B4-int:ERVL:LTR 125 120 72 151
ERVL-E-int:ERVL:LTR 262 198 79 268
ERVL-int:ERVL:LTR 22 10 13 14
Eulor10:DNA:DNA? 0 0 0 0
Eulor11:DNA:DNA 0 0 1 0
Eulor12:DNA:DNA? 1 0 0 1
Eulor1:DNA:DNA 1 1 1 2
From these results, it seems that the TE counts (gene id) are aggregated by transcript rather than by subfamily, according to the TE GTF:
chr1 hg19_rmsk exon 100004963 100005398 2754 - . gene_id "Tigger2a"; transcript_id "Tigger2a_dup99"; family_id "TcMar-Tigger"; class_id "DNA";
chr1 hg19_rmsk exon 10000540 10000674 1079 - . gene_id "AluY"; transcript_id "AluY_dup614"; family_id "Alu"; class_id "SINE";
chr1 hg19_rmsk exon 100005498 100005720 1007 + . gene_id "MLT1C"; transcript_id "MLT1C_dup643"; family_id "ERVL-MaLR"; class_id "LTR";
chr1 hg19_rmsk exon 100005720 100006238 3636 - . gene_id "MER1A"; transcript_id "MER1A_dup101"; family_id "hAT-Charlie"; class_id "DNA";
chr1 hg19_rmsk exon 100006239 100006294 341 + . gene_id "MLT1C"; transcript_id "MLT1C_dup644"; family_id "ERVL-MaLR"; class_id "LTR";
chr1 hg19_rmsk exon 100006551 100007134 767 - . gene_id "L1MEg"; transcript_id "L1MEg_dup563"; family_id "L1"; class_id "LINE";
chr1 hg19_rmsk exon 100007242 100007566 245 - . gene_id "L1MEg"; transcript_id "L1MEg_dup564"; family_id "L1"; class_id "LINE";
chr1 hg19_rmsk exon 100007589 100007893 1999 - . gene_id "AluJr"; transcript_id "AluJr_dup3116"; family_id "Alu"; class_id "SINE";
chr1 hg19_rmsk exon 100008691 100008998 2263 - . gene_id "AluSc5"; transcript_id "AluSc5_dup264"; family_id "Alu"; class_id "SINE";
chr1 hg19_rmsk exon 100009353 100009528 748 + . gene_id "MER5B"; transcript_id "MER5B_dup632"; family_id "hAT-Charlie"; class_id "DNA";
chr1 hg19_rmsk exon 10001028 10001188 277 + . gene_id "L2c"; transcript_id "L2c_dup255"; family_id "L2"; class_id "LINE";
chr1 hg19_rmsk exon 100010663 100010883 390 - . gene_id "MIR"; transcript_id "MIR_dup8810"; family_id "MIR"; class_id "SINE";
chr1 hg19_rmsk exon 100011584 100011894 2220 + . gene_id "AluSz"; transcript_id "AluSz_dup4266"; family_id "Alu"; class_id "SINE";
chr1 hg19_rmsk exon 10001248 10001552 2167 - . gene_id "AluSx1"; transcript_id "AluSx1_dup548"; family_id "Alu"; class_id "SINE";
chr1 hg19_rmsk exon 100013022 100013237 505 + . gene_id "MIRc"; transcript_id "MIRc_dup5614"; family_id "MIR"; class_id "SINE";
chr1 hg19_rmsk exon 100013916 100014050 354 - . gene_id "L1M5"; transcript_id "L1M5_dup2005"; family_id "L1"; class_id "LINE";
And according to the TE GTF, the number of gene_id values is more than 1000.
Hi,
I'm not sure how you concluded that there are more than 1000 gene_id values in the GTF file. If you just extract the gene_id value and do a sort -u, you should get slightly fewer than 1000 unique gene_id values. Note that each line of the GTF refers to a particular copy (defined by a unique transcript_id), but it may share its gene_id with other lines.
Your count table looks as expected to me.
Thanks.
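For readers who prefer to run the same check in a script, here is a minimal sketch of the "extract gene_id and deduplicate" step. It assumes tab-separated GTF lines with `key "value";` attributes in column 9, as in the TE GTF excerpt quoted earlier in this thread; the helper names are hypothetical, and the sample records are the ones from that excerpt.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def gtf_attributes(line):
    """Parse column 9 of a GTF line into a dict of attributes."""
    fields = line.rstrip("\n").split("\t")
    return dict(ATTRIBUTE.findall(fields[8]))


def count_unique_ids(gtf_lines):
    """Return (number of unique gene_id, number of unique transcript_id)."""
    genes, transcripts = set(), set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        attrs = gtf_attributes(line)
        genes.add(attrs["gene_id"])
        transcripts.add(attrs["transcript_id"])
    return len(genes), len(transcripts)


def record(start, end, score, strand, gene, copy, family, cls):
    """Build one tab-separated TE GTF line (helper for the sample only)."""
    attrs = (f'gene_id "{gene}"; transcript_id "{copy}"; '
             f'family_id "{family}"; class_id "{cls}";')
    return "\t".join(["chr1", "hg19_rmsk", "exon", str(start), str(end),
                      str(score), strand, ".", attrs])


sample = [
    record(100004963, 100005398, 2754, "-", "Tigger2a", "Tigger2a_dup99", "TcMar-Tigger", "DNA"),
    record(100005498, 100005720, 1007, "+", "MLT1C", "MLT1C_dup643", "ERVL-MaLR", "LTR"),
    record(100006239, 100006294, 341, "+", "MLT1C", "MLT1C_dup644", "ERVL-MaLR", "LTR"),
    record(100006551, 100007134, 767, "-", "L1MEg", "L1MEg_dup563", "L1", "LINE"),
    record(100007242, 100007566, 245, "-", "L1MEg", "L1MEg_dup564", "L1", "LINE"),
]

n_genes, n_copies = count_unique_ids(sample)
print(n_genes, n_copies)  # 3 subfamilies, 5 individual copies
```

Five GTF lines collapse to three gene_id values here, which is the same effect that reduces the full annotation to roughly 1000 rows in the count table.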
Thanks so much!!! I have confirmed it following your method.
I'm sorry to bother you again. I have hundreds of BAM files, and running TEtranscripts or TEcount on them is slow. So I'm going to try using the GTF files provided for TEtranscripts with other tools, such as featureCounts, to obtain counts directly. Does --mode multi in TEtranscripts refer to overlapping or mapping?
Thanks
Hi,
--mode multi refers to taking multimappers into account.
If you're using featureCounts, you would probably want to use the following flags: -f, -O, -M, --fraction and -s (if you're using stranded libraries).
I'm assuming that you can't easily run multiple TEcount jobs at once, but if you could, you can speed things up by a) pre-sorting your BAM files by read name (and thus removing the --sortByPos parameter), and b) using a pre-built index available here (I think you're using GRCh37 GENCODE). You just have to decompress it (it's gzipped), and use that instead of the GTF (to skip building the TE index).
Thanks.
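If counts are obtained per TE copy from another tool, they still have to be collapsed to the subfamily level to resemble the table shown earlier in this thread. The sketch below illustrates only that summation step. It assumes a hypothetical input of (subfamily label, per-sample counts) pairs, uses made-up numbers, and does not reproduce how TEtranscripts itself distributes multimappers.

```python
def aggregate_by_subfamily(per_copy_rows):
    """Sum per-copy counts into one row per subfamily label.

    per_copy_rows: iterable of (label, counts) pairs, where label is a
    string such as "MLT1C:ERVL-MaLR:LTR" and counts holds one number per
    sample. Fractional counts are allowed.
    """
    totals = {}
    for label, counts in per_copy_rows:
        if label not in totals:
            totals[label] = [0.0] * len(counts)
        for i, value in enumerate(counts):
            totals[label][i] += value
    return totals


# Made-up per-copy counts for two samples (assumption, for illustration only).
per_copy = [
    ("MLT1C:ERVL-MaLR:LTR", [2.0, 0.5]),  # e.g. copy MLT1C_dup643
    ("MLT1C:ERVL-MaLR:LTR", [1.0, 1.5]),  # e.g. copy MLT1C_dup644
    ("L1MEg:L1:LINE", [0.0, 3.0]),        # e.g. copy L1MEg_dup563
]

for label, counts in sorted(aggregate_by_subfamily(per_copy).items()):
    print(label, *counts)
```

The label format mirrors the gene_id:family_id:class_id row names in the count matrix above, so the result has one row per subfamily regardless of how many genomic copies contributed.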
This is my code:
TEtranscripts --sortByPos --format BAM --mode multi -t SRR13797132Aligned.sortedByCoord.out.bam SRR13797133Aligned.sortedByCoord.out.bam -c SRR13797134Aligned.sortedByCoord.out.bam SRR13797135Aligned.sortedByCoord.out.bam --GTF /home/dell/database/hg19/gencode.v39lift37.annotation.gtf --TE /home/dell/database/tetranscripts/GRCh37_GENCODE_rmsk_TE.gtf --project sample_sorted_test
I performed the mapping with STAR against hg19, and I downloaded the gene GTF from GENCODE and the TE GTF from https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/GRCh37_GENCODE_rmsk_TE.gtf.gz.
When I run the code, the output looks like the following:
After the process finishes, the count matrix contains >60000 genes but only <1000 TEs, and I wonder how to resolve this?