mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

What kind of data does control sample needed and question about personal GTF? #70

Closed Jiaqichen1202 closed 1 year ago

Jiaqichen1202 commented 4 years ago

HI, I use TEtranscripts for 30 kinds of tissue, the data was Paired-end, I don't know what kind of data should I put in control sample, what's the function of --control?Can I use Hisat2 for mapping? Thank you Chen

olivertam commented 4 years ago

Hi Chen,

You can certainly use Hisat2 for mapping, though there are caveats:

Are you interested in quantifying TE expression, or comparing expression between tissues (or both)? If you're interested in just the former, we would recommend using TEcount, since you do not have to provide a "control" library. If you are interested in the latter, then you use --control to indicate the tissue that you want to use as a baseline to compare against.

For example, if you are trying to determine what genes are expressed differently in brain compared to liver, then your --treatment would be the brain tissue, and your --control would be the liver tissue. You would then obtain log 2 fold change values that are based on brain exp over liver exp, so a l2fc > 0 would mean that it's higher in brain than liver (brain exp / liver exp > 1), and a l2fc < 0 would mean that it's lower in brain than liver (brain exp / liver exp < 1). If you want to do the opposite comparison, then switch the libraries for --treatment and --control.

Please let me know if you have any questons. Thanks.

Jiaqichen1202 commented 4 years ago

Thanks for reply, I decided to use TEcount. There is another question about personal GTF. I made it with gene annotations GTF and Repeatmasker by bedtool intersect like this:

gene annotations GTF: chr1 PacBio exon 40000 41000 . + . gene_id "aa"; transcript_id "PB.1.1"; chr1 PacBio exon 42000 43000 . + . gene_id bb"; transcript_id "PB.1.1";

Repeatmasker output:

28 1.9 3.1 9.8 chr1 30000 40200 L1M5 LINE/L1 1 61 (0) 1 880 18.3 2.3 12.2 chr1 42300 43500 L1M5 LINE/L1 1 959 (0) 2

So the TE.gtf is like this: chr1 PacBio exon 40000 40200 . + . gene_id "L1M5"; transcript_id "PB.1.1"; family_id "LINE/L1; class_id "LINE/L1"; chr1 PacBio exon 42300 43000 . + . gene_id "L1M5"; transcript_id "PB.1.1"; family_id "LINE/L1; class_id "LINE/L1";

Is this OK? The gene annotations GTF gene_id and TE GTF gene_id must be consistent? Since my gene annotations GTF was build by mapping to NR and uniprot, I don't know whether it's sensitive enough for TE. And what should I do (if the TE GTF above can be used )when the two gene id is different?

Thanks Chen

olivertam commented 4 years ago

Hi, I'm unsure why you are combining your gene annotations with RepeatMasker. Are your "gene" annotations TE predictions? Otherwise, you should keep them as two separate files, as TEcount requires a gene annotation AND a TE annotation. For your TE GTF file, you should aim to have a transcript_id that is unique, whereas the gene_id should be a subfamily of TE (so your gene_id examples are fine). I would have altered the family_id to L1, and class_id to LINE (since the final output would concatenate them), but they shouldn't break the program. Thanks.

Jiaqichen1202 commented 4 years ago

Hi, I think the gene annotation includes TE prediction. The reason why I combine the two GTF is RepeatMasker predicted TE with genomic data, and I think the TE gtf should provide exon locations. So I check the gene annotation gtf to find what's exons located on the position which was predicted to be TE in genome , and then I use these exons to build TE gtf. So maybe I should Use the repeatmasker output as TE gtf directly? Thanks Chen

olivertam commented 4 years ago

Hi, Most gene annotations (e.g. Ensembl, RefSeq) do not take TE into account. In your case, you are using Uniprot and NR (is it NCBI's nr database, or something else?). Neither Uniprot or NCBI's nr databases are likely to give you TE annotations. You are correct that RepeatMasker predicts TE with genomic data, and technically are not predicting them based on spliced isoforms. However, spliced TE are poorly characterized, and thus there are few annotations where we can call "intronic". Thus, we are treating RepeatMasker output as a list of "exonic" TE regions. Thus, we would recommend converting the RepeatMasker output into a TE GTF, and use the gene predictions as the gene GTF. However, if you think that your gene GTF is containing TE exons, then you can try to intersect the two datasets. If you are able to find an "exon" for each of the RepeatMasker annotation, then it is possible (but still very unlikely) that your gene annotations contain all TEs. Thanks.

Jiaqichen1202 commented 4 years ago

Hi, So even there are some exons annotated on TE position, they are still not TE's exons? Or TE exon annotation is meaningless due to its low identity? Thanks Chen

olivertam commented 4 years ago

Hi, Unless your gene set (used for your gene annotation) specifically has TE (Uniprot generally doesn't, or at most 1 or 2 that have been co-opted into genes), then you're more likely to be getting overlaps between TE and gene exons (which do exist). I am not sure what your NR gene set is, but again, unless they specifically included TE sequences in that set, you are probably not getting a TE-specific exon. Thanks.

yingjin07 commented 4 years ago

Hi Chen,

Yes. the "exon" field in TE GTF doesn't have biological meaning. The only reason is to make the TE annotation file has the similar format as Gene GTF file.

Best,

Ying

Ying Jin PhD

Computational Science Manager Bioinformatics Shared Resources Cold Spring Harbor Lab 516-367-5190 yjin at cshl dot edu


From: Jiaqichen1202 notifications@github.com Sent: Friday, May 15, 2020 10:21:31 PM To: mhammell-laboratory/TEtranscripts Cc: Subscribed Subject: Re: [mhammell-laboratory/TEtranscripts] What kind of data does control sample needed and question about personal GTF? (#70)

Hi, So even there are some exons annotated on TE position, they are still not TE's exons? Or TE exon annotation is meaningless due to its low identity? Thanks Chen

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_mhammell-2Dlaboratory_TEtranscripts_issues_70-23issuecomment-2D629574780&d=DwMCaQ&c=mkpgQs82XaCKIwNV8b32dmVOmERqJe4bBOtF0CetP9Y&r=4KRE0zvN4-P2SAzpXyOzmw&m=0-8LFm7zzssUCq3-RqXPfwi7SJIvAnYzKQlUVsnmyo4&s=9Yeo5oQy_PHpt5SvWo3nJGPDLPDAg_tvCnShxTlCzck&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADVC56FD6N6XGIXS2W73PJTRRX2CXANCNFSM4M625XPA&d=DwMCaQ&c=mkpgQs82XaCKIwNV8b32dmVOmERqJe4bBOtF0CetP9Y&r=4KRE0zvN4-P2SAzpXyOzmw&m=0-8LFm7zzssUCq3-RqXPfwi7SJIvAnYzKQlUVsnmyo4&s=dVaCECdmX2URc7dGIZQ_a1qbT7-LJ-onWQ7b5G6fLIE&e=.

Jiaqichen1202 commented 4 years ago

HI, I used TEcounts and have a result now, I‘m not sure how it counts locus annotated by both gene gtf and te gtf (even gene gtf has few TE annotation, but it still include some orf ). So the output counts gene 1,TE 1 or gene 0,TE1 or gene1,TE 0?

Thanks Chen

olivertam commented 4 years ago

Hi, This depends on whether the read is uniquely aligned to the genome, or aligned to multiple location. If the read is uniquely aligned and maps to both gene and TE, then the output is gene 1, TE 0. If the read is ambiguously aligned, with the alignments overlapping both genes and TE, then the output is biased towards TE (gene 0, TE 1 if we are treating it as a binary yes/no). Thanks.

Jiaqichen1202 commented 4 years ago

HI, Another question is do I need correct the gene(TE) length in Deseq2? The raw abundance counts in cntTable include a total counts of a TE with muti copies and every copy has different length, how do it add together and finally gives a total counts?

Thanks chen

olivertam commented 4 years ago

Hi,

If you are trying to do differential analysis with DESeq2, there is no need to correct for gene length. Thanks.

Cheers, Oliver

Jiaqichen1202 commented 4 years ago

HI, Is the TEtranscripts output data used in Fig.4 and supplementary Fig.5 in article TEtranscripts: a package for including transposable elements in differential expression analysis of RNA-seq datasets just cntTable data deal with Log2 fold changes and without DESEQ? Thanks Chen

olivertam commented 4 years ago

Hi, The log 2 fold changes were calculated by DESeq (with their median of ratio normalization) using the cntTable data, which is run as part of TEtranscripts. Please note that this data was generated from DESeq, not DESeq2. Thanks.

Jiaqichen1202 commented 4 years ago

Hi, The problem that always bothers me is that whether cntTable data should be corrected by some ways. What's the difference between the correction of genes and TEs? TPM and FPKM maybe not apply to TEs correction. Thanks Chen

olivertam commented 4 years ago

Hi, Could you expand on what you mean by "correction"? For the purposes of differential expression with DESeq, there is no requirement for adjusting based on feature length for either genes or TE. Both DESeq and edgeR recommends raw counts. If you want to perform intra-sample comparison, then yes, you would probably want to do TPM. TEtranscripts is not designed for comparing features within a sample. Also, TPM is not recommended for differential expression analysis. If you want to perform a TPM or FPKM, you probably would want to be quantifying at the locus level. If you want to enable intra-sample comparison AND differential analysis, edgeR's trimmed mean of M values might be suitable. Thanks.

Jiaqichen1202 commented 4 years ago

HI, The raw abundance counts in cntTable include a total counts of a TE with muti copies and every copy has different length, how do it add together and finally gives a total counts? When use Deseq2 for differential expression, normalizeGeneLength() was always unsed. What's the mean of "locus level" ? And "intra-sample comparison" means compare genes within a sample? Thanks. Chen

olivertam commented 4 years ago

Hi Chen, You are right that it is including adding together all the TE with different read lengths, and it is just adding the raw counts (with some EM aided redistribution of ambiguously mapped reads). According to what others are saying in forums, the gene length doesn't matter for differential analysis (this is one discussion that I pulled up in my search for an answer to your question). I don't think I have ever used normalizeGeneLength() when I did differential analysis with DESeq2, and I would like to hear more about how you used it. It may be a scenario that I do not commonly encounter. Yes, you are correct that "Intra-sample comparison" means comparing genes within a sample (e.g. ACTB vs GAPDH), and that is not something that TEtranscripts (nor DESeq2, if I recall) can do. Regarding "locus level", it means that instead of adding together all the different copies of the TE with different lengths, we treat each of them as an independent feature, and try to quantify each of them separately. It would then allow you to do the length normalization that you are interested in. Thanks for your interest.

Jiaqichen1202 commented 4 years ago

HI, How to normalize the TE in the different samples by the same length and the different sequence depth ? By give a same length to all TEs? Thanks Chen

olivertam commented 4 years ago

Hi Chen,

Ah, now I understand your question. So you have correctly pointed the difficulty with TEtranscripts in the way that it aggregates all copies of a particular TE. If you want to stick with TEtranscripts output, then you can aggregate the length of all the copies of the TE (based on the GTF), and use that as the "length" for each TE type (e.g. L1M5). Let me know if that is sufficient for your analysis.

Thanks.

Jiaqichen1202 commented 4 years ago

Hi, Now I find CPM can normalize depth without gene length. I will use fpm_table_normalized<- fpm(dds, robust = TRUE) in DEseq2. Finally I give up to processing about length in same breed, as for different breeds, I will try what you advice for "length".
Thanks Chen

Jiaqichen1202 commented 4 years ago

HI, I did an analysis of 30 tissues from each of the two breeds using TEcounts. The process was identical except that both breeds had their own genome sequences and gene annotation files. The two genome lengths as well as the repeat sequence content were almost identical. Their TEcounts results were particularly different like this:

 -- A B
Charlie4a:DNA/hAT-Charlie:DNA/hAT-Charlie 0 2081
Charlie4z:DNA/hAT-Charlie:DNA/hAT-Charlie 0 3195
Charlie5:DNA/hAT-Charlie:DNA/hAT-Charlie 1 1621
Charlie6:DNA/hAT-Charlie:DNA/hAT-Charlie 0 309
Charlie7:DNA/hAT-Charlie:DNA/hAT-Charlie 1 1472
ERV1-4_SSc-I:LTR/ERV1:LTR/ERV1 0 92
ERV1-4_SSc-LTR:LTR/ERV1:LTR/ERV1 0 186
ERV1N-1A2_SSc-I:LTR/ERV1:LTR/ERV1 0 107
HAL1:LINE/L1:LINE/L1 29 14234
HAL1M8:LINE/L1:LINE/L1 1 1053
HAL1ME:LINE/L1:LINE/L1 1 2936
HAL1_SS:LINE/L1:LINE/L1 121 17663

B is the reference genome provided by NCBI, and A is the unpublished genome that we assembled and annotated ourselves. Genome synthesis assessment of A is better than B. Both A and B are pig breeds.

In order to eliminate differences in genomes and annotations, I also used the sequence and annotation of B(NCBI ref genome and gtf) for the A comparison. Then the difference between the two TEcounts results is normal.

The problem now is that the difference in transposon expression between the two breeds cannot be that large. The repeat sequence content of the two genomes is also consistent. Our genome annotation for A is also the usual practice of comparing to the database (in my inspection I found annotations to some LINE orfs). Even if some of the transposons are included in the gene annotation file, according to your previous statement - as long as he's multiplexed it will still be annotated as TE, then there shouldn't be such a big difference between the two breeds. Sequencing depth, comparison code, repeatmakser annotation code all the same.

Thanks chen

olivertam commented 4 years ago

Hi Chen,

Please let me know if I understood your workflow correctly.

For genome A, you aligned to the genomic sequence of A that you assembled, and then used the gene annotation GTF of A that you generated yourself as well. For genome B, you are aligning to the NCBI genome and NCBI's GTF. That is your provided table. For your next test, you aligned sequences from A to the B genome and B gene annotations, and got closer results.

What TE GTF annotation did you use for A? Did you run repeatMasker for A, and generated a TE GTF that is specific for A? How different is that from the TE GTF for B? Are you seeing more differences between A & B because you are using the TE GTF generated from A? You mention that overall repeat content is similar, but are the position/abundance of different TE types similar?

Thanks.

Jiaqichen1202 commented 4 years ago

HI, Yes, I did the same process as you said, and the aboundance of different TE types is similar. for A: SINEs: 1756092 350736722 bp 13.85 % LINEs: 945860 515671162 bp 20.36 % LTR elements: 310994 118517682 bp 4.68 % DNA elements: 299231 59260550 bp 2.34 % for B: SINEs: 1770006 354723114 bp 14.18 % LINEs: 959153 526536450 bp 21.05 % LTR elements: 314711 120437693 bp 4.81 % DNA elements: 301118 59609687 bp 2.38 %

Thanks, chen

olivertam commented 4 years ago

Hi Chen,

That is very unusual. I am curious to know if there are differences in the mappability of the two genomes. Are there more unique mappers when aligning to the assembled genome versus the "reference" genome? If most of reads are now aligning unambiguously, then any reads that overlap both genes and TE will bias the assignment towards the gene. I'm not sure if this is the reason, but given what you have described, there must be something unusual with genome A regarding either mappability or overlaps between annotations. Are there more contigs from your assembly? Are the intergenic regions well assembled, or are the gene-rich regions better assembled? Sorry if these questions do not directly address the issue, but we have little experience with assembled genomes, and thus would like to know why there are such discrepancies.

Thanks.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days