mhammell-laboratory / TElocal

A package for quantifying transposable elements at a locus level for RNAseq datasets.
GNU General Public License v3.0

TEcount and TElocal #25

Closed TaoHJiang closed 1 year ago

TaoHJiang commented 1 year ago

Dear Oliver Tam,

Thank you for providing very useful software.

I used TEcount to generate a count table for each sample, and then merged the raw counts of all samples together. After TPM/VST normalization, we perform downstream differential analysis. Is this process reasonable, or do I need to replace TEcount with TElocal?

my code is:

    TEcount \
      --format BAM \
      -b file.bam \
      --GTF Downlod_ensemble.gtf \
      --TE rmsk_TE.gtf \
      --mode multi \
      --sortByPos \
      --project test -i 10

with regards
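The merging step described above (one count table per sample, joined into a single raw-count matrix) could be sketched roughly as follows. This is a minimal sketch, not part of TEcount itself. It assumes each per-sample table is tab-delimited with one header row, a feature ID in the first column and an integer count in the second. The function names `read_count_table` and `merge_count_tables` and the toy values are hypothetical.

```python
import csv
import io


def read_count_table(handle):
    """Read a two-column, tab-delimited count table into {feature_id: count}.

    Assumption: the first row is a header and counts are integers.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    return {row[0]: int(row[1]) for row in reader if row}


def merge_count_tables(tables):
    """Join per-sample tables into one matrix.

    Returns (sorted feature IDs, {sample: [counts in that feature order]}).
    A feature absent from a sample's table is filled with 0.
    """
    features = sorted(set().union(*(t.keys() for t in tables.values())))
    merged = {s: [t.get(f, 0) for f in features] for s, t in tables.items()}
    return features, merged


# Toy in-memory stand-ins for two per-sample tables (values are made up).
sample_a = io.StringIO("gene/TE\ta.bam\nGAPDH\t1200\nL1HS:L1:LINE\t35\n")
sample_b = io.StringIO("gene/TE\tb.bam\nGAPDH\t900\nAluYa5:Alu:SINE\t12\n")

tables = {"a": read_count_table(sample_a), "b": read_count_table(sample_b)}
features, merged = merge_count_tables(tables)

for sample, row in merged.items():
    print(sample, dict(zip(features, row)))
```

With real data, the `io.StringIO` objects would be replaced by opened count-table files. The merged matrix holds raw counts, which is what differential analysis tools such as DESeq2 and edgeR expect as input.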

olivertam commented 1 year ago

Hi,

It really depends on what you're trying to analyze. If you are looking at overall TE expression (all copies aggregated), then TEcount works well. If you need to look at individual TE loci, you can try TElocal. We typically use a downstream differential analysis algorithm, such as DESeq2 or edgeR, to perform the normalization (typically not TPM or VST for the purposes of differential expression), and then normalize with VST for visualization.

Thanks.

TaoHJiang commented 1 year ago

Thanks for your reply. The traditional "case-control" design cannot meet the needs of our differential analysis, so we need to write our own script for this part. I do not understand the fundamental difference between TEcount and TElocal, since both can output raw counts for genes and TEs. Thanks

olivertam commented 1 year ago

Hi,

TEcount aggregates transposable elements from the same subfamily (e.g. L1HS) in the count table, whereas TElocal has counts for each copy of the transposable element (e.g. L1HS_dup516). Thus, if you are trying to compare total L1HS between samples, you would use TEcount, whereas if you need to look at a specific locus, you would use TElocal.

Thanks.
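To make the difference in granularity concrete, here is a small sketch that collapses locus-level counts into subfamily totals. It is only an illustration of the relationship described above, not a replacement for running TEcount, and the totals are not claimed to match TEcount's output. It assumes locus IDs are colon-separated with the copy name first and the subfamily second (for example `L1HS_dup516:L1HS:L1:LINE`). The helper names and counts are hypothetical.

```python
from collections import defaultdict


def subfamily_of(locus_id):
    """Return the subfamily field of a locus-level ID.

    Assumption: IDs look like 'L1HS_dup516:L1HS:L1:LINE'
    (copy name, subfamily, family, class).
    """
    fields = locus_id.split(":")
    if len(fields) < 2:
        raise ValueError(f"unexpected locus ID format: {locus_id!r}")
    return fields[1]


def aggregate_by_subfamily(locus_counts):
    """Sum per-copy counts into one total per subfamily."""
    totals = defaultdict(int)
    for locus_id, count in locus_counts.items():
        totals[subfamily_of(locus_id)] += count
    return dict(totals)


# Made-up locus-level counts for one sample.
locus_counts = {
    "L1HS_dup516:L1HS:L1:LINE": 20,
    "L1HS_dup7:L1HS:L1:LINE": 5,
    "AluYa5_dup3:AluYa5:Alu:SINE": 8,
}

subfamily_counts = aggregate_by_subfamily(locus_counts)
print(subfamily_counts)
```

The locus-level table answers "which copy is expressed", while the aggregated table answers "how much of this subfamily is expressed overall".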

TaoHJiang commented 1 year ago

Thanks for your reply, I got it. In addition, I would like to ask whether you recommend a particular normalization method. You said earlier that TPM/FPKM is not recommended.

olivertam commented 1 year ago

Hi,

Without knowing exactly what you're trying to compare, it would be difficult. You mention that it's not a "case-control" experiment, in which case, are you comparing within the same sample?

Thanks

TaoHJiang commented 1 year ago

Thank you. For example, we have samples of multiple tissue types at different developmental stages, and we want to know whether a particular TE is highly expressed in one tissue at a given developmental stage. Thanks

TaoHJiang commented 1 year ago

We wanted to use a conventional normalization method (TPM/FPKM) for quantification and then compare the expression levels across tissues and developmental stages. Thanks

olivertam commented 1 year ago

Hi,

You can still do "case-control" style comparisons. You can first try to visualize the TE expression using VST-normalized data to see if there's a tissue with higher expression than others (e.g. heatmap). Then you can designate tissues of interest as your "case", and the other tissues in that developmental stage as "control", and then run a case-control comparison between them using the raw counts and a differential analysis algorithm.
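As a rough sketch of the labeling step just described, the function below marks one tissue as "case" and every other tissue in the same developmental stage as "control". The sample sheet layout (sample ID, tissue, stage), the function name `one_vs_rest_labels`, and the example samples are all hypothetical. The resulting labels would then be passed, together with the raw counts, to a differential analysis tool.

```python
def one_vs_rest_labels(samples, stage, case_tissue):
    """Label samples from one stage as 'case' or 'control'.

    samples: iterable of (sample_id, tissue, stage) tuples.
    Samples from other stages are left out of the comparison.
    """
    labels = {}
    for sample_id, tissue, sample_stage in samples:
        if sample_stage != stage:
            continue
        labels[sample_id] = "case" if tissue == case_tissue else "control"
    groups = set(labels.values())
    if groups != {"case", "control"}:
        raise ValueError("need at least one case and one control sample")
    return labels


# Made-up sample sheet: (sample ID, tissue, developmental stage).
samples = [
    ("s1", "liver", "E12"),
    ("s2", "liver", "E12"),
    ("s3", "brain", "E12"),
    ("s4", "heart", "E12"),
    ("s5", "liver", "E16"),
]

labels = one_vs_rest_labels(samples, stage="E12", case_tissue="liver")
print(labels)
```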

Or you can do "pairwise" comparisons, where you designate each tissue as a case, and the others as control, and look for differential expression. In that case, I would take all the raw p-values, and do multiple testing correction (e.g. FDR) on all of them.
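The pooled correction could look something like the sketch below, which applies the Benjamini-Hochberg procedure (one common FDR method) to the raw p-values from all comparisons at once rather than per comparison. The comparison names and p-values are made up, and in practice an established implementation (for example `p.adjust` in R) would normally be used instead of hand-written code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is no larger than the ones ranked after it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values from two one-vs-rest comparisons.
pvalues_by_comparison = {
    "liver_vs_rest": {"L1HS": 0.01, "AluYa5": 0.04},
    "brain_vs_rest": {"L1HS": 0.03, "AluYa5": 0.20},
}

# Pool every (comparison, TE) test before correcting.
keys = [(c, te) for c, tes in pvalues_by_comparison.items() for te in tes]
pooled = [pvalues_by_comparison[c][te] for c, te in keys]
adjusted = dict(zip(keys, benjamini_hochberg(pooled)))

for key, value in adjusted.items():
    print(key, round(value, 4))
```

Pooling first means the correction accounts for the total number of tests across all tissues, which is stricter than correcting each comparison separately.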

You can also do that with different developmental stages of the same tissue, and (technically) different developmental stages of different tissues. However, the comparison might not be as useful as there are too many variables to account for.

Thanks

TaoHJiang commented 1 year ago

Thanks for your reply. If we want to compare every stage and every tissue, the number of combinations is very large. For now, we just want to find the TE with the highest expression (fold change) in a particular tissue and stage, and treat it as the specific TE. Can VST values be used as expression levels?

olivertam commented 1 year ago

Hi,

What you can try is to run one comparison with a differential analysis algorithm using all the samples that you want to look at. Since these algorithms perform the normalization as part of the analysis, you can output the normalized values (and/or VST values) to then visualize on a heatmap. It doesn't really matter which exact case-control you run, as long as you can get the algorithm to output normalized counts that you can then process later on (with VST) and visualize (with a heatmap).

Thanks.
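For intuition about what "normalization as part of the analysis" means, here is a simplified sketch of median-of-ratios size factors, the approach DESeq2 uses to scale raw counts between samples. It is not DESeq2's actual code and it does not reproduce VST. The toy counts and the function names `size_factors` and `normalize` are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factor per sample.

    counts: {sample: [raw count per feature]}, same feature order everywhere.
    Only features with a nonzero count in every sample are used.
    """
    samples = list(counts)
    n_features = len(counts[samples[0]])
    log_geo_means = []
    for j in range(n_features):
        values = [counts[s][j] for s in samples]
        if all(v > 0 for v in values):
            mean_log = sum(math.log(v) for v in values) / len(values)
            log_geo_means.append((j, mean_log))
    if not log_geo_means:
        raise ValueError("no feature has nonzero counts in all samples")
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][j]) - lgm for j, lgm in log_geo_means]
        factors[s] = math.exp(median(log_ratios))
    return factors


def normalize(counts, factors):
    """Divide each sample's raw counts by its size factor."""
    return {s: [c / factors[s] for c in row] for s, row in counts.items()}


# Made-up counts: sample s2 was sequenced about twice as deeply as s1.
counts = {"s1": [10, 20, 30, 0], "s2": [20, 40, 60, 5]}
factors = size_factors(counts)
normalized = normalize(counts, factors)
print(factors)
print(normalized)
```

After scaling, the first three features have the same value in both samples, which is the depth difference being removed. The normalized values are for visualization and ranking; the differential test itself should still be run on the raw counts.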

TaoHJiang commented 1 year ago

Thank you very much for your kind and prompt reply! I will give it a try based on your suggestion.

TaoHJiang commented 1 year ago

Hi, one last question: can I run hundreds of samples (6 GB per sample) at the same time, and roughly how much memory would be required? Thanks

olivertam commented 1 year ago

If you are referring to the differential analysis, you will be using the joined count tables, so you should have Gb of data per sample. I've successfully managed 100s of samples with 20 to 30Gb of RAM (though that's me being cautious).

Thanks.

TaoHJiang commented 1 year ago

Thank you very much for your kind and prompt reply! Your suggestions are very instructive and helpful!