hyunhwan-jeong / SalmonTE

SalmonTE is an ultra-Fast and Scalable Quantification Pipeline of Transposable Element (TE) Abundances
GNU General Public License v3.0

Count data for analysis #31

Closed FabianDK closed 4 years ago

FabianDK commented 5 years ago

Thanks for developing SalmonTE and congrats on your recent papers.

I have been playing around with your tool and was wondering about the count output data:

Is it correct that the counts in the output file can be directly used for analysis/plots without further normalisation? Or should the TPM output values be used for that?

Many thanks!

hyunhwan-jeong commented 5 years ago

Hello @FabianDK,

It depends on the parameter you use when the count matrix is generated. In other words, if you use --exprtype=count, SalmonTE will perform the analysis with counts; otherwise TPM will be used. Whichever exprtype is selected, normalization will be performed for the analysis. I would recommend using count instead of TPM if your condition is binary; TPM is fine if your condition is numeric and you wish to perform a regression analysis.

Hyun-Hwan Jeong

FabianDK commented 5 years ago

Thanks for your reply.

If I understand your R-code correctly, you are using lm on the estimated count data. What I am wondering is: are the count data already corrected for library size (or other biases)?

What I had in mind was taking the count output from "SalmonTE.py quant" to do my own lm/anovas and plots in R. It is just not clear to me if the estimated counts should be further corrected or not.

hyunhwan-jeong commented 5 years ago

As you've figured out, there is no normalization in the do.lm function, and SalmonTE does not perform any normalization before the analysis. Therefore, if you have estimated counts and want to perform lm or anova, you will need to normalize them yourself. Do you have any suggestions for the count correction?

Hyun-Hwan Jeong
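
As a rough illustration of the library-size correction discussed above, the sketch below rescales estimated counts to counts per million (CPM) and optionally log-transforms them before any lm/anova. It is not part of SalmonTE; the function names, the toy counts, and the choice of CPM are assumptions made for this example. Note that the library size here is the total over the TEs in the matrix only, which is itself an assumption; using the total number of mapped reads may be more appropriate.

```python
import math


def counts_per_million(counts):
    """Scale each sample so that its counts sum to one million.

    counts: dict mapping sample name -> dict of TE name -> estimated count.
    """
    cpm = {}
    for sample, te_counts in counts.items():
        library_size = sum(te_counts.values())
        if library_size <= 0:
            raise ValueError(f"sample {sample!r} has no counts")
        cpm[sample] = {te: c * 1e6 / library_size for te, c in te_counts.items()}
    return cpm


def log2_cpm(counts, pseudocount=1.0):
    """log2(CPM + pseudocount), a common input scale for linear models."""
    return {
        sample: {te: math.log2(v + pseudocount) for te, v in tes.items()}
        for sample, tes in counts_per_million(counts).items()
    }


# Toy (made-up) counts: sample_B was sequenced ten times deeper than sample_A.
toy_counts = {
    "sample_A": {"L1HS": 30.0, "AluY": 70.0},
    "sample_B": {"L1HS": 300.0, "AluY": 700.0},
}
toy_cpm = counts_per_million(toy_counts)
toy_log = log2_cpm(toy_counts)
```

After scaling, the two toy samples have identical CPM values, so the tenfold difference in sequencing depth no longer looks like a difference in expression.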

FabianDK commented 5 years ago

Thanks for clarifying this. Could this also be an issue for the statistics with DESeq? If one library has a systematically higher TE expression than another, then would DESeq wrongly assume that this is technical rather than biological? Thus, the sizeFactor in the DESeq correction could be wrong resulting in many false negatives? As far as I know, DESeq assumes that a majority of genes (or TEs in this case) are not differentially expressed.

There are two things that come to my mind, which potentially correct for this: First, the most accurate approach might be to include the whole repeat-masked reference genome. But this would complicate the whole methodology. Second, you could include a bunch of housekeeping genes in the fasta that you use for normalization, but it is probably hard to define true housekeeping genes that do not change across sex, age or other factors.

Daniel
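
The concern about size factors can be made concrete with a simplified sketch of the median-of-ratios idea behind DESeq-style normalization. This is not the DESeq code itself; the function name and the toy counts are assumptions for illustration. In the toy data every TE is exactly twice as abundant in sample B as in sample A.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of counts, with the same
    feature order in every sample. Features with a zero in any sample
    are skipped because their geometric mean is zero.
    """
    samples = list(counts)
    n_features = len(counts[samples[0]])
    log_geo_means = []
    for i in range(n_features):
        values = [counts[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    if all(lg is None for lg in log_geo_means):
        raise ValueError("every feature has a zero count in some sample")
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][i]) - lg
            for i, lg in enumerate(log_geo_means)
            if lg is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy (made-up) TE-only counts: all TEs are doubled in sample B.
toy = {"A": [10, 20, 40], "B": [20, 40, 80]}
sf = size_factors(toy)
normalized = {s: [c / sf[s] for c in toy[s]] for s in toy}
```

With a TE-only matrix, the size factor of B comes out twice that of A, and the normalized counts of the two samples become identical. A genuine global increase in TE expression would therefore be indistinguishable from a difference in sequencing depth, which is the scenario raised in the comment above.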

Puputnik commented 5 years ago

Hello everyone. I completely agree with FabianDK. I see some problems that can arise from the fact that with SalmonTE the user is analysing the TEs only. It is easy to draw incorrect conclusions if the relative expression of all the TE families varies considerably among patients. In my opinion, the preferable way to resolve this would be to make a reference that includes both the TE sequences and the "normal" gene transcripts (taken from Ensembl or GENCODE). In this way the prerequisites for using packages such as DESeq2 should be respected, as technically we are just adding ~600 transcripts to the "classical" references that are used for this purpose. Considering the high speed of Salmon, it shouldn't be very painful to run the analysis in this way, or at least the loss in computing speed would be largely repaid by much stronger results.

Hyunhwaj, can you please provide an index produced with a combination of both the Ensembl/GENCODE reference and the TE reference taken from Repbase? I would like to do that myself, but Repbase recently stopped being an open-access database.

I have an additional question: is SalmonTE doing some sort of TE-specific calculation in order to assess TE expression, or is it entirely based on Salmon? I mean, what is the difference between running Salmon with a custom transcriptome index (made from the Repbase reference) and running SalmonTE? If there are any differences in the calculation methods, this could be a problem for the kind of approach that I proposed before.

Thanks in advance

Filippo
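
The combined gene-plus-TE reference proposed above could be assembled along these lines before building an index from it. This is a minimal sketch under stated assumptions: the function name, the file names, and the toy sequences are hypothetical, and the duplicate-ID check is simply a precaution added for this example.

```python
import os
import tempfile


def merge_fasta(input_paths, output_path):
    """Concatenate FASTA files, refusing duplicate sequence IDs.

    Returns the number of records written.
    """
    seen = set()
    n_records = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        parts = line[1:].split()
                        if not parts:
                            raise ValueError(f"empty FASTA header in {path}")
                        seq_id = parts[0]
                        if seq_id in seen:
                            raise ValueError(f"duplicate sequence ID: {seq_id}")
                        seen.add(seq_id)
                        n_records += 1
                    # Guard against a missing newline at the end of a file.
                    out.write(line if line.endswith("\n") else line + "\n")
    return n_records


# Toy demonstration with made-up sequences and hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    genes = os.path.join(tmp, "genes.fa")
    tes = os.path.join(tmp, "tes.fa")
    merged = os.path.join(tmp, "genes_plus_tes.fa")
    with open(genes, "w") as fh:
        fh.write(">toy_transcript_1\nACGTACGT\n>toy_transcript_2\nGGGGCCCC")
    with open(tes, "w") as fh:
        fh.write(">toy_TE_1\nTTTTAAAA\n")
    n_merged = merge_fasta([genes, tes], merged)
    with open(merged) as fh:
        merged_text = fh.read()
```

Checking for duplicate IDs matters because the same sequence name appearing in both the gene and the TE file would make the resulting abundance table ambiguous.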

hyunhwan-jeong commented 5 years ago

Dear @Puputnik and @FabianDK,

Thanks for your interest in our tool, and sorry for the late response. Based on @Puputnik's suggestion, I have created an index and uploaded it to a public space. You can download the index here (hs_all.fa): (Click the Box link), and place and decompress it in the reference folder of the SalmonTE directory. I also recommend that you update SalmonTE (i.e. by executing git pull).

The ID of the index should be hs_all. Please be aware that it is experimental, and we still need to verify that it really works.

Here is a response to @FabianDK's question:

Thus, the sizeFactor in the DESeq correction could be wrong resulting in many false negatives? As far as I know, DESeq assumes that a majority of genes (or TEs in this case) are not differentially expressed.

I previously compared the sizeFactor estimated from genes only with the one estimated from TEs only for the GTEx brain RNA-Seq dataset (~1,000 samples), and the bottom line of the comparison was that they are highly correlated. It was a brief observation, and I need to examine it more precisely. However, I believe that if there are outlier samples for TEs in your dataset, these outlier samples will also be problematic for gene expression, and such a case cannot be handled by either DESeq2 or SalmonTE.
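
A comparison of this kind could be sketched as follows: given one size factor per sample estimated from genes only and one estimated from TEs only, compute their Pearson correlation and the per-sample log2 ratio to spot outliers. The function names and the toy values are assumptions for illustration; they are not the GTEx results mentioned above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def log2_ratios(te_factors, gene_factors):
    """Per-sample log2(TE-based / gene-based) size factor; 0 means agreement."""
    return [math.log2(t / g) for t, g in zip(te_factors, gene_factors)]


# Toy (made-up) size factors for four samples; the last one disagrees.
gene_sf = [0.8, 1.0, 1.2, 1.0]
te_sf = [0.8, 1.0, 1.2, 2.0]
r = pearson(gene_sf, te_sf)
ratios = log2_ratios(te_sf, gene_sf)
```

Samples whose log2 ratio is far from zero are the ones where TE-based and gene-based normalization would disagree, and they would deserve a closer look before any downstream test.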


Here is a response to a @Puputnik's question:

I have an additional question: is SalmonTE doing some sort of TE-specific calculation in order to assess TE expression, or is it entirely based on Salmon? I mean, what is the difference between running Salmon with a custom transcriptome index (made from the Repbase reference) and running SalmonTE? If there are any differences in the calculation methods, this could be a problem for the kind of approach that I proposed before.

There is no TE-specific calculation, but I am preparing an update of the calculation for the next version of SalmonTE.

Kind Regards,

Hyun-Hwan Jeong

Puputnik commented 5 years ago

Thank you @hyunhwaj, I will give it a try as soon as possible. So you merged a transcriptome reference and the TE reference? Can you please tell me which transcriptome reference you used? So basically there is no difference between running SalmonTE and running Salmon with the reference you made, right?

Thanks a lot!

Best Regards

Puputnik commented 5 years ago

By the way @hyunhwaj, can you please provide an alternative link for downloading the index? I don't have a Box account and I can't make one. Alternatively, you could just upload it to GitHub.

Thanks!

hyunhwan-jeong commented 5 years ago

Hello @Puputnik,

So you merged a transcriptome reference and the TE reference?

Yes, it is the same thing as you suggested.

Can you please tell me which transcriptome reference did you use?

I used the latest version of the Ensembl cDNA reference, and I did not filter out any transcripts from it. Let me know if you have a better suggestion.

So basically there is no difference in running SalmonTE or Salmon+"the reference you made", right?

That is correct.

By the way @hyunhwaj, can you please provide an alternative link for downloading the index? I don't have a Box account and I can't make one. Alternatively, you could just upload it to GitHub.

Thanks!

My bad! The link was not correct, and I have corrected it. You can now download the index without logging in!

Best,

Hyun-Hwan Jeong