kfuku52 / amalgkit

RNA-seq data amalgamation for large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

nlog vs log2 #100

Closed kfuku52 closed 1 year ago

kfuku52 commented 2 years ago

I just found this in transcriptome_curation.r. In gene expression studies, it is common practice to use log2 rather than the natural log (my paper does too). I will change it to log2 in the next push, but let me know if we already discussed this and decided to use the natural log.

if (transform_method == "fpkm") {
    tc <- transform_raw_to_fpkm(tc, tc_eff_length)
    tc <- log(tc + 1)
}
if (transform_method == "tpm") {
    tc <- transform_raw_to_tpm(tc, tc_eff_length)
    tc <- log(tc + 1)
}
kfuku52 commented 2 years ago

Pushed. This change may make previously curated data inconsistent with newly curated data. Please rerun curate if necessary.

docxology commented 2 years ago

Hello, thank you for the update.

A few questions, feel free to address however you see fit.

  1. Is it OK to utilize the logn for the current analysis (e.g. something we've analyzed using the version of amalgkit before this update), and just mention in the Methods that we used the natural log rather than log2?
  2. Do you think that this log2/logn difference would influence e.g. rank ordering of expression levels within a tissue, or statistical significance testing?

And more generally:

  1. Could the base of the logarithm be a parameter chosen by the amalgkit user? I could see relevant interpretations of log2, logn, log10, and non-log values for tissue-specific TPM levels.

Let me know if I can provide any more information, or test anything specific.

kfuku52 commented 2 years ago

Is it OK to utilize the logn for the current analysis (e.g. something we've analyzed using the version of amalgkit before this update), and just mention in the Methods that we used the natural log rather than log2?

Yes, this is no problem. The reason to use log2 is just for consistency with other genomics papers.

Do you think that this log2/logn difference would influence e.g. rank ordering of expression levels within a tissue, or statistical significance testing?

There will be no problem as far as I know.
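
A quick way to see why: log2(x + 1) equals log(x + 1) / log(2), so switching bases only rescales every value by the same positive constant. The sketch below is an illustration with made-up values (not amalgkit output, and not code from transcriptome_curation.r). It checks that rank order and a scale-invariant test statistic are unchanged. Anything that depends on absolute differences, such as a fixed fold-change cutoff, would still need to be rescaled by the same factor.

```r
# Made-up TPM values for illustration only
tpm <- c(geneA = 0, geneB = 3.2, geneC = 150.7, geneD = 12.5, geneE = 980.1)

ln_expr   <- log(tpm + 1)    # natural log, as in the older code
log2_expr <- log2(tpm + 1)   # log2, as in the updated code

# The two differ only by the constant factor 1 / log(2)
scale_ok <- isTRUE(all.equal(log2_expr, ln_expr / log(2)))

# Rank ordering of expression levels is identical
rank_ok <- identical(rank(ln_expr), rank(log2_expr))

# A t statistic is invariant to rescaling by a positive constant
set.seed(1)
group1 <- rlnorm(10, meanlog = 3)   # simulated expression, condition 1
group2 <- rlnorm(10, meanlog = 4)   # simulated expression, condition 2
t_ln   <- t.test(log(group1 + 1),  log(group2 + 1))$statistic
t_log2 <- t.test(log2(group1 + 1), log2(group2 + 1))$statistic
t_ok   <- isTRUE(all.equal(t_ln, t_log2))
```

If `scale_ok`, `rank_ok`, and `t_ok` are all `TRUE`, the choice of base does not change the ordering or the test result for this kind of comparison.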

Could the base of the logarithm be a parameter chosen by the amalgkit user? I could see relevant interpretations of log2, logn, log10, and non-log values for tissue-specific TPM levels.

Good idea! @Hego-CCTB Could you organize these options? If you don't have time to do it this week, I will.
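
One possible shape for such an option, as a rough sketch only: the function name `log_p1_transform` and the `base` values are hypothetical and are not the actual amalgkit implementation or its option names.

```r
# Hypothetical helper, not part of amalgkit:
# apply a log(x + 1) transformation with a user-chosen base.
log_p1_transform <- function(x, base = c("2", "n", "10", "none")) {
    base <- match.arg(base)
    switch(base,
        "2"    = log2(x + 1),
        "n"    = log1p(x),       # natural log of (x + 1)
        "10"   = log10(x + 1),
        "none" = x               # leave values untransformed
    )
}

# Toy expression matrix (made-up values)
tc <- matrix(c(0, 1, 3, 7, 15, 31), nrow = 2)
tc_log2 <- log_p1_transform(tc, base = "2")
tc_ln   <- log_p1_transform(tc, base = "n")
tc_raw  <- log_p1_transform(tc, base = "none")
```

Using `match.arg` makes an unsupported base fail early with an error instead of silently falling through.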

kfuku52 commented 2 years ago

@Hego-CCTB I'll take care of it.

kfuku52 commented 2 years ago

@C20H25N30 log_n(FPKM+1) transformation is supported in the latest version. It's available with amalgkit curate --norm lognp1-fpkm. Please let me know whether it works well.