kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

new SVA #12

Open kfuku52 opened 4 years ago

kfuku52 commented 3 years ago

Please check this paper out. We want to compare the performance of the SVA correction of log-transformed counts (current implementation), and newer methods designed specifically for non-log-transformed RNA-seq raw count data (ComBat-seq, SVA-seq, RUV-seq). https://academic.oup.com/nargab/article/2/3/lqaa078/5909519

Hego-CCTB commented 3 years ago

Any suggestion as to what data set I should use for testing/benchmarking?

kfuku52 commented 3 years ago

You would have to generate your own dataset. You can use Monodelphis domestica from my paper but it's more straight forward to use a plant species you study.

Hego-CCTB commented 3 years ago

I'm looking into the implementation of ComBat-seq now.

Hego-CCTB commented 3 years ago

In order to use CombatSeq I have to provide a "Batch" vector (I can include other parameters, such as tissues as a "group" parameter.).

What would constitute as the batch in SRA metadata? The Bioproject?

kfuku52 commented 3 years ago

BioProject tends to be the strongest predictor of SVs, so let's try BioProject alone first.

Hego-CCTB commented 3 years ago

The data set I'm using is Zea mays (my largest dataset bot in terms of individual runs and total transcripts). Matrix size is 147x131585.

This is how SVA in its current form performs (final output): Runtime for final SVA-adjustment: 139.065 seconds

Zea_mays.3.correlation_cutoff.sva.pdf

Here are some PCAs of CombatSeq Results: Runtime for single adjustment by CombatSeq: 92.961 seconds.

RAW COUNT PCA image COMBAT SEQ ADJUSTED COUNTS image COMBAT SEQ ADJUSTED COUNTS AND SUBSEQUENT LOG-FPKM TRANSFORMATION image

This is how it looks if you run CombatSeq together with iterative outlier removal (adjusted, but untransformed):

Zea_mays.4.correlation_cutoff.cbs.pdf

After applying log-FPKM transform on the adjusted counts, it actually looks like the dataset improved. However, SVA is much more powerful in terms of separation.

The big advantage of CombatSeq is, that it preserves the discrete nature of the counts, which some downstream analysis packages need (for example DEseq2).

kfuku52 commented 3 years ago

Thanks! It looks like the program worked correctly, which is nice. We should keep this result for a side-by-side comparison in the eventual publication. Please try SVA-seq next. I hope it outperforms SVA as it should be for RNA-seq data.

Hego-CCTB commented 3 years ago

Yup, moving on to SVA-seq.

Another note:

I have tried to include other covariates as well (the same ones that SVA is currently adjusting for), but I get the error: "At least one covariate is confounded with batch! Please remove confounded covariates and rerun ComBat-Seq". I could not determine what this actually means and how to fix it. Also CombatSeq can not handle cases, where there is only one sample (i.e. run) for a given batch (i.e. bioproject).

kfuku52 commented 3 years ago

That could mean that two or more covariates happened to be identical combinations or too similar to each other, but let's get SVA-seq done first.

Hego-CCTB commented 3 years ago

OK, SVA-seq implementation is really easy, since it uses the exact same inputs as SVA, it's just tailored to raw counts instead of transformed ones. Only difference in code is literally just the function name.

Run time of SVA-seq: 131.88 seconds

image

Separation looks really nice and PC1 already accounts for almost all of the variance, meaning batch effect is almost completely sorted out.

Caveat of this is, that FPKM/TPM transformation will not be possible afterwards, because SVAseq can produce negative values. I also have read that Svaseq applies log transformation as a first step in the algorithm.

Iterative outlier removal looks a bit weird right now, not entirely sure what the issue is, but the above data suggests SVAseq might be a good alternative. Zea_mays.4.correlation_cutoff.sva.pdf

kfuku52 commented 3 years ago

Hmm... the boxplot looks weird indeed. Did you apply log2 transformation after the SVAseq correction before calculating Pearson's correlation coefficient for the iterative outlier removal? Results may be biased by highly expressed genes if not.

Hego-CCTB commented 3 years ago

Yeah, that might be the issue here.

I did not perform log transform, since I assumed SVAseq does that itself (it's the first thing mentioned in the manual). However, it looks like it undoes log transformation before returning the data. The parabolic shape of the PCA is an indicator that as well.

Hego-CCTB commented 3 years ago

RUVseq results: RUVseq offers different approaches for variation elimination. The one used here is based on calculating empirical control genes first (housekeeping genes, basically), which are then used for RUVseq. Runtime: 87.496 seconds This PCA is AFTER FPKM+log transformation of the counts normalized by RUVseq: image

Separation looks decent, however, total explained variation is a bit low across the two PCs. Also RUVseq is not compatible with Kallisto data, since Kallisto produces non-integer counts, which RUVseq does not seem to like (I just rounded the counts here to avoid having to make a new dataset, so results are technically falsified slightly).

Hego-CCTB commented 3 years ago

This is RUVseq estimating the factors of unwanted variation using residuals, also after FPKM+log transform: image

We get a bit more variation explained this way, but separation doesn't look as strong.

Hego-CCTB commented 3 years ago

Just for completeness, here is the current SVA implementation: Runtime 151.142 seconds image

kfuku52 commented 3 years ago

Thanks! Can you make a quantitative comparison between the methods using the boxplots like this? image

Hego-CCTB commented 3 years ago

on it!

Hego-CCTB commented 3 years ago

So, I'm at a loss here. After calculating pearson's correlation, it turn's out that SVAseq produces massive negative correlations outside of same-tissue areas.

Current SVA: SVA SVA_heat

SVAseq: SVAseq

All that's white here is negative: SVAseq_heat RUVseq, no log, no FPKM: RUVseq RUVseq_heat RUVseq + log: RUV-log RUV-log_heat

CombatSeq: CombatSeq CombatSeq_heat

kfuku52 commented 3 years ago

Are you comparing them in the same unit (i.e., log-FPKM)?

Hego-CCTB commented 3 years ago

SVA and CombatSeq have log-FPKM applied, SVAseq and RUVseq are incompatible with FPKM calculations, so they are only logarithmized. There are two different RUV packages (different authors as RUVseq), which I haven't tried yet, but are supposedly able to handle log-FPKM as input.

kfuku52 commented 3 years ago

Could you try to force log-FPKM-ish transformation for SVA-/RUV-seq? We can't compare apples and oranges so the values should be adjusted as much as possible when compared, even though the transformation procedure may not be perfect. I guess a big problem for FPKM would be the presence of negative values after the SVA-/RUV-seq correction. If negative values are seen in a minor fraction of genes (up to several hundred genes in a genome), you can round them to zero to enable an FPKM-like transformation.

Hego-CCTB commented 3 years ago

this worked for SVAseq, by setting all negative values to 0: SVAseq_log_FPKM SVAseq_Heat_log_fpkm

RUVseq, however, becomes weird. It pretty much equalizes the whole dataset. RUVseq-log_fpkm RUVseq_log_fpkm_heat

kfuku52 commented 3 years ago

Thanks! I wounder if the majority of genes are negative in RUV-seq. Anyway, SVA and SVA-seq seem the two best methods so far. Could you compare SVA, SVA-seq, and no-correction side-by-side with the same Y axis?

Hego-CCTB commented 3 years ago

RUVseq doesn't produce negative values at all. But it might be, that it doesn't handle low expressed genes very well. I only remove unexpressed genes, but I leave lowly expressed ones in. Comparing the raw counts with the normalized ones, it looks like the data was centrized: genes with counts of 2 get bumped up to 30 and genes with counts of 700 go down to 100. If you then apply log-FPKM, everything gets into the range of one order of magnitude.

I suspect this is a combination of low expressed genes and the upper quartile normalization RUVseq applies. I'll try removing low expressed genes as well.

kfuku52 commented 3 years ago

Are you still working on it? If not, please close this issue with a brief summary.

kfuku52 commented 1 year ago

Batch correction paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8895431/

kfuku52 commented 1 year ago

Here is another batch correction paper: https://www.nature.com/articles/s41587-022-01440-w

kfuku52 commented 1 year ago

@Hego-CCTB I will take care of it if you don't have time.

Hego-CCTB commented 1 year ago

I'll keep working on this. Give me a week!

Hego-CCTB commented 1 year ago

OK. I have implemented svaseq, combatseq and ruvseq. I ran all different batch-effect-removal algorithms (including sva) on 29 different plant species (I'll run the animal set as well this weekend). Short explanation on how these were implemented and the logic behind it. I'll show a case here, where all algorithms perform well, but this will not be the full picture, of course. All plots you see show log2-FPKM transformed values.

uncorrected values (just log2fpkm transformed)

Prunus_persica.2.correlation_cutoff.pdf

sva

Nothing has changed. This will be the correction everything will be compared against.

Prunus_persica.2.correlation_cutoff.sva.pdf

svaseq

Identical to sva, except it's supposed to take in counts, rather than log-transformed counts. The documentation states that the first thing svaseq does is to apply a log(x +c) transformation, where X is the count matrix and C is a constant (defaults to 1) . So the initial logic was:

  1. apply FPKM
  2. apply temporary log-transformation only for plotting
  3. run iterative outlier removal
  4. apply temprorary log-transformation before every check_within_tissue correlation, since pearson-correlation is made for linear space.
  5. apply log-transformation on the last iteration

These were the results: Prunus_persica.2.correlation_cutoff.svaseq.pdf

HOWEVER: When I treat svaseq exactly the same way as sva. So log2FPKM BEFORE any calculations start, it performs almost identical to sva, sometimes even slightly outperforms it. I have no explanation as to why this happens. Prunus_persica.2.correlation_cutoff.svaseq.log_before_sva.pdf

combatseq

combatseq relies on known sources of batch effects. Since from experience the biggest source is almost always bioproject (at least judging from sva output), that's what I tell combatseq to look for. Additionally, you can supply a covariate which represents the biological signal (curate_group in our case). Combatseq will try to preserve this signal. Also of note: combatseq can not handle cases, where a batch contains only one sample (i.e. cases where there is only 1 sample for a given bioproject), so I need to remove those bioprojects from the dataset.

The implementation logic is to run combatseq on untransformed raw-counts and apply log2FPKM for check_within_tissue correlation and for every plot. And keep the log2FPKM transformation when the algorithm is done.

  1. no FPKM OR log transformation
  2. apply temporary log2FPKM-transformation only for plotting
  3. run iterative outlier removal
  4. apply temprorary log2FPKM--transformation before every check_within_tissue correlation, since pearson-correlation is made for linear space.
  5. apply log2FPKM-transformation on the last iteration

Prunus_persica.2.correlation_cutoff.combatseq.pdf

RUVseq

RUVseq needs control genes (i.e. genes that don't change expression across conditions/treatments/tissues) to find unwanted variation. There are a lot of different ways to do this:

The last one (residuals) is what I went with. The outlier-removal/tissue-correlation logic is the same as in Combatseq:run combatseq on untransformed raw-counts and apply log2FPKM for check_within_tissue correlation and for every plot. And keep the log2FPKM transformation when the algorithm is done.

  1. no FPKM OR log transformation
  2. apply temporary log2FPKM-transformation only for plotting
  3. run iterative outlier removal
  4. apply temprorary log2FPKM--transformation before every check_within_tissue correlation, since pearson-correlation is made for linear space.
  5. apply log2FPKM-transformation on the last iteration

Prunus_persica.2.correlation_cutoff.ruvseq.pdf

Hego-CCTB commented 1 year ago

Again, I want to stress that this is an ideal use case for these algorithms. Especially Combatseq performed exceptionally. However, Combat and RUV often struggle in larger datasets:

Oryza sativa

Oryza_sativa.5.correlation_cutoff.pdf Oryza_sativa.5.correlation_cutoff.ruvseq.pdf Oryza_sativa.5.correlation_cutoff.sva.pdf Oryza_sativa.5.correlation_cutoff.svaseq.log_before_sva.pdf Oryza_sativa.5.correlation_cutoff.combatseq.pdf

Vitis vinifera

Vitis_vinifera.2.correlation_cutoff.svaseq.pdf Vitis_vinifera.2.correlation_cutoff.combatseq.pdf Vitis_vinifera.2.correlation_cutoff.pdf

Vitis_vinifera.2.correlation_cutoff.ruvseq.pdf Vitis_vinifera.2.correlation_cutoff.sva.pdf

kfuku52 commented 1 year ago

Great! CombatSeq seems to be a good choice when single-sample BioProjects can be excluded. We should probably use sva in curate by default and CombatSeq as an option.

Especially Combatseq performed exceptionally.

This would be great to discuss in the amalgkit paper.

Hego-CCTB commented 1 year ago

Yeah, sva (or svaseq) should remain default. They perform consistently well independent of dataset. What bugs me is why svaseq behaves so well when I feed it the supposedly wrong input (i.e. log-transformed values instead of counts). I feel like I'm missing or misunderstood something.

I haven't had a chance to look at all 29 species yet, but the pattern really seems to be the size of the dataset (i.e. the number of batch variables) that determines how well Combatseq and RUVseq perform. This could mean that combatseq may be outperforming sva for private/local fastq projects like my carnivore set. I'll have a look what happens when I run combatseq on that.

Combatseq can potentially be improved, as it is possible to add more covariates. As for RUVseq, I'd like to try out the 'least significant differentially expressed genes' as control genes. Maybe it will improve performance on larger sets. What seems to be happening is over-correction. Maybe the GLM the residuals stem from isn't robust enough to handle many different bioprojects, but that is pure speculation.

kfuku52 commented 1 year ago

Could you expose this functionality as a new option in curate?

kfuku52 commented 1 year ago

The only difference between sva and svaseq appears to be whether the function automatically converts the input expression data with the logn(x+1) transformation. https://rdrr.io/bioc/sva/src/R/sva.R https://rdrr.io/bioc/sva/src/R/svaseq.R

amalgkit curate already supports various log transformations via --norm, so the use of svaseq does not make sense. I will remove --batch_effect_alg svaseq.

kfuku52 commented 1 year ago

@Hego-CCTB In https://github.com/kfuku52/amalgkit/issues/12#issuecomment-1508320485, I suspect that the file Oryza_sativa.5.correlation_cutoff.combatseq.pdf does not represent combatseq-corrected data. PRJNA404045 appears in the plot even though it only contains one sample. This sample should have been excluded if combatseq had been correctly applied. This issue may have arisen because the original, uncorrected data was returned when combatseq failed. You can see this at https://github.com/kfuku52/amalgkit/blob/772c1901f14d2fb7e5571005ab7f940210d74eb2/amalgkit/curate.r#L406-L410

RUVseq might encounter a similar issue. As a temporary measure, I will deactivate it so that amalgkit returns an error when batch correction fails.

Hego-CCTB commented 10 months ago

I'll investigate this. Something is happening to the matrix, but that could just be the log that's applied before plotting. Also just so we are on the same page: the sample (and BP) removal happens independently and before the batch-effect removal. So if the sample is there, it means something in check_within_curate_group_correlation() went wrong, rather than in batch_effect_subtraction().

EDIT: PRJNA404045 is present in the sva corrected output as well.

Hego-CCTB commented 6 months ago

I would like to be able to discuss, or at least mention that RUV and ComBat are available for amalgkit in the paper, so I'll try to get this issue sorted this week.

kfuku52 commented 2 months ago

Have you had a chance to investigate this issue?