kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

--one_outlier_per_iter does not work #63

Closed kfuku52 closed 3 years ago

kfuku52 commented 3 years ago

--one_outlier_per_iter is passed to transcriptome_curation.r as a string "1" or "0", neither numeric nor boolean. Consequently, if(one_out_per_iter == TRUE) always returns FALSE, so --one_outlier_per_iter doesn't change the behavior at all. @Hego-CCTB Did it work in your env?

kfuku52 commented 3 years ago

Partially fixed in https://github.com/kfuku52/amalgkit/commit/cdd71a0a19d6dfcccd2624104d02af28a576dd24 --one_outlier_per_iter is now properly passed to transcriptome_curation.r as INT so if(one_out_per_iter == TRUE) works. However, even with the correct parsing of the option, transcriptome_curation.r still removes multiple SRA Runs at a time.

iteratively checking within-tissue correlation, round: 7 
Excluding only one outlier per bioproject or same tissue. 
Partially removed BioProjects due to low within-tissue correlation:
[1] "PRJNA438231" "PRJNA193691" "PRJNA656275" "PRJNA670620" "PRJNA668751"
[6] "PRJNA338450" "PRJNA541004" "PRJNA541005"
Removed Runs due to low within-tissue correlation:
[1] "SRR6833957"  "SRR789759"   "SRR12424227" "SRR12918257" "SRR12810599"
[6] "SRR4017793"  "SRR9007942"  "SRR9007981" 
round: 7 : # before = 728 : # after = 720 

iteratively checking within-tissue correlation, round: 8 
Excluding only one outlier per bioproject or same tissue. 
Partially removed BioProjects due to low within-tissue correlation:
[1] "PRJNA438231" "PRJNA656275" "PRJNA670620" "PRJNA338450" "PRJNA541004"
[6] "PRJNA541005"
Removed Runs due to low within-tissue correlation:
[1] "SRR6833964"  "SRR12424226" "SRR12918258" "SRR4017792"  "SRR9007941" 
[6] "SRR9007980" 
round: 8 : # before = 720 : # after = 714 

iteratively checking within-tissue correlation, round: 9 
Excluding only one outlier per bioproject or same tissue. 
Partially removed BioProjects due to low within-tissue correlation:
[1] "PRJNA438231" "PRJNA656275" "PRJNA338450" "PRJNA541004" "PRJNA541005"
Removed Runs due to low within-tissue correlation:
[1] "SRR6833966"  "SRR12424215" "SRR4017791"  "SRR9007940"  "SRR9007979" 
round: 9 : # before = 714 : # after = 709 

iteratively checking within-tissue correlation, round: 10 
Excluding only one outlier per bioproject or same tissue. 
Partially removed BioProjects due to low within-tissue correlation:
[1] "PRJNA338450" "PRJNA541004" "PRJNA541005"
Removed Runs due to low within-tissue correlation:
[1] "SRR4017790" "SRR9007939" "SRR9007978"
round: 10 : # before = 709 : # after = 706 

iteratively checking within-tissue correlation, round: 11 
Excluding only one outlier per bioproject or same tissue. 
Partially removed BioProjects due to low within-tissue correlation:
[1] "PRJNA338450" "PRJNA541004" "PRJNA541005"
Removed Runs due to low within-tissue correlation:
[1] "SRR4017789" "SRR9007938" "SRR9007977"
round: 11 : # before = 706 : # after = 703 
Hego-CCTB commented 3 years ago

It is possible for the function to remove multiple runs at a time, if the runs are from different bioprojects and different tissue.

So for example, it can remove 1 root sample and 1 flower sample in one go, if they come from two different bioprojects.

From the output it seems this is what's happening, since the number of bioprojects and the number of Samples are always equal.

I was under the impression that this was the intended behaviour from this comment. https://github.com/kfuku52/amalgkit/issues/13#issuecomment-867596731

That said, for me it's much easier to code 'true' 1 outlier per iteration.

kfuku52 commented 3 years ago

Thank you for clarifying. Did the current implementation solve the problem in Helianthus?

Hego-CCTB commented 3 years ago

I haven't redone the csca yet, but the problematic root samples were all successfully removed by the changes made to curate.

kfuku52 commented 3 years ago

Good! Could you share the curate outputs?

Hego-CCTB commented 3 years ago

Helianthus_annuus.zip

kfuku52 commented 3 years ago

Roots look better indeed, but SRR5803831 should have been removed too by the same change. Could you look into it?

Hego-CCTB commented 3 years ago

Roots look better indeed, but SRR5803831 should have been removed too by the same change. Could you look into it?

Looks like the sample barely makes the threshold. It's got a pearson r of 0.21 (same tissue that is. Other tissues were even worse). Maybe 0.2 threshold is too little?

kfuku52 commented 3 years ago

I see. Could you expose the threshold as a curate option so users can adjust it? The threshold value of 0.2 would indeed be too permissive. Let's set the tentative default value to 0.3. Once we have a better idea of similar cases, we can increase the value more.

Hego-CCTB commented 3 years ago

TPM_vs_FPKM_SVA.zip I was rerunning and playing around with the pearson threshold and tpm/fpkm to find a nice example for https://github.com/kfuku52/amalgkit/issues/64 And I found there is quite a noticeable difference between TPM_SVA and FPKM_SVA. Is this natural?

kfuku52 commented 3 years ago

The data look natural to me. What do you mean by the noticeable differene?

Hego-CCTB commented 3 years ago

The TPM final graph looks like it's performing better (I mixed up FPKM and TPM graphs in the .zip). The tissues look more uniform to me.

Hego-CCTB commented 3 years ago

Also TPM went through one more round of outlier removal.

kfuku52 commented 3 years ago

It is probably because Pearson's correlation coefficient is sensitive to outliers: i.e., extremely highly expressed genes. The sum of values is fixed in TPM, but not in FPKM, so FPKM tends to preserve outlier expression levels. Please look at Max expression. FPKM has a wider dynamic range. If you want to get rid of the effect of dynamic ranges, you can introduce an option for the method for cor() and choose Spearman instead of Pearson.

Hego-CCTB commented 3 years ago

That makes a lot of sense, thanks for the explanation!

Hego-CCTB commented 3 years ago

I see. Could you expose the threshold as a curate option so users can adjust it? The threshold value of 0.2 would indeed be too permissive. Let's set the tentative default value to 0.3. Once we have a better idea of similar cases, we can increase the value more.

Amalakgit ver. 0.6.2.2

https://github.com/kfuku52/amalgkit/commit/36feb7d5260e39f6afd95331cd9ec6dd3a70c96b