Closed kfuku52 closed 3 years ago
Partially fixed in https://github.com/kfuku52/amalgkit/commit/cdd71a0a19d6dfcccd2624104d02af28a576dd24
--one_outlier_per_iter is now properly passed to transcriptome_curation.r
as INT so if(one_out_per_iter == TRUE)
works. However, even with the correct parsing of the option, transcriptome_curation.r
still removes multiple SRA Runs at a time.
iteratively checking within-tissue correlation, round: 7
Excluding only one outlier per bioproject or same tissue.
Partially removed BioProjects due to low within-tissue correlation:
[1] "PRJNA438231" "PRJNA193691" "PRJNA656275" "PRJNA670620" "PRJNA668751"
[6] "PRJNA338450" "PRJNA541004" "PRJNA541005"
Removed Runs due to low within-tissue correlation:
[1] "SRR6833957" "SRR789759" "SRR12424227" "SRR12918257" "SRR12810599"
[6] "SRR4017793" "SRR9007942" "SRR9007981"
round: 7 : # before = 728 : # after = 720
iteratively checking within-tissue correlation, round: 8
Excluding only one outlier per bioproject or same tissue.
Partially removed BioProjects due to low within-tissue correlation:
[1] "PRJNA438231" "PRJNA656275" "PRJNA670620" "PRJNA338450" "PRJNA541004"
[6] "PRJNA541005"
Removed Runs due to low within-tissue correlation:
[1] "SRR6833964" "SRR12424226" "SRR12918258" "SRR4017792" "SRR9007941"
[6] "SRR9007980"
round: 8 : # before = 720 : # after = 714
iteratively checking within-tissue correlation, round: 9
Excluding only one outlier per bioproject or same tissue.
Partially removed BioProjects due to low within-tissue correlation:
[1] "PRJNA438231" "PRJNA656275" "PRJNA338450" "PRJNA541004" "PRJNA541005"
Removed Runs due to low within-tissue correlation:
[1] "SRR6833966" "SRR12424215" "SRR4017791" "SRR9007940" "SRR9007979"
round: 9 : # before = 714 : # after = 709
iteratively checking within-tissue correlation, round: 10
Excluding only one outlier per bioproject or same tissue.
Partially removed BioProjects due to low within-tissue correlation:
[1] "PRJNA338450" "PRJNA541004" "PRJNA541005"
Removed Runs due to low within-tissue correlation:
[1] "SRR4017790" "SRR9007939" "SRR9007978"
round: 10 : # before = 709 : # after = 706
iteratively checking within-tissue correlation, round: 11
Excluding only one outlier per bioproject or same tissue.
Partially removed BioProjects due to low within-tissue correlation:
[1] "PRJNA338450" "PRJNA541004" "PRJNA541005"
Removed Runs due to low within-tissue correlation:
[1] "SRR4017789" "SRR9007938" "SRR9007977"
round: 11 : # before = 706 : # after = 703
It is possible for the function to remove multiple runs at a time, if the runs are from different bioprojects and different tissue.
So for example, it can remove 1 root sample and 1 flower sample in one go, if they come from two different bioprojects.
From the output it seems this is what's happening, since the number of bioprojects and the number of Samples are always equal.
I was under the impression that this was the intended behaviour from this comment. https://github.com/kfuku52/amalgkit/issues/13#issuecomment-867596731
That said, for me it's much easier to code 'true' 1 outlier per iteration.
Thank you for clarifying. Did the current implementation solve the problem in Helianthus?
I haven't redone the csca
yet, but the problematic root samples were all successfully removed by the changes made to curate
.
Good! Could you share the curate
outputs?
Roots look better indeed, but SRR5803831 should have been removed too by the same change. Could you look into it?
Roots look better indeed, but SRR5803831 should have been removed too by the same change. Could you look into it?
Looks like the sample barely makes the threshold. It's got a pearson r of 0.21 (same tissue that is. Other tissues were even worse). Maybe 0.2 threshold is too little?
I see. Could you expose the threshold as a curate
option so users can adjust it? The threshold value of 0.2 would indeed be too permissive. Let's set the tentative default value to 0.3. Once we have a better idea of similar cases, we can increase the value more.
TPM_vs_FPKM_SVA.zip I was rerunning and playing around with the pearson threshold and tpm/fpkm to find a nice example for https://github.com/kfuku52/amalgkit/issues/64 And I found there is quite a noticeable difference between TPM_SVA and FPKM_SVA. Is this natural?
The data look natural to me. What do you mean by the noticeable differene?
The TPM final graph looks like it's performing better (I mixed up FPKM and TPM graphs in the .zip). The tissues look more uniform to me.
Also TPM went through one more round of outlier removal.
It is probably because Pearson's correlation coefficient is sensitive to outliers: i.e., extremely highly expressed genes. The sum of values is fixed in TPM, but not in FPKM, so FPKM tends to preserve outlier expression levels. Please look at Max expression
. FPKM has a wider dynamic range. If you want to get rid of the effect of dynamic ranges, you can introduce an option for the method for cor()
and choose Spearman instead of Pearson.
That makes a lot of sense, thanks for the explanation!
I see. Could you expose the threshold as a
curate
option so users can adjust it? The threshold value of 0.2 would indeed be too permissive. Let's set the tentative default value to 0.3. Once we have a better idea of similar cases, we can increase the value more.
--correlation_threshold
argument (default 0.3) --mapping_rate
expected int
as input, so the user couldn't actually change the mapping rate cutoffhttps://github.com/kfuku52/amalgkit/commit/36feb7d5260e39f6afd95331cd9ec6dd3a70c96b
--one_outlier_per_iter is passed to
transcriptome_curation.r
as a string "1" or "0", neither numeric nor boolean. Consequently,if(one_out_per_iter == TRUE)
always returns FALSE, so --one_outlier_per_iter doesn't change the behavior at all. @Hego-CCTB Did it work in your env?