kfuku52 / csubst

Molecular convergence detection
BSD 3-Clause "New" or "Revised" License
25 stars 1 forks source link

Questions and general guidance #51

Open francicco opened 9 months ago

francicco commented 9 months ago

Hi,

It's the first time I'm trying to run this type of analysis, I guess I'd need to guidance. The test I'd like to run involves testing if the convergence is present among species with a defined phenotype in a set of single-copy OGs. Reading your very helpful wiki there are a few things that I should specify, but are not very clear to me.

  1. --foreground specifying the a tab-delimited file in which I specify an id (my choice?), and a corresponding species if (or a regexp). But I'm a bit confused because in the PEPC dataset all the sequences in foreground.txt have the same id in the first column:
1   Alternanthera_pungens_AY950665
1   Amaranthus_hypochondriacus_Z68125
1   Bienertia_sinuspersici_DQ538352
1   Cyperus_sanguinolentus_FM208046
1   Eleocharis_vivipara_AB085948
1   Fimbristylis_littoralis_FM208037
1   Flaveria_australasica_Z25853
1   Setaria_italica_AF495586
1   Sorghum_bicolor_X17379
1   Suaeda_aralocaspica_DQ538353
1   Suaeda_eltonica_DQ538354
1   Zea_mays_X15239

also, are the sequences that I want to test the convergence all listed in the foreground.txt?

  1. There are other flags that can be applied such as --max_arity or --fg_exclude_wg don't quite understand when to use them and which value to set. It seems like arity (k) is the number of branches I want to compare, if I set 2, csubst will only give pairwise comparisons, but if I set a higher number it will output the results up to that number. For example, if I have 10 species to compare and I set --max_arity to 10, the csubst_cb_10.tsv will show if the convergence happened among all the 10 species, correct?

  2. Which threshold to use and how to interpret the results? If I use a threshold like in one of your example: --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0', does this mean that all the results in csubst_cb_k.tsv will indicate the branches under convergence? Is there any associate pvalue, or something similar?

Enough question atm. Thanks a lot for your help Best F

kfuku52 commented 9 months ago

But I'm a bit confused because in the PEPC dataset all the sequences in foreground.txt have the same id in the first column:

The same ID works fine with --fg_exclude_wg no. However, if you prefer to assign different IDs to individual lineages, you do not need the --fg_exclude_wg specification. While both settings lead to a similar analysis, there are subtle differences. For instance, in the first scenario, convergence within a lineage comprising multiple species is analyzed, which isn't the case in the second scenario.

also, are the sequences that I want to test the convergence all listed in the foreground.txt?

If I've understood you correctly, all genes of interest should indeed be listed in this file. Failing to do so makes it challenging to identify foreground branch combinations in the CSUBST output files.

For example, if I have 10 species to compare and I set --max_arity to 10, the csubst_cb_10.tsv will show if the convergence happened among all the 10 species, correct?

The csubst_cb_10.tsv file will show if there is convergence among all 10 species, but this is contingent on a couple of conditions. This outcome is expected if all 10 species are independent (with no sister pairs) and if there is a 9- or 10-way convergence that meets the thresholds for a higher-order convergence analysis (--cutoff_stat).

If I use a threshold like in one of your example: --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0', does this mean that all the results in csubst_cb_k.tsv will indicate the branches under convergence?

It is important to note that not all results in csubst_cb_K.tsv will necessarily indicate branches under convergence. Only a subset of branch combinations in this file may meet the specified thresholds, and it could include non-convergent combinations as well. However, all convergence metrics are provided in the file, so you can verify each one.

Is there any associate pvalue, or something similar?

No, currently not supported.

francicco commented 9 months ago

Hi @kfuku52,

Thanks a lot. Now it's much more clear. Regarding the output. It looks extremely complex, with many values. Can you give me some tips on where to start? if I want to show a pattern of convergence in a paper for example, which values would be the best to report?

Thanks a lot F

kfuku52 commented 9 months ago

Yes, please check this document out. https://github.com/kfuku52/csubst/wiki/Interpreting-csubst-outputs

francicco commented 9 months ago

Hi @kfuku52

I'm assigning foreground lineages the same id: 1. To be more precise I also set as foreground the parent nodes of the teminal branches when foreground leaves are monophiletic. and executing this command:

csubst analyze \
                --alignment_file $OG.csubst.fasta \
                --rooted_tree_file $OG.csubst.tree \
                --foreground $OG.foreground.txt \
                --fg_exclude_wg no \
                --fg_stem_only no \
                --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0' \
                --max_arity 10 \
                --exhaustive_until 1

Unfortunately csubst will only generate arity of 2, which is something I don't want, because I might have more than 2 internal branches that converge.

Screenshot 2023-11-24 at 10 28 37 AM

For example, in this gene tree I'd like to test also the roots of the 3 highlighted branches.

If I then assign to all the branches highlighted a different id and I execute csubst without --fg_exclude_wg, increasing --exhaustive_until, csubst will test all branches.

I do I fix that? Cheers F

kfuku52 commented 9 months ago

Could you please clarify by visually indicating the branches you wish to analyze? I want to ensure I fully understand your question.

francicco commented 9 months ago

I'd want to test convergence among the clades highlighted in pink. The three of them have the same phenotype that I'd like to test for convergence. F

kfuku52 commented 9 months ago

Then your command seems correct. Please note that CSUBST does not produce outputs for K>=3 if there is no convergent combinations at K=2.

francicco commented 9 months ago

Yes, but I have a lot of branches it seems odd it only finds pairs never a triplet. At the moment it processed 170 genes, and not a single triplet? Odd

kfuku52 commented 9 months ago

Please send your data if you want me to check it.

francicco commented 9 months ago

Thanks a lot!

OG_10275.foreground.txt OG_10275.csubst.tree.txt OG_10275.csubst.fasta.txt

F

kfuku52 commented 9 months ago

There is only one (1/176) convergent branch combination that satisfies --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0' at K=2. Please note that at least 2 branch combinations are required to proceed to the K+1 search.

branch_id_1 branch_id_2 dist_bl dist_node_num branch_num_fg branch_num_mg is_fg is_mf is_mg omegaCany2any omegaCany2dif omegaCany2spe omegaCdif2any omegaCdif2dif omegaCdif2spe omegaCspe2any omegaCspe2dif omegaCspe2spe dNCany2any dNCany2dif dNCany2spe dNCdif2any dNCdif2dif dNCdif2spe dNCspe2any dNCspe2dif dNCspe2spe dSCany2any dSCany2dif dSCany2spe dSCdif2any dSCdif2dif dSCdif2spe dSCspe2any dSCspe2dif dSCspe2spe OCNCoD OCSCoD OCN_linreg_residual OCS_linreg_residual N_sub_1 N_sub_2 S_sub_1 S_sub_2 OCNany2any OCNany2dif OCNany2spe OCNdif2any OCNdif2dif OCNdif2spe OCNspe2any OCNspe2dif OCNspe2spe OCSany2any OCSany2dif OCSany2spe OCSdif2any OCSdif2dif OCSdif2spe OCSspe2any OCSspe2dif OCSspe2spe ECNany2any ECNany2dif ECNany2spe ECNdif2any ECNdif2dif ECNdif2spe ECNspe2any ECNspe2dif ECNspe2spe ECSany2any ECSany2dif ECSany2spe ECSdif2any ECSdif2dif ECSdif2spe ECSspe2any ECSspe2dif ECSspe2spe
68 71 0.0995 6 2 0 Y N N 6.0697 0.4132 44.6273 1.3437 0.6687 333.0791 8.9314 162.6748 44.6695 22.1525 1.4071 170.195 19.7146 9.9032 164.2055 22.4753 0.1951 170.5806 3.6497 3.4054 3.8137 14.6714 14.8106 0.493 2.5164 0.0012 3.8187 16.9493 1.6683 0.6952 0.0757 4.975 2.4457 1.6663 2.0483 2.4159 0.1346 2.2813 0.2514 0.1183 0.1332 2.1645 0.0163 2.1481 0.2263 0.0848 0.1415 0.0848 0.0848 0 0.1415 0 0.1414 0.1091 0.0957 0.0134 0.0128 0.0119 0.0008 0.0963 0.0837 0.0126 0.062 0.0249 0.0371 0.0058 0.0057 0.0001 0.0562 0.0192 0.037

Also, your foreground specification doesn't look what you've described.

image
francicco commented 9 months ago

What command did you execute? I tried giving the same id to all foreground (the file I sent you) and giving them different ids, from 1 to number_of_lineages. F

kfuku52 commented 9 months ago

I used your command as is.

francicco commented 9 months ago

This is what I get with: csubst analyze --alignment_file $OG.csubst.fasta --rooted_tree_file $OG.csubst.tree --foreground $OG.foreground.txt --fg_exclude_wg no --fg_stem_only no --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0' --max_arity 10 --exhaustive_until 1 csubst_cb_2.tsv.txt

If I execute this command with the input I sent you I never found a single triplet in ~170 genes.

If you confirm me that this configuration does what I'd like to do I'm ok with that.

Thanks a lot! F

kfuku52 commented 9 months ago

The output looks identical to mine.

If you confirm me that this configuration does what I'd like to do I'm ok with that.

Your foreground specification doesn't look what you've described. See the tree above.

francicco commented 9 months ago

Ooooh, I think I know what you mean. My figure and your! yes I think I send you another gene tree. Sorry my bad!

I'll keep the analyses running this way then. Is that ok if I update you later on the outcome?

Thanks a lot F

kfuku52 commented 9 months ago

Sure!

francicco commented 7 months ago

Hi @kfuku52,

I finish to run csubst on my genes. Can I ask you few questions on how to interpret the data? Maybe I can send you an email?

Cheers F

kfuku52 commented 7 months ago

I prefer to make the discussion open to the public on GitHub, but if there is any confidential information involved, please feel free to send me an email.

francicco commented 7 months ago

No worries, I'll explain the plan. I have this phenotype and I want to identify candidate genes in a dataset of single-copy OGs. I'm doing 3 analyses.

  1. BUSTED-PH - Species w/ phenotype Vs no-phenotype
  2. RELAX - Species w/ phenotype Vs no-phenotype
  3. CSUBST - Species w/ phenotype

The first question would be: what is the best threshold for CSUBST. If I'm interpreting correctly the wiki here, in the paper you used OCNany2spe = 3.0 and omegaCany2spe = 3.0 while the default for CSUBST is OCNany2spe = 2.0 and omegaCany2spe = 5.0. Since there's no pvalue associated to these values I'm not sure what threshold I should pick.

Because I didn't know the answer I explored the data and I did the following test. RELAX returns me a list of OG that show a putative signal of relaxation, intensification or no shift. My hypothesis is the following: In the genes with no shift in the selective pressure I don't expect to have strong convergence, when the relaxation occurs this can be due to a selective pressure that is not present anymore, genes accumulate more mutations that do not necessarily have a pattern of convergence, while the when the intensification occur genes are more likely to experience convergence. To test that I've run csubst with OCNany2spe = 1.0 and omegaCany2spe = 1.0, and divided the genes into no-shift, relaxed and intensified, and then I checked if there's a shift in the distribution of omegaCany2spe between the groups. I obtained 2047 with no-shift, 1224 with relaxation and 122 with intensification and it looks like the 122 genes with intensification had a significant shift in the distribution (p-value = 0.005) compared to both, with a median of 19.45 for both no-shift and relaxed vs 55.27 for the intensified genes. Does this make sense? How can I further refine the csubst analysis? Is this approach somehow correct?

Cheers F

kfuku52 commented 7 months ago

Thresholds should be adjusted based on the analysis type. For branch pair analysis (K = 2), I recommend using thresholds of OCNany2spe >= 3.0 and omegaCany2spe >= 3.0. For higher-order convergence analysis (K >= 3), setting OCNany2spe >= 2.0 and omegaCany2spe >= 5.0 could be beneficial.

It is logical to anticipate a link between intensified selection and adaptive convergence, and your analysis methodology seems solid. To enhance the robustness of your findings, it would be advisable to compare the dS_C values among the three groups. A higher dS_C in the intensified selection group may indicate an artifact. Conversely, similar dS_C values across groups would strengthen the validity of your claims.

francicco commented 7 months ago

Hi,

I checked dSCany2spe the medians for no-shoft and Relaxed is 0.3678, while in the Intensified group is lower, 0.143. The difference is slightly significant, with P = 0.046 one-sided, where no-shift and Relaxed are higher than Intensified. So, the opposite is as in cases where artifacts are present, if I understood correctly. F

kfuku52 commented 7 months ago

Sounds good, but as I understand it, the one-sided test should aim to determine whether the intensified group exhibits a higher dS_C compared to the other two groups, to investigate the potential presence of artifacts, unless there are specific hypotheses suggesting that the intensified group could have a lower dS_C.

francicco commented 7 months ago

Yes, the wilcoxon test two-sided/one-sided (greater) shows no significance, if there's a difference is where dSCany2spe is higher in the no-shift and Relaxed groups compared with Intensified.