Open francicco opened 9 months ago
But I'm a bit confused because in the PEPC dataset all the sequences in foreground.txt have the same id in the first column:
The same ID works fine with --fg_exclude_wg no
. However, if you prefer to assign different IDs to individual lineages, you do not need the --fg_exclude_wg
specification. While both settings lead to a similar analysis, there are subtle differences. For instance, in the first scenario, convergence within a lineage comprising multiple species is analyzed, which isn't the case in the second scenario.
also, are the sequences that I want to test the convergence all listed in the foreground.txt?
If I've understood you correctly, all genes of interest should indeed be listed in this file. Failing to do so makes it challenging to identify foreground branch combinations in the CSUBST output files.
For example, if I have 10 species to compare and I set --max_arity to 10, the csubst_cb_10.tsv will show if the convergence happened among all the 10 species, correct?
The csubst_cb_10.tsv file will show if there is convergence among all 10 species, but this is contingent on a couple of conditions. This outcome is expected if all 10 species are independent (with no sister pairs) and if there is a 9- or 10-way convergence that meets the thresholds for a higher-order convergence analysis (--cutoff_stat
).
If I use a threshold like in one of your example: --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0', does this mean that all the results in csubst_cb_k.tsv will indicate the branches under convergence?
It is important to note that not all results in csubst_cb_K.tsv will necessarily indicate branches under convergence. Only a subset of branch combinations in this file may meet the specified thresholds, and it could include non-convergent combinations as well. However, all convergence metrics are provided in the file, so you can verify each one.
Is there any associate pvalue, or something similar?
No, currently not supported.
Hi @kfuku52,
Thanks a lot. Now it's much more clear. Regarding the output. It looks extremely complex, with many values. Can you give me some tips on where to start? if I want to show a pattern of convergence in a paper for example, which values would be the best to report?
Thanks a lot F
Yes, please check this document out. https://github.com/kfuku52/csubst/wiki/Interpreting-csubst-outputs
Hi @kfuku52
I'm assigning foreground lineages the same id: 1. To be more precise I also set as foreground the parent nodes of the teminal branches when foreground leaves are monophiletic. and executing this command:
csubst analyze \
--alignment_file $OG.csubst.fasta \
--rooted_tree_file $OG.csubst.tree \
--foreground $OG.foreground.txt \
--fg_exclude_wg no \
--fg_stem_only no \
--cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0' \
--max_arity 10 \
--exhaustive_until 1
Unfortunately csubst
will only generate arity of 2, which is something I don't want, because I might have more than 2 internal branches that converge.
For example, in this gene tree I'd like to test also the roots of the 3 highlighted branches.
If I then assign to all the branches highlighted a different id and I execute csubst
without --fg_exclude_wg
, increasing --exhaustive_until
, csubst
will test all branches.
I do I fix that? Cheers F
Could you please clarify by visually indicating the branches you wish to analyze? I want to ensure I fully understand your question.
I'd want to test convergence among the clades highlighted in pink. The three of them have the same phenotype that I'd like to test for convergence. F
Then your command seems correct. Please note that CSUBST does not produce outputs for K>=3 if there is no convergent combinations at K=2.
Yes, but I have a lot of branches it seems odd it only finds pairs never a triplet. At the moment it processed 170 genes, and not a single triplet? Odd
Please send your data if you want me to check it.
There is only one (1/176) convergent branch combination that satisfies --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0'
at K=2. Please note that at least 2 branch combinations are required to proceed to the K+1 search.
branch_id_1 | branch_id_2 | dist_bl | dist_node_num | branch_num_fg | branch_num_mg | is_fg | is_mf | is_mg | omegaCany2any | omegaCany2dif | omegaCany2spe | omegaCdif2any | omegaCdif2dif | omegaCdif2spe | omegaCspe2any | omegaCspe2dif | omegaCspe2spe | dNCany2any | dNCany2dif | dNCany2spe | dNCdif2any | dNCdif2dif | dNCdif2spe | dNCspe2any | dNCspe2dif | dNCspe2spe | dSCany2any | dSCany2dif | dSCany2spe | dSCdif2any | dSCdif2dif | dSCdif2spe | dSCspe2any | dSCspe2dif | dSCspe2spe | OCNCoD | OCSCoD | OCN_linreg_residual | OCS_linreg_residual | N_sub_1 | N_sub_2 | S_sub_1 | S_sub_2 | OCNany2any | OCNany2dif | OCNany2spe | OCNdif2any | OCNdif2dif | OCNdif2spe | OCNspe2any | OCNspe2dif | OCNspe2spe | OCSany2any | OCSany2dif | OCSany2spe | OCSdif2any | OCSdif2dif | OCSdif2spe | OCSspe2any | OCSspe2dif | OCSspe2spe | ECNany2any | ECNany2dif | ECNany2spe | ECNdif2any | ECNdif2dif | ECNdif2spe | ECNspe2any | ECNspe2dif | ECNspe2spe | ECSany2any | ECSany2dif | ECSany2spe | ECSdif2any | ECSdif2dif | ECSdif2spe | ECSspe2any | ECSspe2dif | ECSspe2spe |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
68 | 71 | 0.0995 | 6 | 2 | 0 | Y | N | N | 6.0697 | 0.4132 | 44.6273 | 1.3437 | 0.6687 | 333.0791 | 8.9314 | 162.6748 | 44.6695 | 22.1525 | 1.4071 | 170.195 | 19.7146 | 9.9032 | 164.2055 | 22.4753 | 0.1951 | 170.5806 | 3.6497 | 3.4054 | 3.8137 | 14.6714 | 14.8106 | 0.493 | 2.5164 | 0.0012 | 3.8187 | 16.9493 | 1.6683 | 0.6952 | 0.0757 | 4.975 | 2.4457 | 1.6663 | 2.0483 | 2.4159 | 0.1346 | 2.2813 | 0.2514 | 0.1183 | 0.1332 | 2.1645 | 0.0163 | 2.1481 | 0.2263 | 0.0848 | 0.1415 | 0.0848 | 0.0848 | 0 | 0.1415 | 0 | 0.1414 | 0.1091 | 0.0957 | 0.0134 | 0.0128 | 0.0119 | 0.0008 | 0.0963 | 0.0837 | 0.0126 | 0.062 | 0.0249 | 0.0371 | 0.0058 | 0.0057 | 0.0001 | 0.0562 | 0.0192 | 0.037 |
Also, your foreground specification doesn't look what you've described.
What command did you execute? I tried giving the same id to all foreground (the file I sent you) and giving them different ids, from 1 to number_of_lineages. F
I used your command as is.
This is what I get with:
csubst analyze --alignment_file $OG.csubst.fasta --rooted_tree_file $OG.csubst.tree --foreground $OG.foreground.txt --fg_exclude_wg no --fg_stem_only no --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0' --max_arity 10 --exhaustive_until 1
csubst_cb_2.tsv.txt
If I execute this command with the input I sent you I never found a single triplet in ~170 genes.
If you confirm me that this configuration does what I'd like to do I'm ok with that.
Thanks a lot! F
The output looks identical to mine.
If you confirm me that this configuration does what I'd like to do I'm ok with that.
Your foreground specification doesn't look what you've described. See the tree above.
Ooooh, I think I know what you mean. My figure and your! yes I think I send you another gene tree. Sorry my bad!
I'll keep the analyses running this way then. Is that ok if I update you later on the outcome?
Thanks a lot F
Sure!
Hi @kfuku52,
I finish to run csubst
on my genes. Can I ask you few questions on how to interpret the data?
Maybe I can send you an email?
Cheers F
I prefer to make the discussion open to the public on GitHub, but if there is any confidential information involved, please feel free to send me an email.
No worries, I'll explain the plan. I have this phenotype and I want to identify candidate genes in a dataset of single-copy OGs. I'm doing 3 analyses.
The first question would be: what is the best threshold for CSUBST. If I'm interpreting correctly the wiki here, in the paper you used OCNany2spe
= 3.0 and omegaCany2spe
= 3.0 while the default for CSUBST is OCNany2spe
= 2.0 and omegaCany2spe
= 5.0. Since there's no pvalue associated to these values I'm not sure what threshold I should pick.
Because I didn't know the answer I explored the data and I did the following test. RELAX returns me a list of OG that show a putative signal of relaxation, intensification or no shift. My hypothesis is the following: In the genes with no shift in the selective pressure I don't expect to have strong convergence, when the relaxation occurs this can be due to a selective pressure that is not present anymore, genes accumulate more mutations that do not necessarily have a pattern of convergence, while the when the intensification occur genes are more likely to experience convergence.
To test that I've run csubst with OCNany2spe
= 1.0 and omegaCany2spe
= 1.0, and divided the genes into no-shift, relaxed and intensified, and then I checked if there's a shift in the distribution of omegaCany2spe between the groups.
I obtained 2047 with no-shift, 1224 with relaxation and 122 with intensification and it looks like the 122 genes with intensification had a significant shift in the distribution (p-value = 0.005) compared to both, with a median of 19.45 for both no-shift and relaxed vs 55.27 for the intensified genes.
Does this make sense? How can I further refine the csubst analysis? Is this approach somehow correct?
Cheers F
Thresholds should be adjusted based on the analysis type. For branch pair analysis (K = 2), I recommend using thresholds of OCNany2spe >= 3.0 and omegaCany2spe >= 3.0. For higher-order convergence analysis (K >= 3), setting OCNany2spe >= 2.0 and omegaCany2spe >= 5.0 could be beneficial.
It is logical to anticipate a link between intensified selection and adaptive convergence, and your analysis methodology seems solid. To enhance the robustness of your findings, it would be advisable to compare the dS_C values among the three groups. A higher dS_C in the intensified selection group may indicate an artifact. Conversely, similar dS_C values across groups would strengthen the validity of your claims.
Hi,
I checked dSCany2spe the medians for no-shoft and Relaxed is 0.3678, while in the Intensified group is lower, 0.143. The difference is slightly significant, with P = 0.046 one-sided, where no-shift and Relaxed are higher than Intensified. So, the opposite is as in cases where artifacts are present, if I understood correctly. F
Sounds good, but as I understand it, the one-sided test should aim to determine whether the intensified group exhibits a higher dS_C compared to the other two groups, to investigate the potential presence of artifacts, unless there are specific hypotheses suggesting that the intensified group could have a lower dS_C.
Yes, the wilcoxon test two-sided/one-sided (greater) shows no significance, if there's a difference is where dSCany2spe is higher in the no-shift and Relaxed groups compared with Intensified.
Hi,
It's the first time I'm trying to run this type of analysis, I guess I'd need to guidance. The test I'd like to run involves testing if the convergence is present among species with a defined phenotype in a set of single-copy OGs. Reading your very helpful wiki there are a few things that I should specify, but are not very clear to me.
--foreground
specifying the a tab-delimited file in which I specify an id (my choice?), and a corresponding species if (or a regexp). But I'm a bit confused because in the PEPC dataset all the sequences inforeground.txt
have the same id in the first column:also, are the sequences that I want to test the convergence all listed in the
foreground.txt
?There are other flags that can be applied such as
--max_arity
or--fg_exclude_wg
don't quite understand when to use them and which value to set. It seems like arity (k) is the number of branches I want to compare, if I set 2, csubst will only give pairwise comparisons, but if I set a higher number it will output the results up to that number. For example, if I have 10 species to compare and I set--max_arity
to 10, thecsubst_cb_10.tsv
will show if the convergence happened among all the 10 species, correct?Which threshold to use and how to interpret the results? If I use a threshold like in one of your example:
--cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0'
, does this mean that all the results incsubst_cb_k.tsv
will indicate the branches under convergence? Is there any associate pvalue, or something similar?Enough question atm. Thanks a lot for your help Best F