Questions and general guidance

francicco commented 9 months ago

Hi,

It's the first time I'm trying to run this type of analysis, I guess I'd need to guidance. The test I'd like to run involves testing if the convergence is present among species with a defined phenotype in a set of single-copy OGs. Reading your very helpful wiki there are a few things that I should specify, but are not very clear to me.

--foreground specifying the a tab-delimited file in which I specify an id (my choice?), and a corresponding species if (or a regexp). But I'm a bit confused because in the PEPC dataset all the sequences in foreground.txt have the same id in the first column:

1   Alternanthera_pungens_AY950665
1   Amaranthus_hypochondriacus_Z68125
1   Bienertia_sinuspersici_DQ538352
1   Cyperus_sanguinolentus_FM208046
1   Eleocharis_vivipara_AB085948
1   Fimbristylis_littoralis_FM208037
1   Flaveria_australasica_Z25853
1   Setaria_italica_AF495586
1   Sorghum_bicolor_X17379
1   Suaeda_aralocaspica_DQ538353
1   Suaeda_eltonica_DQ538354
1   Zea_mays_X15239

also, are the sequences that I want to test the convergence all listed in the foreground.txt?

There are other flags that can be applied such as --max_arity or --fg_exclude_wg don't quite understand when to use them and which value to set. It seems like arity (k) is the number of branches I want to compare, if I set 2, csubst will only give pairwise comparisons, but if I set a higher number it will output the results up to that number. For example, if I have 10 species to compare and I set --max_arity to 10, the csubst_cb_10.tsv will show if the convergence happened among all the 10 species, correct?
Which threshold to use and how to interpret the results? If I use a threshold like in one of your example: --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0', does this mean that all the results in csubst_cb_k.tsv will indicate the branches under convergence? Is there any associate pvalue, or something similar?

Enough question atm. Thanks a lot for your help Best F

kfuku52 commented 9 months ago

But I'm a bit confused because in the PEPC dataset all the sequences in foreground.txt have the same id in the first column:

The same ID works fine with --fg_exclude_wg no. However, if you prefer to assign different IDs to individual lineages, you do not need the --fg_exclude_wg specification. While both settings lead to a similar analysis, there are subtle differences. For instance, in the first scenario, convergence within a lineage comprising multiple species is analyzed, which isn't the case in the second scenario.

also, are the sequences that I want to test the convergence all listed in the foreground.txt?

If I've understood you correctly, all genes of interest should indeed be listed in this file. Failing to do so makes it challenging to identify foreground branch combinations in the CSUBST output files.

For example, if I have 10 species to compare and I set --max_arity to 10, the csubst_cb_10.tsv will show if the convergence happened among all the 10 species, correct?

The csubst_cb_10.tsv file will show if there is convergence among all 10 species, but this is contingent on a couple of conditions. This outcome is expected if all 10 species are independent (with no sister pairs) and if there is a 9- or 10-way convergence that meets the thresholds for a higher-order convergence analysis (--cutoff_stat).

If I use a threshold like in one of your example: --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0', does this mean that all the results in csubst_cb_k.tsv will indicate the branches under convergence?

It is important to note that not all results in csubst_cb_K.tsv will necessarily indicate branches under convergence. Only a subset of branch combinations in this file may meet the specified thresholds, and it could include non-convergent combinations as well. However, all convergence metrics are provided in the file, so you can verify each one.

Is there any associate pvalue, or something similar?

No, currently not supported.

francicco commented 9 months ago

Hi @kfuku52,

Thanks a lot. Now it's much more clear. Regarding the output. It looks extremely complex, with many values. Can you give me some tips on where to start? if I want to show a pattern of convergence in a paper for example, which values would be the best to report?

Thanks a lot F

kfuku52 commented 9 months ago

Yes, please check this document out. https://github.com/kfuku52/csubst/wiki/Interpreting-csubst-outputs

francicco commented 9 months ago

Hi @kfuku52

I'm assigning foreground lineages the same id: 1. To be more precise I also set as foreground the parent nodes of the teminal branches when foreground leaves are monophiletic. and executing this command:

csubst analyze \
                --alignment_file $OG.csubst.fasta \
                --rooted_tree_file $OG.csubst.tree \
                --foreground $OG.foreground.txt \
                --fg_exclude_wg no \
                --fg_stem_only no \
                --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0' \
                --max_arity 10 \
                --exhaustive_until 1

Unfortunately csubst will only generate arity of 2, which is something I don't want, because I might have more than 2 internal branches that converge.

For example, in this gene tree I'd like to test also the roots of the 3 highlighted branches.

If I then assign to all the branches highlighted a different id and I execute csubst without --fg_exclude_wg, increasing --exhaustive_until, csubst will test all branches.

I do I fix that? Cheers F

kfuku52 commented 9 months ago

Could you please clarify by visually indicating the branches you wish to analyze? I want to ensure I fully understand your question.

francicco commented 9 months ago

I'd want to test convergence among the clades highlighted in pink. The three of them have the same phenotype that I'd like to test for convergence. F

kfuku52 commented 9 months ago

Then your command seems correct. Please note that CSUBST does not produce outputs for K>=3 if there is no convergent combinations at K=2.

francicco commented 9 months ago

Yes, but I have a lot of branches it seems odd it only finds pairs never a triplet. At the moment it processed 170 genes, and not a single triplet? Odd

kfuku52 commented 9 months ago

Please send your data if you want me to check it.

francicco commented 9 months ago

Thanks a lot!

OG_10275.foreground.txt OG_10275.csubst.tree.txt OG_10275.csubst.fasta.txt

F

kfuku52 commented 9 months ago

There is only one (1/176) convergent branch combination that satisfies --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0' at K=2. Please note that at least 2 branch combinations are required to proceed to the K+1 search.

branch_id_1	branch_id_2	dist_bl	dist_node_num	branch_num_fg	branch_num_mg	is_fg	is_mf	is_mg	omegaCany2any	omegaCany2dif	omegaCany2spe	omegaCdif2any	omegaCdif2dif	omegaCdif2spe	omegaCspe2any	omegaCspe2dif	omegaCspe2spe	dNCany2any	dNCany2dif	dNCany2spe	dNCdif2any	dNCdif2dif	dNCdif2spe	dNCspe2any	dNCspe2dif	dNCspe2spe	dSCany2any	dSCany2dif	dSCany2spe	dSCdif2any	dSCdif2dif	dSCdif2spe	dSCspe2any	dSCspe2dif	dSCspe2spe	OCNCoD	OCSCoD	OCN_linreg_residual	OCS_linreg_residual	N_sub_1	N_sub_2	S_sub_1	S_sub_2	OCNany2any	OCNany2dif	OCNany2spe	OCNdif2any	OCNdif2dif	OCNdif2spe	OCNspe2any	OCNspe2dif	OCNspe2spe	OCSany2any	OCSany2dif	OCSany2spe	OCSdif2any	OCSdif2dif	OCSdif2spe	OCSspe2any	OCSspe2dif	OCSspe2spe	ECNany2any	ECNany2dif	ECNany2spe	ECNdif2any	ECNdif2dif	ECNdif2spe	ECNspe2any	ECNspe2dif	ECNspe2spe	ECSany2any	ECSany2dif	ECSany2spe	ECSdif2any	ECSdif2dif	ECSdif2spe	ECSspe2any	ECSspe2dif	ECSspe2spe
68	71	0.0995	6	2	0	Y	N	N	6.0697	0.4132	44.6273	1.3437	0.6687	333.0791	8.9314	162.6748	44.6695	22.1525	1.4071	170.195	19.7146	9.9032	164.2055	22.4753	0.1951	170.5806	3.6497	3.4054	3.8137	14.6714	14.8106	0.493	2.5164	0.0012	3.8187	16.9493	1.6683	0.6952	0.0757	4.975	2.4457	1.6663	2.0483	2.4159	0.1346	2.2813	0.2514	0.1183	0.1332	2.1645	0.0163	2.1481	0.2263	0.0848	0.1415	0.0848	0.0848	0	0.1415	0	0.1414	0.1091	0.0957	0.0134	0.0128	0.0119	0.0008	0.0963	0.0837	0.0126	0.062	0.0249	0.0371	0.0058	0.0057	0.0001	0.0562	0.0192	0.037

Also, your foreground specification doesn't look what you've described.

francicco commented 9 months ago

What command did you execute? I tried giving the same id to all foreground (the file I sent you) and giving them different ids, from 1 to number_of_lineages. F

kfuku52 commented 9 months ago

I used your command as is.

francicco commented 9 months ago

This is what I get with: csubst analyze --alignment_file $OG.csubst.fasta --rooted_tree_file $OG.csubst.tree --foreground $OG.foreground.txt --fg_exclude_wg no --fg_stem_only no --cutoff_stat 'OCNany2spe,2.0|omegaCany2spe,5.0' --max_arity 10 --exhaustive_until 1 csubst_cb_2.tsv.txt

If I execute this command with the input I sent you I never found a single triplet in ~170 genes.

If you confirm me that this configuration does what I'd like to do I'm ok with that.

Thanks a lot! F

kfuku52 commented 9 months ago

The output looks identical to mine.

If you confirm me that this configuration does what I'd like to do I'm ok with that.

Your foreground specification doesn't look what you've described. See the tree above.

francicco commented 9 months ago

Ooooh, I think I know what you mean. My figure and your! yes I think I send you another gene tree. Sorry my bad!

I'll keep the analyses running this way then. Is that ok if I update you later on the outcome?

Thanks a lot F

kfuku52 commented 9 months ago

Sure!

francicco commented 7 months ago

Hi @kfuku52,

I finish to run csubst on my genes. Can I ask you few questions on how to interpret the data? Maybe I can send you an email?

Cheers F

kfuku52 commented 7 months ago

I prefer to make the discussion open to the public on GitHub, but if there is any confidential information involved, please feel free to send me an email.

francicco commented 7 months ago

No worries, I'll explain the plan. I have this phenotype and I want to identify candidate genes in a dataset of single-copy OGs. I'm doing 3 analyses.

BUSTED-PH - Species w/ phenotype Vs no-phenotype
RELAX - Species w/ phenotype Vs no-phenotype
CSUBST - Species w/ phenotype

The first question would be: what is the best threshold for CSUBST. If I'm interpreting correctly the wiki here, in the paper you used OCNany2spe = 3.0 and omegaCany2spe = 3.0 while the default for CSUBST is OCNany2spe = 2.0 and omegaCany2spe = 5.0. Since there's no pvalue associated to these values I'm not sure what threshold I should pick.

Because I didn't know the answer I explored the data and I did the following test. RELAX returns me a list of OG that show a putative signal of relaxation, intensification or no shift. My hypothesis is the following: In the genes with no shift in the selective pressure I don't expect to have strong convergence, when the relaxation occurs this can be due to a selective pressure that is not present anymore, genes accumulate more mutations that do not necessarily have a pattern of convergence, while the when the intensification occur genes are more likely to experience convergence. To test that I've run csubst with OCNany2spe = 1.0 and omegaCany2spe = 1.0, and divided the genes into no-shift, relaxed and intensified, and then I checked if there's a shift in the distribution of omegaCany2spe between the groups. I obtained 2047 with no-shift, 1224 with relaxation and 122 with intensification and it looks like the 122 genes with intensification had a significant shift in the distribution (p-value = 0.005) compared to both, with a median of 19.45 for both no-shift and relaxed vs 55.27 for the intensified genes. Does this make sense? How can I further refine the csubst analysis? Is this approach somehow correct?

Cheers F

kfuku52 commented 7 months ago

Thresholds should be adjusted based on the analysis type. For branch pair analysis (K = 2), I recommend using thresholds of OCNany2spe >= 3.0 and omegaCany2spe >= 3.0. For higher-order convergence analysis (K >= 3), setting OCNany2spe >= 2.0 and omegaCany2spe >= 5.0 could be beneficial.

It is logical to anticipate a link between intensified selection and adaptive convergence, and your analysis methodology seems solid. To enhance the robustness of your findings, it would be advisable to compare the dS_C values among the three groups. A higher dS_C in the intensified selection group may indicate an artifact. Conversely, similar dS_C values across groups would strengthen the validity of your claims.

francicco commented 7 months ago

Hi,

I checked dSCany2spe the medians for no-shoft and Relaxed is 0.3678, while in the Intensified group is lower, 0.143. The difference is slightly significant, with P = 0.046 one-sided, where no-shift and Relaxed are higher than Intensified. So, the opposite is as in cases where artifacts are present, if I understood correctly. F

kfuku52 commented 7 months ago

Sounds good, but as I understand it, the one-sided test should aim to determine whether the intensified group exhibits a higher dS_C compared to the other two groups, to investigate the potential presence of artifacts, unless there are specific hypotheses suggesting that the intensified group could have a lower dS_C.

francicco commented 7 months ago

Yes, the wilcoxon test two-sided/one-sided (greater) shows no significance, if there's a difference is where dSCany2spe is higher in the no-shift and Relaxed groups compared with Intensified.

kfuku52 / csubst

Questions and general guidance #51