error with output for dapc_ascore in file dapc.K.#.#.scatter.dapc.WITH_prior_popinfo.axis1vs2.dc_score.txt

bmillerlab commented 2 years ago

Hello, hopefully I can explain this correctly by submitting some sample files to show what I mean. I've been trying to use SambaR to best advantage to explore structure in my very large SNP dataset: many samples (>300) and many SNPs scored originally, but SambaR has "identified" ~1200 SNPs as the most highly significant (setting indmiss and snpmiss as best I can based on your 2021 paper), so the computations really aren't too bad. I kept thinking I was not understanding the ascore output correctly when I looked at the dc_score file, but today I reran my analysis after removing a few populations that have "dirtier data", and I finally had the AHA moment while comparing the dc_score.txt files between the WITH_prior_popinfo and the WITHOUT_prior_popinfo runs for the different population datasets.

The WITHOUT_prior_popinfo files appear to have totally normal output where every variation of K# per optimum number of PCs has a value of dc_score which varies as expected (from 0.0 up to >1.0, as clustering gets more or less potentially meaningful in the DAPC).

But the WITH_prior_popinfo files have exactly the same dc_score in every file regardless of the K# per optimum number of PCs. So there is no real way to decide if the varying K# tested is getting a better (closer to 0) or worse (>1) dc_score to evaluate the DAPC output. Worse for me, I can't really meaningfully tell if I should keep using the WITH_prior_popinfo because the dc_score supports a better DAPC analysis when I do. So far the WITH_prior_popinfo single score that I get repeatedly appears to support using the a priori population information since it is closer to 0 (about 0.5 for the weighted_meandc, so not really great), but it is the same for every K# tested. When I look at the WITHOUT_prior_popinfo weighted_meandc it is higher, ranging from ~0.8 to over 1.2 depending on the K# being tested. But I only have one WITH_prior_popinfo weighted_meandc to compare, and I don't even know if it was calculated correctly for K=2 (I'm guessing the first score is being calculated and then the K value is not changing in every additional calculation for dc_score that is run, but that is just a guess).

I am finding the output from SambaR very helpful in exploring my hypothesis for my populations for structure overall, but this problem is making the DAPC part confusing. Would it be possible to find out how to fix this problem? Please let me know if I need to give you additional information to get help. Thank you!! I am adding 5 WITH_prior_popinfo.txt files to show the problem of them being the same. I could add more, but they are all the same ....

dapc.K.2.15.scatter.dapc.WITH_prior_popinfo.axis1vs2.dc_score.txt
dapc.K.3.15.scatter.dapc.WITH_prior_popinfo.axis1vs2.dc_score.txt
dapc.K.4.15.scatter.dapc.WITH_prior_popinfo.axis1vs2.dc_score.txt
dapc.K.5.15.scatter.dapc.WITH_prior_popinfo.axis1vs2.dc_score.txt
dapc.K.6.15.scatter.dapc.WITH_prior_popinfo.axis1vs2.dc_score.txt
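
For anyone who wants to confirm that such files really are identical rather than comparing them by eye, a checksum comparison is enough. Below is a minimal base R sketch. The helper name `all_files_identical` and the demo file names are made up for illustration, and the temporary files only stand in for real SambaR output.

```r
# Hypothetical helper: TRUE if every file in 'files' has the same MD5 checksum.
all_files_identical <- function(files) {
  stopifnot(length(files) > 0, all(file.exists(files)))
  checksums <- tools::md5sum(files)
  length(unique(unname(checksums))) == 1
}

# Demo on throwaway files that mimic the naming pattern (contents are invented).
demo_dir <- tempfile("dc_score_demo")
dir.create(demo_dir)
for (k in 2:4) {
  writeLines("placeholder score line",
             file.path(demo_dir, sprintf("dapc.K.%d.demo.dc_score.txt", k)))
}
demo_files <- list.files(demo_dir, pattern = "dc_score\\.txt$", full.names = TRUE)
same_before <- all_files_identical(demo_files)  # all three files hold the same line

# Overwrite one file with different content and run the check again.
writeLines("a different score line", demo_files[1])
same_after <- all_files_identical(demo_files)
```

On real output, `demo_files` would be replaced by the result of a `list.files()` call on the SambaR output folder, with a pattern that matches the dc_score files of interest.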

mennodejong1986 commented 2 years ago

I am not an expert in DAPC analyses but here are a few of my thoughts:

The labels 'WITH_prior_popinfo' and 'WITHOUT_prior_popinfo' were confusing. Therefore, I just uploaded a new version of SambaR (still '1.07') which uses different, more telling labels, namely 'PRIOR_POPINFO' and 'INFERRED_CLUSTERS'. These labels better convey the difference between the two approaches. DAPC needs an a priori defined grouping. This can be a grouping specified a priori by the user (e.g., mygenlight@pop), but alternatively the clustering can be inferred using the find.clusters() function. SambaR runs DAPC using both options. The differences between K=2, K=3, K=4, etc., affect the find.clusters() function only. Given that find.clusters() is not used for the 'PRIOR_POPINFO' approach, the dc_scores are identical within each folder. So it is not an error, but expected behaviour.
The dc-score, which is implemented in SambaR, is not an accepted metric to evaluate DAPC output. It is just a score I included to measure whether clusters (e.g., populations) are distinct or instead overlap with each other. Accepted metrics are the a-score and the cross-validation score, which in theory can guide you in deciding on the appropriate number of PCs. The BIC score can help to decide on the appropriate number of clusters (if using the INFERRED_CLUSTERS approach).
Given the complexity of DAPC analyses, and the sensitivity to the settings, you could perhaps focus instead on the outcome of PCA or PCoA analyses, or hierarchical approaches (e.g., NJ or OLS clustering of, for example, Euclidean distance matrices). These methods are less sensitive to settings and the underlying methods are easier to comprehend.
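
To make the first point concrete, here is a rough sketch of the two ways of supplying groups to DAPC, using adegenet's `dapc()` and `find.clusters()` directly. This is not SambaR's internal code. The toy data, the object names and the choice of `n.pca = 10` are assumptions made purely for illustration.

```r
library(adegenet)  # assumed to be installed; provides find.clusters() and dapc()

set.seed(1)
# Toy data (invented): 60 individuals x 50 numeric markers, with each of the
# three predefined populations shifted by a different constant.
prior_pop <- factor(rep(c("popA", "popB", "popC"), each = 20))
x <- matrix(rnorm(60 * 50), nrow = 60) + as.integer(prior_pop)

# Prior-grouping run: the groups are supplied by the user, so K is never used
# and there is only one possible result.
dapc_prior <- dapc(x, grp = prior_pop, n.pca = 10, n.da = 2)

# Inferred-cluster runs: the groups come from find.clusters(), so both the
# grouping and everything computed from it depend on K.
inferred_sizes <- list()
for (k in 2:4) {
  inferred <- find.clusters(x, n.pca = 10, n.clust = k)
  dapc_inferred <- dapc(x, grp = inferred$grp, n.pca = 10, n.da = min(2, k - 1))
  inferred_sizes[[as.character(k)]] <- as.vector(table(inferred$grp))
}
inferred_sizes
```

Any score derived from the grouping will change with K in the loop. For the prior-grouping run there is no K to vary, so repeating it gives the same result each time.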

Hope this helps!

bmillerlab commented 2 years ago

Hi, thanks for the information. I'm confused about updating though, as I actually just used the link from github: source("https://github.com/mennodejong1986/SambaR/raw/master/SAMBAR_v1.07.txt") to install this program recently, so I think I have the latest version. Is there a different link I should be using?

I am working on the other types of analyses output from SambaR more now. It's all a big process for me because I have done analysis with things like 16S datasets and microsatellites, but never with SNPs and such giant datasets.

I did use the BIC score to set the number of clusters (that is why I set the maximum clusters to consider at 15) and the a_score supports using 14 PCs. I also did put in the a priori population information for the analysis. I wanted to use the dc_score to see whether the WITHOUT a priori population clusters were perhaps better, as I actually do have some questions about whether some of the populations (which vary in distances apart, some quite close together) should be merged together. Since the output for the WITHOUT population cluster dc_score changes in every K# file, I did not realize it should not change in the WITH population information. I need to study this more, clearly. In the future, I guess I am going to use adegenet and rerun the analysis using the subset of "highly informative" SNPs output by SambaR to get some more specifics if I decide I want to use DAPC. I tried to use adegenet before on my full dataset of SNPs, but it is much too large and I couldn't get output for any type of principal component analysis (hence my excitement when I read your SambaR paper).

I will also look into using BPA, but my sample sizes hugely vary and only a few of the populations have >30 total samples, so I'm kind of stuck with validity issues for that analysis.

Thanks so much for your quick reply and the additional information.

mennodejong1986 commented 2 years ago

Hi, yes, you are right, you will have the latest version (including changes I made yesterday) if you use the github link. The panel of highly informative SNPs is potentially useful for population assignment studies using a limited number of SNPs (assuming it is worth the initial effort of the lab work to produce markers), but other than that I would not use such biased SNP subsets for structure analyses. Better to use a random SNP dataset. Also, the BPA-test is not a widely accepted method, and is suited only for population assignment once populations have been clearly defined. So perhaps better to go with LEA admixture analyses or similar methods (e.g., ADMIXTURE software). Good luck!
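
If a full dataset is too large to handle, drawing a random SNP subset can be as simple as sampling column indices. Below is a minimal base R sketch on an invented genotype matrix (individuals in rows, SNPs in columns, coded 0/1/2). The object names and the subset size of 500 are arbitrary illustrative choices, not SambaR settings.

```r
set.seed(42)  # fixed seed so the same random subset can be drawn again

# Invented genotype matrix: 30 individuals x 5000 SNPs, coded 0/1/2
geno <- matrix(sample(0:2, 30 * 5000, replace = TRUE), nrow = 30,
               dimnames = list(paste0("ind", 1:30), paste0("snp", 1:5000)))

n_keep <- 500                                   # arbitrary subset size
keep_cols <- sort(sample(ncol(geno), n_keep))   # sampled without replacement
geno_random <- geno[, keep_cols, drop = FALSE]  # keeps every individual

dim(geno_random)
```

Because the columns are chosen without reference to population labels or allele frequencies, the subset avoids the selection bias of a panel picked for being highly informative.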

bmillerlab commented 2 years ago

Thanks for the additional helpful information. Fingers crossed on my end!!

mennodejong1986 / SambaR

error with output for dapc_ascore in file dapc.K.#.#.scatter.dapc.WITH_prior_popinfo.axis1vs2.dc_score.txt #21