jaam92 / Tigers

Tiger project 2021 with Ellie

Preprint Comments and Questions #1

Closed cwarden45 closed 4 months ago

cwarden45 commented 8 months ago

Hi,

First, thank you very much for posting this code and making the repository public. Part of an earlier comment related to not being able to see this link, so that is certainly now resolved - thank you!

Second, I noticed that both Jazlyn and Ellie are listed as contributors to this GitHub repository. So, I am not 100% sure who will respond. I was previously considering following up with Ellie, but that might be more because I have submitted an application to the Genetics, Genomics, and Bioinformatics PhD program at UC-Riverside (so a not-too-long drive to get to know the area may be a good idea for me). I was assuming that GitHub might be better for some questions than others, but finding the best way to discuss further might also be worthwhile.

It is probably good for me to remind myself of more details from the original comment.

However, in terms of trying to find an alternative way to describe the parts related to the genotypes and analysis:

1a) At least for my pet cat, it looks like the accuracy of the Gencove imputations was lower than reported by the company (which, to be fair, is probably not something unique to that particular company).

I would expect that tiger genetic variation is less well understood than domestic cat variation, so a thought is that the imputations for tigers might be less accurate than the imputations for my pet cat.

1b) I apologize if the content of my comment is hard to understand, and this may be upstream of the GitHub code. However, in general, I expect that you don't want to check the accuracy of a method on the same samples that were used to train it (or, ideally, even on samples closely related to the training samples).

If I use the comment to try to remember more of the earlier details, then I think I was optimistic that SRR836354 and SRR7651465 might meet the criteria of being samples not used to create the imputation model.

I also have a table in my locally saved notes that I did not include in the preprint comment:

| Sample | Original Depth | Study | Coverage Group | Corrected Subspecies Group |
| --- | --- | --- | --- | --- |
| GEN1 | 27.7 | This Study | Unimputed | Generic |
| AMU1 | 30.3 | Armstrong et al. 2020 | Unimputed | Amur |
| 5594-DP-0001_S3 | 42.3 | Armstrong et al. 2022 | Unimputed | Generic |
| MAL1 | 34.2 | Armstrong et al. 2020 | Unimputed | Malayan |
| SUM1 | 32.2 | Armstrong et al. 2020 | Unimputed | Sumatran |
| BEN_SI3 | 24.7 | Armstrong et al. 2020 | Unimputed | Bengal |
| SRR836354* | 31.2 | Cho et al. 2013 | Imputed* | From: Bengal, To: Generic |
| SRR7651465* | 25.4 | Northeast Forestry University | Imputed | From: Amur, To: Generic |
| SRR7651468* | 24.4 | Northeast Forestry University | Imputed | South China |

In Supplemental Table S1, I believe the asterisk for certain samples above may relate to “Individuals marked with * are imputed in this dataset but were separately called with GATK to examine concordance.”

If only looking at the table above, then I was initially confused about why I was asking whether the accuracy of the results in 5594-DP-0001_S3 might be over-estimated. However, I believe that might relate to the Supplemental Methods: “we developed a reference panel to impute variants for an additional 86 individuals (labeled as ‘imputed’ in Supplementary Data 1)”. For example, I believe that I had results like Supplementary Figure 15 in mind (for 5594-DP-0001_S3, but also SRR836354 and SRR7651465).

In other words, if the sample is not labeled as “Imputed,” then it may have been used to train the imputation model?
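To make the kind of check I have in mind concrete, below is a minimal Python sketch of a genotype concordance calculation between imputed calls and independent GATK calls for one held-out sample. This is my own sketch, not code from this repository: the file names are hypothetical, and I am assuming each VCF was first exported to a tab-delimited CHROM/POS/GT table (for example, with `bcftools query`).

```python
# Minimal sketch (mine, not from this repository): genotype concordance
# between imputed and GATK calls for one held-out sample. Assumes each
# VCF was exported to a tab-delimited CHROM/POS/GT table, e.g. with
# `bcftools query -f '%CHROM\t%POS[\t%GT]\n'`. File names are hypothetical.

def normalize(gt):
    """Treat 0|1, 1|0, and 0/1 as the same unphased genotype."""
    return "/".join(sorted(gt.replace("|", "/").split("/")))

def load_genotypes(path):
    """Read CHROM, POS, GT into a dict keyed by (chrom, pos)."""
    genotypes = {}
    with open(path) as handle:
        for line in handle:
            chrom, pos, gt = line.rstrip("\n").split("\t")
            genotypes[(chrom, int(pos))] = normalize(gt)
    return genotypes

imputed = load_genotypes("SRR836354.imputed.gt.txt")
gatk = load_genotypes("SRR836354.gatk.gt.txt")

shared = set(imputed) & set(gatk)
matches = sum(1 for site in shared if imputed[site] == gatk[site])
print(f"Sites compared: {len(shared)}")
print(f"Concordance: {100.0 * matches / len(shared):.2f}%")
```

If the preprint's concordance was computed in a broadly similar way, then running this separately for samples inside versus outside the reference panel is essentially what I was trying to ask about in 1b).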

2a) While I think maximal understanding should help with troubleshooting results, I believe a fair amount of what I try to do relates to critically assessing results (even with imperfect understanding). So, that is what I thought might be most appropriate for discussing on this GitHub repository. However, please let me know if I might have misunderstood anything.

So, I am not saying this is the absolute best solution, but I believe this is why I was asking about unsupervised ADMIXTURE analysis.
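For reference, this is roughly what I mean by an unsupervised run: a small Python wrapper that loops over several values of K with ADMIXTURE's cross-validation option. The merged PLINK fileset name is hypothetical, and I am going from the ADMIXTURE documentation rather than from this repository's scripts.

```python
# Sketch: unsupervised ADMIXTURE across several K values with
# cross-validation. Assumes `admixture` is on the PATH and a merged
# PLINK fileset (tigers.bed/.bim/.fam) exists; the name is hypothetical.
import subprocess

for k in range(2, 9):
    log_path = f"admixture_K{k}.log"
    with open(log_path, "w") as log:
        # --cv adds a cross-validation error line to the output, which
        # can then be compared across K values.
        subprocess.run(["admixture", "--cv", "tigers.bed", str(k)],
                       stdout=log, check=True)
    # Pull out the reported CV error for this K.
    with open(log_path) as log:
        for line in log:
            if "CV error" in line:
                print(line.strip())
```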

I understand that the genotype accuracy doesn't have to be perfect to achieve reasonably accurate results for the large contributions in the ADMIXTURE results. However, I would guess that true genotype accuracy dropping noticeably below 85% could have some effect, especially on lower fractions of estimated ancestry.

If the additional inaccurate genotypes have no systematic bias, then maybe it is hard for an imputation artifact to appear as an ADMIXTURE cluster. However, if there might be systematic bias, then my question relates to whether there may be a way to detect diversity that is unique to, or over-represented in, samples less related to the training samples.

2b) If I understand Ellie correctly from this video, then I thought there were some smaller contributions that were unexpected. While there are other observations that led me to want to ask the above question, I am not sure if that might help with critical assessment of what I have described above.

3) In the comment, I mentioned something related to RFMix. In terms of what I have described above, I would (in general) guess that less accurate imputations should cause more of an issue with local ancestry (with something like RFMix) than with global ancestry (with something like ADMIXTURE).
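To be clearer about what I mean by a local-ancestry run, here is a hedged sketch of an RFMix v2 invocation for one chromosome, again as a Python wrapper. Every file name is hypothetical, and I am going from RFMix's documented options rather than anything in this repository.

```python
# Sketch: local-ancestry inference with RFMix v2 for one chromosome,
# assuming phased query/reference VCFs, a reference sample map, and a
# genetic map already exist. All file names here are hypothetical.
import subprocess

subprocess.run([
    "rfmix",
    "-f", "query_imputed.phased.vcf.gz",    # samples to assign local ancestry
    "-r", "reference_panel.phased.vcf.gz",  # reference haplotypes
    "-m", "reference_sample_map.tsv",       # sample-to-subspecies labels
    "-g", "genetic_map.tsv",                # genetic map positions
    "-o", "rfmix_chrA1",                    # output prefix
    "--chromosome=A1",
], check=True)
```

My intuition here is that genotype errors within a short window can flip that window's ancestry assignment, whereas a genome-wide ADMIXTURE fraction averages over many sites, which is why I would expect imputation accuracy to matter more for the local analysis.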

However, I have downloaded the code from this repository, and I can take some more time to remember more of what I previously observed while I look through the code (as I am able to do so). In other words, if there can be a response/discussion related to 1) and 2), then I would very much appreciate that!

Thank you again! I believe this is an interesting topic of research, and I think this discussion may be of some amount of broader importance!

Sincerely, Charles

cwarden45 commented 4 months ago

Thank you again for posting the preprint versions and GitHub code.

I will take some time to think about the most tactful way for me to potentially have follow-up discussion. For example, I am grateful to have been accepted into the GGB PhD program at UCR, and I will start enrollment in that program later in the year.

However, something else caused me to take another look at the preprint, and I noticed that there was a "v2" version of the preprint as well as some added files on GitHub. So, I thought I should try to briefly add another comment.

For 1a) and 1b), I think Supplementary Table S3 in the updated Supplementary Tables may match what I was trying to describe (if the “SRR” samples were more like new samples, the other samples were used to define the reference haplotypes, and the accuracy was higher in the “training” reference samples than in the new samples).

However, to be fair, the ancestry results are similar, as can be seen in the uploaded ancestry_concordance_all.csv file. I believe this also matches the detailed description in the Supplementary Information, in both versions.

In general, I think that may match my experiences and expectations (in other contexts). In the above ancestry_concordance_all.csv example, lower fraction assignments are still consistent between lower coverage imputations (as expected from at least one of the “v1” Supplemental Figures). I am not sure if something could be more different at lower frequency estimates in other samples that were not used to define the reference haplotypes. However, I apologize that I should have mentioned this within the earlier parts of 2), and I believe there are additional factors to take into consideration (after re-watching the posted video).
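To make the comparison I am describing concrete, the following small Python sketch is the sort of thing I have in mind for the uploaded ancestry_concordance_all.csv: flagging per-sample ancestry components whose imputed and unimputed fraction estimates differ. The column names are my guesses about the file's layout, not the actual headers.

```python
# Sketch: compare per-sample ancestry fractions between imputed and
# unimputed calls using the uploaded ancestry_concordance_all.csv.
# The column names below are guesses, not the file's actual headers.
import csv

with open("ancestry_concordance_all.csv") as handle:
    for row in csv.DictReader(handle):
        diff = abs(float(row["imputed_fraction"]) -
                   float(row["unimputed_fraction"]))
        # Flag larger shifts, which I would expect to concentrate in
        # the lower-fraction ancestry components.
        if diff > 0.05:
            print(row["sample"], row["ancestry"], f"diff={diff:.3f}")
```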

If I understand correctly, then I think there may be agreement in terms of taking the end goal into consideration (such as looking at broad ancestry versus other possible applications). So, I have some things that I was/am curious about, but my impression is that these are not what is most important for the main conclusions in this preprint.

Thank you yet again!

Best Wishes, Charles