Preprint Comments and Questions

Hi,

First, thank you very much for posting this code and making the repository public. For example, part of an earlier comment relates to not being able to see this link. So, that is certainly now resolved - thank you!

Second, I noticed that both Jazlyn and Ellie are listed as contributors to this GitHub repository. So, I am not 100% sure who will respond. I was previously considering following with with Ellie, but that might be more because I have submitted an application to the Genetics, Genomics, and Bioinformatics PhD program at UC-Riverside (so, a not-too-long drive to get to know the area may be a good idea, for me). I was assuming that GitHub might be better for some questions more than others, but finding the best way to discuss more might also be worthwhile.

It is probably good for me to remind myself of more details from the original comment.

However, in terms of trying to try to find an alternative way to describe the parts related to the genotypes and analysis:

1a) At least for my pet cat, it looks like the accuracy for the Gencove imputations were lower than reported from the company (which, which to be fair, is probably not something unique for that particular company).

I would expect that tiger genetic variation is less well understood than domestic cat variation, so a thought is that the imputations for tigers might be less accurate than the imputations for my pet cat.

1b) I apologize if it is hard to understand the content of my comment and this may be upstream of the GitHub code. However, in general, I expect that you don't want to check the accuracy of the method in the same samples that were used to train the method (or, perhaps ideally, with samples closely related to the training samples).

If I use the comment to try and remember more of the earlier details, then I think I was optimistic that SRR836354 and SRR7651465 might meet the criteria of being samples not used to create the imputation model.

I also have a table in my locally saved notes that I did not create in the preprint post:

	Original Depth	Study	Coverage Group	Corrected Subspecies Group
GEN1	27.7	This Study	Unimputed	Generic
AMU1	30.3	Armstrong et al. 2020	Unimputed	Amur
5594-DP-0001_S3	42.3	Armstrong et al. 2022	Unimputed	Generic
MAL1	34.2	Armstrong et al. 2020	Unimputed	Malayan
SUM1	32.2	Armstrong et al. 2020	Unimputed	Sumatran
BEN_SI3	24.7	Armstrong et al. 2020	Unimputed	Bengal
SRR836354*	31.2	Cho et al. 2013	Imputed*	From: Bengal To: Generic
SRR7651465*	25.4	Northeast Forestry University	Imputed	From: Amur To: Generic
SRR7651468*	24.4	Northeast Forestry University	Imputed	South China

In Supplemental Table S1, I believe the asterisk for certain samples above may relate to “Individuals marked with are imputed in this dataset but were separately called with GATK to examine concordance.”*

If only looking at the table above, then I was initially confused why I was asking if the accuracy of the results in 5594-DP-0001_S3 might be over-estimated. However, I believe that might relate to the Supplemental Methods: “we developed a reference panel to impute variants for an additional 86 individuals (labeled as ‘imputed’ in Supplementary Data 1)”. For example, I believe that I had results like Supplementary Figure 15 in mind (for 5594-DP-0001_S3, but also SRR836354 and SRR7651465)

In other words, if the sample is not labeled as “Imputed,” then it may have been used to train the imputation model?

2a) While I think maximal understanding should help with troubleshooting results, I believe a fair amount of what I try to do relates to trying to critically assess results (even with imperfect understanding). So, that is what I thought might potentially be most appropriate for discussing on this GitHub repository. However, please let me know if I might have misunderstood anything.

So, I am not saying this is the absolute best solution, but I believe this is why I was asking about unsupervised ADMIXTURE analysis.

I understand that the genotype accuracy doesn't have to be perfect to achieve reasonably accurate results for large contributions to the ADMIXTURE results. However, I would guess that true genotype accuracy that drops more noticeably below 85% percent could have some effect, especially on lower fractions of estimated ancestry.

If the additional inaccurate genotypes have no systematic bias, then maybe it is hard for an artifact in the imputation so appear as an ADMIXURE cluster. However, if there might be systematic bias, then my questions relates to whether there may be a way to calculate diversity that might be unique or over-represented in samples less related to the training samples.

2b) If I understand Ellie correctly from this video, then I thought there were some smaller contributions that were unexpected. While I think there are other observations that I made to want to ask the above question, I am not sure if that might help with critical assessment of what I have described above.

3) In the comment, I mentioned something related to RFMix. In terms of what I have described above, I would (in general) guess less accurate imputations should cause more of an issue with local ancestry (with something like RFMix) than global ancestry (with something like ADMIXTURE).

However, I have downloaded the code from this repository, and I can take some more time to remember more of what I previously observed while I can look through the code (as I am able to do so). In other words, if there can be a response/discussion related to 1) and 2), then I would very much appreciate that!

Thank you again! I believe this is an interesting topic of research, and I think this discussion may be of some amount of broader importance!

Sincerely, Charles

jaam92 / Tigers

Preprint Comments and Questions #1