hringbauer / ancIBD

Detecting IBD within low coverage ancient DNA data. Development Repository for software package that contains code for manuscript.
GNU General Public License v3.0

filters/snps_bcftools_chX.csv #15

Open zmaroti opened 7 months ago

zmaroti commented 7 months ago

Hi,

Is it intentional that not all SNPs from the 1240K marker set are included in the filter files used to restrict imputed SNPs to the 1240K set? During the hdf5 import the imputed GTs are filtered by these coordinates, so you always "lose" the non-included markers even though they are present in the imputed GTs.

I am aware that some SNPs included in the 1240K CHIP have lower concordance with true shotgun WGS data. However, not all data come from the CHIP, so removing these SNPs from a WGS-only dataset is not required. Moreover, we are talking about imputed GTs anyway, where the other markers in LD were already used to infer the diploid phased haplotypes over a large genomic chunk based on gold-standard WGS reference data (assuming the non-concordant SNPs sit at random positions, imputation should already fix their errors). Accordingly, if these files contain fewer markers in order to avoid "bad markers", then this kind of marker removal should happen before the imputation step for CHIP data and not at the IBD identification step. That way imputation should improve for CHIP data while true shotgun WGS is unaffected. Furthermore, that approach would not thin the markers at the IBD detection step for either WGS or CHIP data, and you should still be able to co-analyze mixed datasets.

But again, I reserve the right to be dumb/ignorant, and it may well be that I am unaware of some other valid reason to remove these markers. Could you please elaborate on why ~50k (~4.4%) autosomal 1240K markers are excluded at filtering?

Regards, Zoltan
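
For readers who want to reproduce the count of excluded markers, here is a minimal sketch of the comparison. It assumes both marker lists are available as headerless, tab-separated files with chromosome and position columns; that layout and the helper names `read_positions` and `excluded_markers` are assumptions for illustration and not part of ancIBD. Toy in-memory data stands in for the real files.

```python
import io


def read_positions(handle, sep="\t"):
    """Read (chrom, pos) pairs from a two-column text file without a header.

    The two-column, tab-separated layout is an assumption of this sketch.
    """
    positions = set()
    for line in handle:
        line = line.strip()
        if not line:
            continue
        chrom, pos = line.split(sep)[:2]
        positions.add((chrom, int(pos)))
    return positions


def excluded_markers(full_set, filter_set):
    """Return the markers of full_set missing from filter_set, and their fraction."""
    missing = sorted(full_set - filter_set)
    fraction = len(missing) / len(full_set) if full_set else 0.0
    return missing, fraction


# Toy data standing in for a full 1240K list and a filter file.
full = read_positions(io.StringIO("1\t100\n1\t200\n1\t300\n1\t400\n"))
filt = read_positions(io.StringIO("1\t100\n1\t300\n1\t400\n"))

missing, fraction = excluded_markers(full, filt)
print(missing, f"{fraction:.1%}")
```

With real data, the same two calls could be run per chromosome on opened file handles instead of the `io.StringIO` stand-ins.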

hringbauer commented 7 months ago

There is no deeper reason for this - we had to choose one of the "1240k SNP sets" out there. Those 1240k SNP sets are curated and filtered to various degrees.

We picked a more filtered SNP set used by several aDNA labs to be sure that everyone has a superset in their imputed vcf (potentially already downsampled to some 1240k set).

You could also keep the "full" 1240k SNP set in the hdf5 creation but, in practice, a few percent more or less SNPs (after using all the data in imputation) should make very little difference.

Only keep in mind that if you choose a drastically different SNP set, our "default" parameters will not be optimal anymore, and our "testing" results, including recommended thresholds, will not apply any longer.

zmaroti commented 7 months ago

Thanks for the explanation. We impute all common 1KG positions (which nearly fully overlap with the 1240K marker set). As we only used true shotgun WGS (not capture data) for imputation, I am confident that we can slightly increase the marker count without influencing our results (I agree that ~4.5% more random markers shouldn't make much difference). Since the tutorial gives the recommended thresholds for the 1240K subset, we did not try to use the whole imputed SNP set, especially since that marker set is too large and the run would likely take ages. I am still thinking about a strategy to thin the imputed data based on AF, linkage, and the GLIMPSE2 imputation quality estimate, so that we get a more or less evenly dense marker set that is informative for our test individuals.

About the recommended thresholds and marker sets used: I will send you an email soonish that addresses the sensitivity and specificity of the IBD filtering step (implemented in the create_ind_ibd_df() function) and also the "masking" region issue. I am just compiling all the supporting data, the explanation of the strategy, and an implementation to test.

Sincerely yours, Zoltan Maroti
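
The thinning idea mentioned above could look roughly like the following sketch. It is not an ancIBD or GLIMPSE2 feature: the function name `thin_markers`, the thresholds, and the window size are all hypothetical. It filters on allele frequency and an imputation quality score, then keeps at most one marker per fixed-size window to get a roughly even density. Linkage is not modeled here.

```python
from collections import defaultdict


def thin_markers(markers, window=1000, min_maf=0.05, min_info=0.8, per_window=1):
    """Keep at most `per_window` markers per window after MAF/quality filtering.

    markers: iterable of (pos, maf, info) tuples for one chromosome.
    Within a window, higher info is preferred, then higher MAF.
    All thresholds are placeholder values, not recommendations.
    """
    bins = defaultdict(list)
    for pos, maf, info in markers:
        if maf >= min_maf and info >= min_info:
            bins[pos // window].append((pos, maf, info))

    kept = []
    for idx in sorted(bins):
        ranked = sorted(bins[idx], key=lambda m: (m[2], m[1]), reverse=True)
        kept.extend(ranked[:per_window])
    return sorted(kept)


# Toy markers: (position, minor allele frequency, imputation quality score).
toy = [
    (120, 0.30, 0.95),
    (480, 0.10, 0.99),
    (900, 0.01, 0.99),   # dropped: MAF below threshold
    (1500, 0.25, 0.70),  # dropped: quality below threshold
    (1800, 0.20, 0.85),
    (2100, 0.40, 0.90),
    (2150, 0.45, 0.90),
]

thinned = thin_markers(toy)
print([pos for pos, _, _ in thinned])
```

Positions could equally be given in map units rather than base pairs. As noted earlier in the thread, the default parameters and recommended thresholds were tested on the 1240K set, so any thinned set would need its own validation.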

zmaroti commented 7 months ago

Dear Prof Ringbauer,

I have been experimenting with the ancIBD package, and based on the results I am proposing a solution for the masking problem/IBD filtering. Please find attached a document with my analysis and conclusions. I am also sending a tool so that you can hopefully test the proposed method on synthetic data with known truth to evaluate the sensitivity/specificity of the algorithm.

Sincerely yours, Zoltan Maroti