plantimals / 2vcf

convert 23andme or Ancestry.com raw genotype calls into VCF format, with dbSNP annotations
MIT License
Missing markers after conversion #13

Closed lakishadavid closed 5 years ago

lakishadavid commented 5 years ago

I'm converting AncestryDNA zip files to vcf and noticing that 2vcf has removed some of my markers in the output file.

Many of the AncestryDNA files have this description:

`#AncestryDNA raw data download

This file was generated by AncestryDNA at: 07/27/2018 16:39:29 UTC

Data was collected using AncestryDNA array version: V2.0

Data is formatted using AncestryDNA converter version: V1.0`

In my .zip file downloaded from Ancestry.com, I have markers rs369202065 and rs199476136, but they do not show up in the output vcf file (with rs199476136 also not showing up in the GRCh37.p13 reference file).

As far as I can tell, the markers that do transfer over are correct.
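One way to list exactly which markers were dropped is to compare the IDs in the raw file with the ID column of the output VCF. The following is a minimal sketch, not part of 2vcf: it assumes the raw file is tab-separated with `#` comment lines and an `rsid` header row, and the function names are hypothetical.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def raw_rsids(path):
    """Collect marker IDs from a raw genotype file (assumed tab-separated)."""
    ids = set()
    with open_text(path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue  # skip comment and blank lines
            first = line.split("\t", 1)[0].strip()
            if first.lower() == "rsid":
                continue  # skip the column header row
            ids.add(first)
    return ids


def vcf_rsids(path):
    """Collect IDs from the third (ID) column of a VCF."""
    ids = set()
    with open_text(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip meta and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 2:
                # the ID column may hold several semicolon-separated IDs
                ids.update(i for i in fields[2].split(";") if i != ".")
    return ids


def missing_markers(raw_path, vcf_path):
    """Return raw-file markers that do not appear in the VCF, sorted."""
    return sorted(raw_rsids(raw_path) - vcf_rsids(vcf_path))
```

Calling `missing_markers("AncestryDNA.txt", "output.vcf")` (both file names are placeholders) returns the IDs present in the raw data but absent from the converted file.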

plantimals commented 5 years ago

thank you for the report @lakishadavid , I am investigating now.

plantimals commented 5 years ago

@lakishadavid that RS_ID ( rs369202065 ) is not present in the reference I include with this utility, which is dbSNP build 146 https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi?view+summary=view+summary&build_id=146 . Build 146 was released in 2015, and rs369202065 was published in 2013, so it's possible it just didn't make the build. I'll see about revving the version of the included reference. I'll ping this issue when I have a PR.

plantimals commented 5 years ago

@lakishadavid please find our most recent release https://github.com/plantimals/2vcf/releases/tag/v0.4.0 and the most recent reference: http://openb.io/2vcf/2vcf-v2.0.vcf.gz

I did not find either of your missing RSID's in the illumina manifest files, but I did manually add them to the reference VCF mentioned above.

would you be willing to cut the column of RSID's out of your ancestry data and attach it to this issue? I haven't been able to find a listing of their custom sites. if you have the time, I will take that list and re-filter the dbSNP file to make an augmented reference. if you have the bandwidth and time to wait while it runs, you can also use the full dbSNP reference vcf with 2vcf, it'll just take a really long time to filter through all 15GB of it.
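For anyone wanting to pull that column out of a downloaded archive, here is a minimal sketch. It assumes the `.zip` contains one tab-separated `.txt` file with `#` comment lines and an `rsid` header row; the function names and file names are hypothetical and not part of 2vcf.

```python
import io
import zipfile


def rsid_column(lines):
    """Yield the first column of each data row, skipping comments and the header."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid = line.split("\t", 1)[0]
        if rsid.lower() == "rsid":
            continue  # column header row
        yield rsid


def write_rsids_from_zip(zip_path, out_path):
    """Write one marker ID per line to out_path; return how many were written."""
    count = 0
    with zipfile.ZipFile(zip_path) as zf:
        members = [n for n in zf.namelist() if n.lower().endswith(".txt")]
        if not members:
            raise ValueError("no .txt member found in archive")
        # assumption: the first .txt member is the raw genotype file
        with zf.open(members[0]) as raw, open(out_path, "w") as out:
            text = io.TextIOWrapper(raw, encoding="utf-8")
            for rsid in rsid_column(text):
                out.write(rsid + "\n")
                count += 1
    return count
```

The resulting text file contains only marker IDs and no genotype calls, so it is much smaller than the raw data and easier to attach to an issue.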

thanks for your interest in 2vcf, please let me know if there's any other way I can help.

lakishadavid commented 5 years ago

Dear Rob,

Thank you for your responsiveness! Sure, I'll send you a file with the RSIDs from the AncestryDNA file. I'm currently collecting 48 samples for the H3Africa array which has 2.26 million markers (formatted like 23andMe files) so it may be best for me to go ahead and use the full dbSNP reference vcf. The 15GB file does take a long time so I'll figure out how to run 2vcf on Microsoft Azure. You'll hear from me in a few hours with the RSID file.

LaKisha

plantimals commented 5 years ago

@lakishadavid I am happy to turn this around as quickly as you would like. if you can get me the 2.26M RSID's, I can generate an updated reference for you tonight. if you're running 48 of them, that's 48 times that 2vcf would have to traverse that 15GB reference.

plantimals commented 5 years ago

@lakishadavid good news, I was able to find the chip manifest ( https://chipinfo.h3abionet.org/help ). There are about 1.5M RSID's that were not previously included, so I pulled those in and am now generating a new reference. I'll upload it in an hour or so and give you a path to download it.
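The idea of building a smaller reference from a list of RSIDs can be sketched as a single streaming pass over the large VCF. This is only an illustration under stated assumptions, not the script used to build the 2vcf references: the function name is hypothetical, and both input and output are assumed to be gzip-compressed.

```python
import gzip


def filter_reference(ref_path, wanted, out_path):
    """Copy header lines and any record whose ID is in `wanted`; return records kept."""
    kept = 0
    with gzip.open(ref_path, "rt") as ref, gzip.open(out_path, "wt") as out:
        for line in ref:
            if line.startswith("#"):
                out.write(line)  # keep meta and header lines unchanged
                continue
            parts = line.split("\t", 3)  # only CHROM, POS, ID need splitting
            if len(parts) < 3:
                continue  # malformed record
            # the ID column may hold several semicolon-separated IDs
            if any(i in wanted for i in parts[2].split(";")):
                out.write(line)
                kept += 1
    return kept
```

Passing `wanted` as a `set` keeps each lookup constant-time, so the cost is dominated by reading the large file once, instead of once per sample. Note that the output here is plain gzip; tools that need an indexed reference generally expect bgzip compression instead.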

plantimals commented 5 years ago

@lakishadavid the new reference, which includes the previous 23andme & ancestry sites, but also the H3 sites, can be downloaded here: http://openb.io/2vcf/2vcf-h3-array.vcf.gz

lakishadavid commented 5 years ago

Wow, that's incredible! Thank you so much!

lakishadavid commented 5 years ago

Do you still want that Ancestry file?

plantimals commented 5 years ago

@lakishadavid if you sent a sample to Ancestry for genotyping, I'd be happy to see the RSID's, just to see which sites they're calling.

if the raw reads come from the H3A chip, you should be covered with the augmented reference. I'm not sure what your analysis is after or what exact inputs you have, but if you need anything else, like if you find out that there's still some missing RSID's, or any other issue, let me know and I'll be happy to help.

lakishadavid commented 5 years ago

Thank you, Rob. Please see the OneDrive link below for the AncestryDNA RSID file. The two that I gave as examples of missing SNPs are the first two (rs369202065 and rs199476136). I'm confused as to why Ancestry has these listed as being on chromosome 1 when dbSNP has them on MT, even with the new naming between builds. In any event, thanks for sending the H3A vcf file to use as the 2vcf reference.
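A check for this kind of chromosome disagreement could look like the sketch below. It is an illustration only: the function names are hypothetical, the inputs are plain `(rsid, chromosome)` pairs that the caller has already parsed, and any numeric chromosome aliases (for example a vendor code for mitochondrial DNA) are an assumption the caller must supply and verify.

```python
def normalize_chrom(chrom, aliases=None):
    """Normalize a chromosome label: drop a 'chr' prefix, map M to MT, apply aliases."""
    c = chrom.strip()
    if c.lower().startswith("chr"):
        c = c[3:]
    c = c.upper()
    if c == "M":
        c = "MT"
    if aliases:
        c = aliases.get(c, c)  # caller-supplied vendor codes, unverified here
    return c


def chrom_mismatches(raw_rows, ref_rows, aliases=None):
    """Return (rsid, raw_chrom, ref_chrom) for markers placed on different chromosomes.

    raw_rows and ref_rows are iterables of (rsid, chromosome) pairs.
    Markers absent from the reference are ignored.
    """
    ref = {rsid: normalize_chrom(ch) for rsid, ch in ref_rows}
    mismatches = []
    for rsid, ch in raw_rows:
        if rsid in ref:
            got = normalize_chrom(ch, aliases)
            if got != ref[rsid]:
                mismatches.append((rsid, got, ref[rsid]))
    return mismatches
```

Running this over the raw file and the reference would surface every marker whose chromosome differs between the two sources, which helps separate a naming difference from a genuine placement conflict.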

After merging my samples into one vcf file, I use BEAGLE (phasing) and Refined IBD (identify matching segments). The problem was that BEAGLE>Refined IBD placed matching segments on different chromosomes than where other utilities placed them. That's what led me to realize that there were some missing RSIDs. I don't think that was the source of the incorrect output, but it definitely will be helpful to have this updated reference to make greater use of the H3A markers.

I'm listing my use of 2vcf on my website (https://takir.org).

Thanks again,

AncestryDNA https://1drv.ms/u/s!ApCvnOfVuOmmjKgb-i-uNvcCQWDCUQ?e=MyvN2H

plantimals commented 5 years ago

thanks so much for the RSID's. I appreciate it. I'm not sure why ancestry had that SNP on the wrong chromosome.

I've been planning on taking this project further, to make imputation/phasing and IBD calling, as you describe it, much easier for people to do without a lot of data munging. if you're interested, I'd like to line up the steps you describe with some sample data and see if I can bring all of it into order. would you be up for making 1:1 contact, perhaps via email, and talking more about the details? if so, please contact me, and we can coordinate on working through this.

plantimals commented 5 years ago

I'm closing this issue as resolved for the moment. If anything else arises relevant to 2vcf from this collaboration, we'll open a new issue.