openSNP / snpr

The sources of the openSNP website
http://opensnp.org
MIT License
170 stars 46 forks source link

Store versions of SNP positions? #170

Closed philippbayer closed 9 years ago

philippbayer commented 9 years ago

So currently there's a minor confusion (again) with different SNP versions - we get the initial SNP's position from the genotyping we parse, so the "old" genotypings were GrCh37, the "new" genotypings have GrCh38, 23andMe switched over at some point.

At some point I did some work to lift over SNP positions to 38, should we display for each SNP from which human genome release version it is? We'd need one more field per SNP in the database, and of course the initial parsing would be a bit awkward, either get the version from a gigantic file of all SNPs or find a nice API that gives you SNP positions for rsID (Entrez?).

tsujigiri commented 9 years ago

How does this work exactly? The rsIDs exist in either version, but with a different position?

philippbayer commented 9 years ago

Yes exactly! The rsID is supposed to be unique and consistent (some get retired though), here's for example rs9939609: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=9939609

As you can see, it has two different positions for the human genome assembly versions GRCh38 and GrCh37 (there are many more prior to that which aren't listed anymore). There are contig positions too (genome assemblers first give you "contigs", tiny pieces of the chromosome - then you use other methods to sort these contigs into scaffolds, then into chromosomes) which we don't need.

Edit: In our database we have also many non-rsIDs, for example, SNP ids that are being used internally by 23andMe. For these we can't know the position (for older "MT"-SNPs the parsing job assigns the correct rsID), their assembly version field would have to be "NA" or something like that

tsujigiri commented 9 years ago

Wouldn't that information belong to Genotype then, since a single Snp can have different sources/positions per Genotype?

gedankenstuecke commented 9 years ago

Those updates to the reference genome don't come too often, so I would say we more or less manually transfer the data stored in the DB from one build to another (so far we've done 36 -> 37 -> 38) as I don't see too much benefit in having the versioning tbh.

The rsIDs are unique & consistent as Philipp pointed out, so that should lead to no problems. And for the genotype files users can upload/download we are currently not messing with the raw data, as it's not always possible to figure out the version in any case?

tsujigiri commented 9 years ago

So we basically only need a list of SNPs with their position according to the latest version and update the db accordingly on a regular basis?

philippbayer commented 9 years ago

Yes! But again, we can't know for 100% of our SNPs from which reference they come.

Also the list of SNPs is a bit huge (~1,000,000 SNPs), and that update of the db would be maybe yearly, maybe once every two years, I think there are not that many new human genome versions to come (there are ~four minor updates a year, but afaik these don't change the reference SNP positions)

gedankenstuecke commented 9 years ago

@philippbayer You mean the ones which are not rsIDs?

philippbayer commented 9 years ago

Yes, the 23andme internals etc.

gedankenstuecke commented 9 years ago

I see & agree. For those we will never know. There the only way is to wait until 23andMe themselves do the liftover.

tsujigiri commented 9 years ago

Is there a place where we can download such a list in an automated fashion?

gedankenstuecke commented 9 years ago

Yep, the dbSNP should work for that I guess: http://www.ncbi.nlm.nih.gov/SNP/

tsujigiri commented 9 years ago

That website...

tsujigiri commented 9 years ago

But that might just be me. To you this may make perfect sense... ;)

tsujigiri commented 9 years ago

There is a link to an FTP server on the left. If I only knew which file is the one...

tsujigiri commented 9 years ago

Maybe one of these? ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b141_GRCh38/database/organism_data/

gedankenstuecke commented 9 years ago

No, I wholeheartedly agree. There's a reason why many bioinformatics courses focus lots of time & energy of using the different NCBI webservices :sob:

I think the VCF file should be good enough for us, details see here: http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/

DL-Link should be ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b142_GRCh38/VCF/00-All.vcf.gz (warning: 2.4 GB)

tsujigiri commented 9 years ago

So, you literally need a degree for navigating this website? :D

gedankenstuecke commented 9 years ago

Yes. When Philipp and I did our degree there were at least two course days on using NCBI and now I'm teaching those courses ;-)

philippbayer commented 9 years ago

You know NCBI is being confusing when they don't offer "tutorials", they only offer "books"

gedankenstuecke commented 9 years ago

As it is written in the book of the words of NCBI-ah the prophet: "A voice of one calling in the wilderness, 'Prepare the way for the Lord, make straight $PATHs for him." :pray:

philippbayer commented 9 years ago

Once, I did export PATH=PATH:pwd`` It's not a straight path when ls can't tell you the way

But to get back on topic. I think when I did the liftover, I downloaded the above vcf as well, wrote a tiny python script to make a tab-delimited file of rs-ids -> positions to use for the liftover, then deleted it all. IIRC that was even before Capistrano.

gedankenstuecke commented 9 years ago

Sounds about right to me. Next time we do a liftover we shouldn't throw away the scripts for doing it though. :stuck_out_tongue_winking_eye: Do you think it makes sense to try a liftover to GRCh38 now or shall we close this until GRCh39 comes out? :D

philippbayer commented 9 years ago

Actually, the scripts are still there, I was just looking in the wrong, "new" folder of /srv, they are in /var/www (so before Capistrano)

all_positions.txt is GRCh38 so I must've done to liftover to 38, not 37. I wonder why the DAS is then at 37.

getPositions.rb has a nice, lazy way to get new positions for rs-iDs using the Ucsc genome browser API, so you can avoid NCBI's complicated ways :)

gedankenstuecke commented 9 years ago

Probably we didn't update our DAS reference tracks, that's why. :wink:
We should make sure that our DAS tracks we deliver give the correct meta data. And also that we are using a correct reference track. Here should be a GRCh38 version: http://www.biodalliance.org/human38.html

gedankenstuecke commented 9 years ago

Also see here: http://www.ensembl.org/das/dsn