Closed: philippbayer closed this issue 9 years ago
How does this work exactly? The rsIDs exist in either version, but with a different position?
Yes, exactly! The rsID is supposed to be unique and consistent (though some get retired). Here, for example, is rs9939609: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=9939609
As you can see, it has two different positions, one per human genome assembly version, GRCh38 and GRCh37 (there are many more prior to those which aren't listed anymore). There are contig positions too (genome assemblers first give you "contigs", tiny pieces of the chromosome; other methods then sort these contigs into scaffolds, and the scaffolds into chromosomes), but we don't need those.
Edit: Our database also contains many non-rsIDs, for example SNP ids that are used internally by 23andMe. For these we can't know the position (for older "MT" SNPs the parsing job assigns the correct rsID); their assembly version field would have to be "NA" or something like that.
Wouldn't that information belong to `Genotype` then, since a single `Snp` can have different sources/positions per `Genotype`?
Those updates to the reference genome don't come too often, so I would say we more or less manually transfer the data stored in the DB from one build to another (so far we've done 36 -> 37 -> 38) as I don't see too much benefit in having the versioning tbh.
The rsIDs are unique & consistent, as Philipp pointed out, so that should lead to no problems. And for the genotype files users can upload/download we are currently not touching the raw data, as it's not always possible to figure out the version anyway.
So we basically only need a list of SNPs with their position according to the latest version and update the db accordingly on a regular basis?
Yes! But again, we can't know for 100% of our SNPs from which reference they come.
Also, the list of SNPs is a bit huge (~1,000,000 SNPs), and that update of the db would happen maybe yearly, maybe once every two years. I think there are not that many new human genome versions to come (there are ~four minor updates a year, but afaik those don't change the reference SNP positions).
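Just so we're on the same page, here's a rough sketch of that "update the db on a regular basis" idea, assuming a lookup table (rsid -> (chromosome, position)) built from the latest dbSNP release. SNPs without an rsID (23andMe internals etc.) can't be looked up, so their assembly field stays "NA". Plain dicts stand in for db rows; all names and coordinates are illustrative, not our actual schema.

```python
def update_positions(snps, latest, build="GRCh38"):
    """Overwrite each SNP's position from the latest build's lookup table."""
    for snp in snps:
        hit = latest.get(snp["name"])
        if hit is not None:
            snp["chromosome"], snp["position"] = hit
            snp["assembly"] = build
        else:
            snp["assembly"] = "NA"  # e.g. internal 23andMe ids
    return snps

snps = [
    {"name": "rs9939609", "chromosome": None, "position": None, "assembly": None},
    {"name": "i5009342", "chromosome": None, "position": None, "assembly": None},
]
latest = {"rs9939609": ("16", 53786615)}  # coordinate illustrative
update_positions(snps, latest)
```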
@philippbayer You mean the ones which are not rsIDs?
Yes, the 23andme internals etc.
I see & agree. For those we will never know; the only way is to wait until 23andMe themselves do the liftover.
Is there a place where we can download such a list in an automated fashion?
Yep, the dbSNP should work for that I guess: http://www.ncbi.nlm.nih.gov/SNP/
That website...
But that might just be me. To you this may make perfect sense... ;)
There is a link to an FTP server on the left. If I only knew which file is the one...
Maybe one of these? ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b141_GRCh38/database/organism_data/
No, I wholeheartedly agree. There's a reason why many bioinformatics courses focus lots of time & energy on using the different NCBI webservices :sob:
I think the VCF file should be good enough for us, details see here: http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/
DL-Link should be ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b142_GRCh38/VCF/00-All.vcf.gz (warning: 2.4 GB)
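In case it helps the next liftover: pulling an rsID -> position table out of that VCF is essentially just the first three columns (CHROM, POS, ID per the VCF spec). A minimal Python sketch; the in-memory sample and its coordinates are illustrative, for the real 2.4 GB file you'd read via `gzip.open("00-All.vcf.gz", "rt")` instead:

```python
import io

def vcf_to_positions(stream):
    """Yield (rsid, chrom, pos) tuples from a VCF text stream."""
    for line in stream:
        if line.startswith("#"):  # skip meta and header lines
            continue
        chrom, pos, rsid = line.split("\t")[:3]
        yield rsid, chrom, int(pos)

# Tiny in-memory example standing in for the downloaded file.
sample = io.StringIO(
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "16\t53786615\trs9939609\tT\tA\t.\t.\t.\n"
)
rows = list(vcf_to_positions(sample))
```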
So, you literally need a degree for navigating this website? :D
Yes. When Philipp and I did our degree there were at least two course days on using NCBI and now I'm teaching those courses ;-)
You know NCBI is being confusing when they don't offer "tutorials", they only offer "books"
As it is written in the book of the words of NCBI-ah the prophet: "A voice of one calling in the wilderness, 'Prepare the way for the Lord, make straight $PATHs for him." :pray:
Once, I did ``export PATH=PATH:`pwd` ``
It's not a straight path when `ls` can't tell you the way
But to get back on topic. I think when I did the liftover, I downloaded the above vcf as well, wrote a tiny python script to make a tab-delimited file of rs-ids -> positions to use for the liftover, then deleted it all. IIRC that was even before Capistrano.
Sounds about right to me. Next time we do a liftover we shouldn't throw away the scripts for doing it though. :stuck_out_tongue_winking_eye: Do you think it makes sense to try a liftover to GRCh38 now or shall we close this until GRCh39 comes out? :D
Actually, the scripts are still there, I was just looking in the wrong, "new" folder of /srv, they are in /var/www (so before Capistrano)
all_positions.txt is GRCh38, so I must've done the liftover to 38, not 37. I wonder why the DAS is then at 37.
getPositions.rb has a nice, lazy way to get new positions for rs-IDs using the UCSC genome browser API, so you can avoid NCBI's complicated ways :)
Probably we didn't update our DAS reference tracks, that's why. :wink:
We should make sure the DAS tracks we deliver carry the correct metadata, and also that we use a correct reference track. Here should be a GRCh38 version: http://www.biodalliance.org/human38.html
Also see here: http://www.ensembl.org/das/dsn
So currently there's a minor confusion (again) with different SNP versions - we get each SNP's initial position from the genotyping file we parse, so the "old" genotypings were GRCh37 and the "new" genotypings are GRCh38; 23andMe switched over at some point.
At some point I did some work to lift over SNP positions to 38. Should we display for each SNP which human genome release its position comes from? We'd need one more field per SNP in the database, and of course the initial parsing would be a bit awkward: either get the version from a gigantic file of all SNPs, or find a nice API that returns SNP positions for an rsID (Entrez?).
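If we go the Entrez route, NCBI's E-utilities have an esummary endpoint that accepts `db=snp` with the numeric part of the rsID. Here's a hedged sketch of just the request-building part (batched, with non-rsIDs filtered out since they have no dbSNP entry); the response layout should be double-checked against the E-utilities docs before we rely on it:

```python
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def esummary_url(rsids):
    """Build a batched esummary query from ids like 'rs9939609'."""
    # dbSNP wants the bare numbers; skip internal ids like "i5009342"
    ids = ",".join(r.lstrip("rs") for r in rsids if r.startswith("rs"))
    return f"{BASE}?db=snp&retmode=json&id={ids}"

url = esummary_url(["rs9939609", "rs7412", "i5009342"])
```

Batching matters here because E-utilities requests are rate-limited, so one-request-per-SNP over ~1,000,000 SNPs would be painful.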