monarch-initiative / mckb

Monarch Cancer Knowledge Base

Add CCDS transcript to RefSeq protein accession and Uniprot accession #7

Closed kshefchek closed 9 years ago

kshefchek commented 9 years ago

We need a protein accession to link amino acid coordinates to their reference sequence. I will use the CCDS transcript ID and this mapping file from the CCDS FTP site:

ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/CCDS2UniProtKB.current.txt

We could make CCDS its own source class, or just add a function that converts this file to a Python dictionary.
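A minimal sketch of the dictionary option. It assumes the mapping file is tab-separated with the CCDS ID in the first column and the UniProtKB accession in the second, and that header lines start with `#`; that layout should be checked against the actual file. The function name `parse_ccds_to_uniprot` and the sample rows are hypothetical, not real records.

```python
from collections import defaultdict


def parse_ccds_to_uniprot(lines):
    """Build {ccds_id: [uniprot_accession, ...]} from mapping-file lines.

    Assumes two tab-separated columns (CCDS ID, UniProtKB accession)
    and '#'-prefixed header lines; verify against the real file.
    """
    mapping = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#'-prefixed header/comment lines
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # malformed row, ignore it
        ccds_id, uniprot_id = fields[0], fields[1]
        # One CCDS ID may map to several accessions; keep each once
        if uniprot_id not in mapping[ccds_id]:
            mapping[ccds_id].append(uniprot_id)
    return dict(mapping)


# Made-up rows for illustration only
sample = [
    "#ccds\tUniProtKB",
    "CCDS0001.1\tP00001",
    "CCDS0001.1\tP00002",
    "CCDS0002.1\tP00003",
]
ccds_map = parse_ccds_to_uniprot(sample)
```

For the real file, the same function could be fed an open file handle, since it only iterates over lines. Values are lists because one CCDS ID might map to more than one accession.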

kshefchek commented 9 years ago

Worth noting: the RefSeq URI in the Dipper YAML file only works for nucleotide sequences, not for protein accessions such as the one in www.ncbi.nlm.nih.gov/refseq/?term=NP_005148.2

I'll see if I can find a URI that works for both. Alternatively, we could link these to NCBI Protein. cc @nlwashington @bryanlaraway

kshefchek commented 9 years ago

It may be correct to link these to NCBI Protein regardless. When you search for proteins in RefSeq you are forwarded to the protein database, see http://www.ncbi.nlm.nih.gov/protein?term=srcdb_refseq[prop]

kshefchek commented 9 years ago

@nlwashington @bryanlaraway I can't seem to get UniProtKB URIs to resolve using the UniProtKB mapping in our YAML configuration, for example:

http://identifiers.org/UniProt:P43489

Alternatively, this works: http://identifiers.org/uniprot/P43489

Can the mapping be switched to this form, or will that break other resources?
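To illustrate the switch, here is a rough sketch of how a CURIE prefix map expands to the working URI form. `CURIE_MAP` and `expand_curie` are hypothetical names for illustration and are not Dipper's actual configuration or API; the only grounded value is the `http://identifiers.org/uniprot/` base from the working URL above.

```python
# Hypothetical prefix map; the real mapping lives in the YAML configuration
CURIE_MAP = {
    "UniProtKB": "http://identifiers.org/uniprot/",
}


def expand_curie(curie, curie_map=CURIE_MAP):
    """Expand 'Prefix:LocalId' to a full URI using the prefix map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in curie_map:
        raise ValueError("No mapping for CURIE: %s" % curie)
    # The base URI already ends with '/', so plain concatenation suffices
    return curie_map[prefix] + local_id


uri = expand_curie("UniProtKB:P43489")
```

Under this mapping the CURIE expands to the lowercase, slash-separated form that resolves, instead of the `UniProt:` form that does not.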

nlwashington commented 9 years ago

Please update to the correct CURIE mapping.