ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

Getting gene id from taxon name #141

Closed sharifX closed 3 years ago

sharifX commented 4 years ago

Hello, I am new to R and also to the NCBI databases. I am trying out some simple searches to understand the package, and I started with a search based on a name.

> taxon_search = entrez_search(db="taxonomy", term="Spirometra erinaceieuropaei")
> taxon_search$ids
[1] "99802"

So far so good. Now I want to use this taxon id to see all the links in the database. I am not sure if this is the right query. Why do I get only one link?

> all_the_links <- entrez_link(dbfrom='gene', id=99802, db='all')
> all_the_links$links
elink result with information from 1 databases:
[1] gene_gene_h3k4me3

If I start with a gene id, I get more links:

> all_the_links <- entrez_link(dbfrom='gene', id=6446572, db='all')
> all_the_links$links
elink result with information from 17 databases:
 [1] gene_genome                gene_bioproject           
 [3] gene_cdd                   gene_gene_h3k4me3         
 [5] gene_gene_neighbors        gene_genome2              
 [7] gene_nuccore               gene_nuccore_pos          
 [9] gene_nucleotide            gene_nucleotide_pos       
[11] gene_pmc_nucleotide        gene_protein              
[13] gene_protein_refseq        gene_proteinclusters      
[15] gene_pubmed_pmc_nucleotide gene_sparcle              
[17] gene_taxonomy  

And from there I can find the taxon id.

> all_the_links$links$gene_taxonomy
[1] "99802"

How do I find a link to 6446572 from the taxon search?

Thanks. --sharif

sharifX commented 4 years ago

OK. I think I figured it out. It seems I need dbfrom='taxonomy':

> all_the_links <- entrez_link(dbfrom='taxonomy', id=99802, db='all')
> all_the_links$links$taxonomy_gene
 [1] "6446594" "6446593" "6446592" "6446591" "6446590" "6446589" "6446588"
 [8] "6446587" "6446586" "6446585" "6446584" "6446583" "6446582" "6446581"
[15] "6446580" "6446579" "6446578" "6446577" "6446576" "6446575" "6446574"
[22] "6446573" "6446572" "6446571" "6446570" "6446569" "6446568" "6446567"
[29] "6446566" "6446565" "6446564" "6446563" "6446562" "6446561" "6446560"
[36] "6446559"

What is the difference between dbfrom='taxonomy' and dbfrom='gene'?

--sharif

dwinter commented 3 years ago

Hi @sharifX, sorry for letting this issue sit idle for so long, and well done for working out the solution yourself.

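In entrez_link, dbfrom is the database the supplied IDs come from, and db is the database to search for linked records. So if you set dbfrom to "taxonomy", NCBI treats the IDs as taxonomy IDs. In your first query, dbfrom='gene' meant that 99802 was looked up as a gene ID rather than as the taxon ID, which is why only one link came back.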
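
To put the two steps from this thread together, here is a minimal sketch that goes from the taxon name to the linked gene IDs. It uses only the calls already shown above (entrez_search and entrez_link). The variable names are just for illustration, and it assumes that a link set named taxonomy_gene is returned when db is restricted to "gene", as it was with db='all' earlier in the thread. It needs network access to NCBI, and the IDs returned may differ from those listed above if the database has changed.

```r
library(rentrez)

# Step 1: look up the taxonomy ID for the species name.
taxon_search <- entrez_search(db = "taxonomy",
                              term = "Spirometra erinaceieuropaei")
taxon_id <- taxon_search$ids

# Step 2: follow links from the taxonomy record to the gene database.
# dbfrom matches the database the ID came from (taxonomy);
# db is the database in which we want linked records (gene).
gene_links <- entrez_link(dbfrom = "taxonomy", id = taxon_id, db = "gene")

# Assumption: the link set is named taxonomy_gene, as in the db='all' output.
gene_ids <- gene_links$links$taxonomy_gene

# Check whether the gene ID from the original question is among the results.
"6446572" %in% gene_ids
```

Restricting db to "gene" instead of "all" keeps the result to the gene links only, which is all that is needed here.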