ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

Support "by Id" for `entrez_link` #51

Closed momeara closed 9 years ago

momeara commented 9 years ago

Thanks for the nice package--

I would like to look up all protein_ids for each gene_id in a set. However, when I do

entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"))$links$gene_protein

I get

[1] "768043930" "767953815" "558472750" "194394158" "166221824" "154936864" "148697547" "148697546" "119602646" "119602645" "119602644" "119602643" "119602642" "81899807"  "74215266"  "74186774"  "37787317"  "37787309"  "37787307"  "37787305"  "37589273"  "33991172" "31982089"  "26339824"  "26329351"  "21619615"  "10834676"

and I'm not sure which protein_id corresponds with which input gene_id.

Looking at the E-utils documentation, it appears that supplying multiple identifiers in a single id field in the URL groups all the returned links together into a single batch. To get separate links, in what they call "by Id" mode, separate id fields can be supplied in the URL (http://www.ncbi.nlm.nih.gov/books/NBK25500/#_chapter1_Finding_Related_Data_Through_En_). As far as I can tell, the WebDev interface has similar restrictions/capabilities.
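
To make the difference between the two request styles concrete, here is a minimal sketch that only builds the two ELink URLs and sends nothing. The variable names are illustrative, and the query strings are assembled by hand rather than taken from rentrez internals.

```r
# Base address of the ELink utility.
base_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
ids <- c("93100", "223646")
common <- "dbfrom=gene&db=protein"

# Batch mode: one id field holding a comma-separated list.
batch_url <- paste0(base_url, "?", common, "&id=", paste(ids, collapse = ","))

# "By Id" mode: one id field per identifier.
by_id_url <- paste0(base_url, "?", common, paste0("&id=", ids, collapse = ""))

cat(batch_url, by_id_url, sep = "\n")
```

The first URL ends in `id=93100,223646` and the second in `id=93100&id=223646`; only the second form asks for one link set per input identifier.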

Would it be possible to support this "by Id" mode with the entrez_link function? The interface could perhaps be an additional input argument by_id and the resulting output would be a list of elink lists, one for each input identifier.
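
In the meantime, a possible workaround is to issue one request per identifier and collect the results in a named list. This is only a sketch: `link_each` is a hypothetical helper, not part of rentrez, and it takes the lookup function as an argument so it can be tried without network access.

```r
# Hypothetical helper: call a link function once per ID and name each
# result after the ID that produced it.
link_each <- function(ids, link_fun, ...) {
  res <- lapply(ids, function(one_id) link_fun(id = one_id, ...))
  names(res) <- ids
  res
}

# Intended use with rentrez (needs network access, one request per ID):
# per_gene <- link_each(c("93100", "223646"), rentrez::entrez_link,
#                       dbfrom = "gene", db = "protein")
# per_gene[["93100"]]$links$gene_protein
```

This sends as many requests as there are IDs, so it is slower than a single "by Id" request and more likely to run into NCBI's rate limits for long ID vectors.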

dwinter commented 9 years ago

Hi @momeara , thanks for the tip about this mode, which is new to me.

Looks like this should be doable in rentrez. I've made a start on the branch referenced in the above commit. As you can see, the XML returned is different when by_id is set: each input ID gets its own LinkSet instead of all IDs sharing one.

rec_old <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"))
rec_old$file
<eLinkResult>
  <LinkSet>
    <DbFrom>gene</DbFrom>
    <IdList>
      <Id>93100</Id>
      <Id>223646</Id>
    </IdList>
    <LinkSetDb>
      <DbTo>protein</DbTo>
      <LinkName>gene_protein</LinkName>
      <Link>
        <Id>768043930</Id>
      </Link>
      <Link>
        <Id>767953815</Id>
      </Link>
.
.
.
rec_new <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"), by_id=TRUE)
rec_new$file
<eLinkResult>
  <LinkSet>
    <DbFrom>gene</DbFrom>
    <IdList>
      <Id>93100</Id>
    </IdList>
    <LinkSetDb>
      <DbTo>protein</DbTo>
      <LinkName>gene_protein</LinkName>
      <Link>
        <Id>768043930</Id>
      </Link>
      <Link>
        <Id>767953815</Id>
      </Link>
.
.
.

It will take a little while to write a new parser for this form of the XML (or modify the old one). But this should be easy enough to include in the next release :smile:
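
As a rough idea of what parsing the by-id form involves, here is a sketch using the xml2 package (an assumption; it is not necessarily what rentrez uses). The XML string is a hand-trimmed illustration built from IDs shown in this thread, not a full NCBI response.

```r
library(xml2)

# Hand-trimmed sample of a by-id response: one LinkSet per input ID.
xml_txt <- '<eLinkResult>
  <LinkSet>
    <DbFrom>gene</DbFrom>
    <IdList><Id>93100</Id></IdList>
    <LinkSetDb>
      <DbTo>protein</DbTo>
      <LinkName>gene_protein</LinkName>
      <Link><Id>768043930</Id></Link>
      <Link><Id>767953815</Id></Link>
    </LinkSetDb>
  </LinkSet>
  <LinkSet>
    <DbFrom>gene</DbFrom>
    <IdList><Id>223646</Id></IdList>
    <LinkSetDb>
      <DbTo>protein</DbTo>
      <LinkName>gene_protein</LinkName>
      <Link><Id>148697547</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>'

doc <- read_xml(xml_txt)
link_sets <- xml_find_all(doc, "//LinkSet")

# One character vector of linked IDs per LinkSet.
links_by_id <- lapply(link_sets, function(ls) {
  xml_text(xml_find_all(ls, "./LinkSetDb[LinkName='gene_protein']/Link/Id"))
})

# Name each element after the source ID in that LinkSet's IdList.
names(links_by_id) <- vapply(
  link_sets,
  function(ls) xml_text(xml_find_first(ls, "./IdList/Id")),
  character(1)
)

str(links_by_id)
```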

momeara commented 9 years ago

Thanks for taking a look :+1:

dwinter commented 9 years ago

Hi @momeara , just checked in support for this. Here's the example I'm using in the vignette

all_links_sep  <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"), by_id=TRUE)
all_links_sep
List of 2 elink objects,each containing
  $links: IDs for linked records from NCBI
lapply(all_links_sep, function(x) x$links$gene_protein)
[[1]]
 [1] "768043930" "767953815" "558472750" "194394158" "166221824" "154936864"
 [7] "119602646" "119602645" "119602644" "119602643" "119602642" "37787309" 
[13] "37787307"  "37787305"  "33991172"  "21619615"  "10834676" 

[[2]]
 [1] "148697547" "148697546" "81899807"  "74215266"  "74186774"  "37787317" 
 [7] "37589273"  "31982089"  "26339824"  "26329351"
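
For downstream work, the per-ID list can be flattened into a two-column gene-to-protein table. The sketch below uses a small hand-copied subset of the IDs above and assumes the list elements come back in the same order as the input IDs, which the output suggests but which is worth checking.

```r
# Subset of the output above, named by input gene ID (order is assumed
# to match the order of the id argument).
links_by_gene <- list(
  "93100"  = c("768043930", "767953815", "558472750"),
  "223646" = c("148697547", "148697546", "81899807")
)

# Repeat each gene ID once per linked protein ID.
gene_protein <- data.frame(
  gene_id    = rep(names(links_by_gene), lengths(links_by_gene)),
  protein_id = unlist(links_by_gene, use.names = FALSE),
  stringsAsFactors = FALSE
)

print(gene_protein)
```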

So, basically as you suggested, a list of elink objects (with a special print function so you don't get a screen-full of them if you send a lot of IDs).
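
The idea of a compact print method for a list of results can be sketched with plain S3. The class name `elink_list_demo` and the constructor are made up for illustration and are not the names rentrez uses.

```r
# Hypothetical constructor: tag a plain list with an S3 class.
new_elink_list_demo <- function(x) {
  stopifnot(is.list(x))
  structure(x, class = c("elink_list_demo", "list"))
}

# Print a one-line summary instead of every element.
print.elink_list_demo <- function(x, ...) {
  cat("List of", length(x), "elink objects\n")
  invisible(x)
}

demo <- new_elink_list_demo(list(
  list(links = list(gene_protein = c("768043930", "767953815"))),
  list(links = list(gene_protein = c("148697547", "148697546")))
))

print(demo)

# The object is still a list, so lapply() works as usual.
lapply(demo, function(x) x$links$gene_protein)
```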

Thanks again for pointing this mode's behaviour out to me, and hope this helps.

momeara commented 9 years ago

This looks fantastic. Thanks for the rapid response!

On Sun, Jul 19, 2015 at 9:11 PM, David Winter notifications@github.com wrote:
