ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

entrez_link randomly doesn't return links #123

Closed willgearty closed 6 years ago

willgearty commented 6 years ago

I'm running the following code:

nerit_coi <- entrez_search(db="nuccore", term="COI[Gene] AND Neritimorpha[ORGN]", retmax=5000, use_history=TRUE)
temp1 <- entrez_link(dbfrom = "nuccore", db = "taxonomy", id = nerit_coi$ids[1:200], by_id = TRUE)
temp2 <- unlist(lapply(temp1, function(x) if(exists("nuccore_taxonomy", where = x$links)) x$links$nuccore_taxonomy else NA))

temp2 should consist of the taxonomic IDs returned by entrez_link. However, after running this multiple times, it appears that entrez_link non-deterministically fails to return some links, resulting in NAs in temp2.
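As a minimal sketch of how those NAs arise, the extraction step can be reproduced without any network access. The `mock_links` list below is a hypothetical stand-in for two elements of the `entrez_link(by_id = TRUE)` result, and the taxonomy ID "12345" is made up for illustration. Only the `x$links$nuccore_taxonomy` path comes from the code above. The helper `get_taxid` is likewise an assumption, and it keeps only the first linked ID so that each input record yields exactly one value.

```r
# Hypothetical stand-in for two elements of an entrez_link(by_id = TRUE) result:
# the first carries a nuccore_taxonomy link, the second has none.
mock_links <- list(
  list(links = list(nuccore_taxonomy = "12345")),  # made-up taxonomy ID
  list(links = list())
)

# Return the first taxonomy ID of an element, or NA when the link is absent.
get_taxid <- function(x) {
  ids <- x$links$nuccore_taxonomy
  if (is.null(ids)) NA_character_ else ids[1]
}

# vapply guarantees one character value per input record.
tax_ids <- vapply(mock_links, get_taxid, character(1))

# Positions whose link was not returned.
which(is.na(tax_ids))
```

Using `vapply` rather than `unlist(lapply(...))` keeps the result aligned with the input IDs even if a record were to return more than one link.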

dwinter commented 6 years ago

Hi @willgearty ,

Thanks for filing this issue. These phantom errors are obviously pretty hard to troubleshoot. I can't reproduce this here, but it would be helpful to get a little more information on the records that are missing the links.

If you can get an example with missing links, can you run the following and let me know what you get? (It's not tested, so it may need tweaks; the idea is to get the file attribute for each list entry that lacks the taxonomy links.)

missing <- which(sapply(temp1, function(x) is.null(x$links$nuccore_taxonomy)))
sapply(temp1[missing], "[[", "file")

If the file attribute doesn't include the nuccore_taxonomy link, then this is something we'll need to report to the NCBI.

willgearty commented 6 years ago

Thanks for the quick reply @dwinter! I went ahead and ran my code followed by your code. Here are the results:

[[1]]
<LinkSet>
  <DbFrom>nuccore</DbFrom>
  <IdList>
    <Id>1360488304</Id>
  </IdList>
</LinkSet> 

[[2]]
<LinkSet>
  <DbFrom>nuccore</DbFrom>
  <IdList>
    <Id>1360488286</Id>
  </IdList>
</LinkSet> 

[[3]]
<LinkSet>
  <DbFrom>nuccore</DbFrom>
  <IdList>
    <Id>1142915454</Id>
  </IdList>
</LinkSet>

dwinter commented 6 years ago

OK, so it's an issue on the NCBI side. I have sent an email to their helpdesk and copied you in (so hopefully you will get the reply). Until then we wait, I guess...

willgearty commented 6 years ago

I decided to just run this a whole bunch of times, and while it definitely doesn't have the same result every time, there does seem to be repetition in the ones that it errors on.

For example, running this:

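missing <- integer(0)  # start empty so positions accumulate across runs
for (i in seq_len(100)) {
    temp1 <- entrez_link(dbfrom = "nuccore", db = "taxonomy", id = nerit_coi$ids[1:200], by_id = TRUE)
    # record the positions that came back without a taxonomy link
    missing <- c(missing, which(sapply(temp1, function(x) is.null(x$links$nuccore_taxonomy))))
}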

Resulted in this for missing:

[1]  46  46   9 171  18  45  46 171  45 171  18 171 171   9  18  46  18  45  45  45  18   9  46   9  46
[26]   9 101  46   9  46   9  18   9  18  46  46 101 171  18 101  45  46  46   9  18  46  45  46   9  46
[51]  46 171  46 171  46 171   9 171  45  46  18   9 171  45   9  45 171   9 171  18   9  45  46  46 171
[76] 171  46 171   9

dwinter commented 6 years ago

Thanks for the extra info @willgearty,

For anyone from NCBI reading this thread: ID number 45, which appears to frequently produce this error, is 1360488232.
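To see which positions fail most often, the `missing` output above can be tallied with `table()`. This is only a sketch: the vector below is copied by hand from the printed output, and `missing_idx` and `failure_counts` are names introduced here for illustration.

```r
# Positions reported as missing across the repeated runs,
# copied from the printed output earlier in this thread.
missing_idx <- c(
  46, 46, 9, 171, 18, 45, 46, 171, 45, 171, 18, 171, 171, 9, 18, 46, 18, 45, 45, 45, 18, 9, 46, 9, 46,
  9, 101, 46, 9, 46, 9, 18, 9, 18, 46, 46, 101, 171, 18, 101, 45, 46, 46, 9, 18, 46, 45, 46, 9, 46,
  46, 171, 46, 171, 46, 171, 9, 171, 45, 46, 18, 9, 171, 45, 9, 45, 171, 9, 171, 18, 9, 45, 46, 46, 171,
  171, 46, 171, 9
)

# Count how often each position failed, most frequent first.
failure_counts <- sort(table(missing_idx), decreasing = TRUE)
failure_counts

# With the search result still in the session, the positions could be mapped
# back to nuccore IDs (not run here, since it needs the nerit_coi object):
# nerit_coi$ids[as.integer(names(failure_counts))]
```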

I can get the problem we are talking about (a missing <LinkSetDb> for this ID) to occur just by refreshing the following URL a few times (this ID is the last one):

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=taxonomy&id=1360488318&id=1360488316&id=1360488314&id=1360488232
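For anyone wanting to repeat this check without refreshing a browser, here is a rough base-R sketch. The helper names `elink_url` and `count_missing_linksets` are made up for this example and are not part of rentrez. The second function uses a simple regular expression rather than a real XML parser, which assumes the response has the flat `<LinkSet>` layout shown earlier in this thread.

```r
# Build an elink URL with one id= parameter per record, so that a separate
# <LinkSet> comes back for each ID (hypothetical helper, not a rentrez function).
elink_url <- function(ids, dbfrom, db) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
  paste0(base, "?dbfrom=", dbfrom, "&db=", db, paste0("&id=", ids, collapse = ""))
}

# Count the <LinkSet> blocks in a response that contain no <LinkSetDb> element.
# Regex-based, so it assumes <LinkSet> blocks are not nested.
count_missing_linksets <- function(xml_text) {
  xml_text <- paste(xml_text, collapse = "\n")
  blocks <- regmatches(
    xml_text,
    gregexpr("(?s)<LinkSet>.*?</LinkSet>", xml_text, perl = TRUE)
  )[[1]]
  sum(!grepl("<LinkSetDb>", blocks, fixed = TRUE))
}

url <- elink_url(
  c("1360488318", "1360488316", "1360488314", "1360488232"),
  dbfrom = "nuccore", db = "taxonomy"
)

# Requires network access; uncomment to poll the live service a few times:
# replicate(10, count_missing_linksets(readLines(url, warn = FALSE)))
```

A non-zero count on any request would correspond to the missing `<LinkSetDb>` described above.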

willgearty commented 6 years ago

Fixed by NCBI.