ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

entrez_search usage #55

Closed gadepallivs closed 9 years ago

gadepallivs commented 9 years ago

Hi David Winter, thank you for the amazing rentrez package. 1) How do I pass filters to the entrez_search function? For instance, when we search PubMed online for BRAF AND LUNG, we get a list of summary hits. On the left-hand side of the page we can select article types, text availability, year of publication, species and so on, and selecting one of these filters the summary output further. How do I input these parameters in the entrez_search function?

2) When we extract citation information for a particular article (e.g. "Cited by 48 PubMed Central articles"), why is the count different from the Google Scholar citation count?

3) How do we extract the links for the articles we fetched from PubMed using rentrez? For example, I am trying to make a data frame of the fetched PubMed abstracts, and I would like to include a hyperlink in one column so that the user can click on it to open the full text or the abstract.

4) Finally, is there a way to extract the impact factor of a journal from NCBI? I am not sure whether NCBI lists impact factors at all. If you are aware of any other packages that help to do so, please let me know. I am trying to create a table with the NCBI abstract info, the impact factor of the journal and the link to the journal full text.

Thank you very much

dwinter commented 9 years ago

Hi @Monty9, thanks for your interest in rentrez and your questions. Let's take them one at a time.

  1. I don't think there is a way to filter the results of entrez_search that is equivalent to the interactive filtering online. Instead, you can either (a) build new queries that narrow down your results or (b) fetch entrez_summary records for every match and filter the results based on your criteria. So, for instance, you could make your search
big_search <- entrez_search(db="pubmed", term="BRAF AND LUNG")
big_search
Entrez search result with 545 hits (object contains 20 IDs and no web_history object)
 Search term (as translated):  BRAF[All Fields] AND ("lung"[MeSH Terms] OR "lung" ... 

That seems like too many, so let's get only the recent ones (the fields denoted by [ ] are described in the new vignette and documented here).

recent_search <- entrez_search(db="pubmed", term="(BRAF) AND (LUNG) AND 2014:2015[PDAT]")
recent_search
Entrez search result with 250 hits (object contains 20 IDs and no web_history object)
 Search term (as translated):  BRAF[All Fields] AND ("lung"[MeSH Terms] OR "lung" ..

If you'd rather get them all and filter, you could instead fetch summaries for each (I'm just using 20 records; use retmax= or use_history to work with all the hits). The code below uses the development version of rentrez (installed with devtools::install_github('ropensci/rentrez')):

braf_summs <- entrez_summary(db="pubmed", id=recent_search$ids)
paper_types <- extract_from_esummary(braf_summs, "pubtype")

This will give you a list of the "paper type" field. You could use this to find only the records that are reviews:

which(sapply(paper_types, function(x) "Review" %in% x))
26288737 
       3 
  2. PubMed only counts citations from papers that are included in PubMed Central, which is limited to papers in (very broadly) biomedical fields that are open access. There are lots of closed-access papers, and papers published in other journals, that won't be included in the PMC citation count but will get sucked up by Google Scholar.
  3. You get to use one of the new things I've been working on :smile:. In the development version entrez_link can give you this information. Working on this made it clear that what we have at the moment is a bit messy, but here's a workaround that should do the trick. prlinks gives us the primary link for each record, and extract_url is just a function to get them cleanly from the resulting list:
extract_url <- function(linkout){
  if (length(linkout) == 0){
    return(NA)
  }
  linkout[[1]][["Url"]]
}

links <- entrez_link(dbfrom="pubmed", id = recent_search$ids, cmd="prlinks")
urls <- sapply(links$linkouts, extract_url)
urls[3]
                                          ID_26288737 
"http://www.immunotherapyofcancer.org/content/3/1/36"

Hope that's some help to you, and feel free to ask more questions if you have them
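
For the data frame with a clickable column asked about in question 3, here is a minimal sketch. It assumes the urls vector has already been collected as above (it is hard-coded here so the example runs offline) and that every record has an abstract page at pubmed.ncbi.nlm.nih.gov/<id>/, which is an assumption about the current PubMed address scheme rather than something rentrez returns. The IDs are taken from this thread.

```r
# PubMed IDs taken from this thread; in practice use recent_search$ids
ids <- c("26327059", "26326800", "26288737")

# Publisher links as extract_url() would return them (NA where none is known).
# Hard-coded here so the sketch runs without a network connection.
urls <- c(NA, NA, "http://www.immunotherapyofcancer.org/content/3/1/36")

# Assumption: each record has an abstract page at this address
pubmed_url <- paste0("https://pubmed.ncbi.nlm.nih.gov/", ids, "/")

# Prefer the publisher link and fall back to the PubMed abstract page
best_url <- ifelse(is.na(urls), pubmed_url, urls)

papers <- data.frame(
  id   = ids,
  url  = best_url,
  # HTML anchor, only useful if the table is later rendered as HTML
  link = sprintf('<a href="%s">%s</a>', best_url, ids),
  stringsAsFactors = FALSE
)
papers
```

The link column only becomes clickable if whatever renders the table leaves the HTML unescaped; the plain url column is there for everything else.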

gadepallivs commented 9 years ago

Hi David,

Thank you very much for the quick response. It really helped. :+1: I tried the following and it gives me exactly the results that the online search does. The key is knowing the abbreviated field names that go within [ ].

recent_search <- entrez_search(db = "pubmed", term = "BRAF, lung, Clinical Trial[ptyp], 2000:2015[PDAT], Human[MeSH]")

Entrez search result with 16 hits (object contains 16 IDs and no web_history object)
 Search term (as translated):  (BRAF[All Fields] AND ("lung"[MeSH Terms] OR "lung ...

You missed my fourth question. Is there a way in R, with this package or some other package, to retrieve the impact factor of the journals?

Thank you

dwinter commented 9 years ago

Hi @Monty9 , happy to help.

In the dev. version you can get the list of fields for a given database with entrez_db_searchable("pubmed")

Sadly there is no way to fetch impact factors, since the complete set of impact factors (the Journal Citation Reports) is proprietary information and locked up.

gadepallivs commented 9 years ago

Hi david,

I just realized that the field parameter does not take multiple inputs. Code:

recent_search <- entrez_search(db = "pubmed", term = "BRAF, lung, Clinical Trial[ptyp], 2000:2015[PDAT], Human[MeSH]", field = c("TIAB", "TITL", "MeSH Terms"))

Output:

[1] "(BRAF[All Fields] AND (\"lung\"[MeSH Terms] OR \"lung\"[All Fields])) OR (BRAF[All Fields] AND (\"lung\"[MeSH Terms] OR \"lung\"[All Fields])) OR (BRAF[All Fields] AND (\"lung\"[MeSH Terms] OR \"lung\"[All Fields]))"

If I give a single input (title), "TITL", it searches for the input terms in this field. But when I pass additional search fields, it falls back to "All Fields". Is there a workaround to pass several fields? I want to narrow the search as much as I can.

pubmed_search <- entrez_search(db = "pubmed", term = query, field = c("Title"), retmax = 20)
[1] "BRAF[Title] AND lung[Title]"

dwinter commented 9 years ago

Hi @Monty9,

As you've found, the field argument applies to the whole query. The way to use fields with AND/OR is to tag each term with square brackets. I think this is what you're going for in this case:

q <- 'BRAF[TIAB] AND LUNG[TIAB] AND 2000:2015[PDAT] AND Clinical Trial[PTYP] AND Human[MeSH]'
recent_search <- entrez_search(db="pubmed", term=q)
recent_search$QueryTranslation
[1] "BRAF[TIAB] AND LUNG[TIAB] AND 2000[PDAT] : 2015[PDAT] AND Clinical Trial[PTYP] AND \"humans\"[MeSH Terms]"

For complex queries like this you can use the pubmed query builder (which tends to use complete field names and sprinkle parentheses around...)
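
Since field applies to the whole query, one way to avoid typing the brackets by hand is to tag the terms in R before joining them. This is only a sketch: tag_terms is a hypothetical helper, not part of rentrez, and it does nothing more than paste strings together.

```r
# Hypothetical helper (not part of rentrez): append one field tag to each term
tag_terms <- function(terms, field) {
  paste0(terms, "[", field, "]")
}

# Terms that should be searched in title/abstract
title_abstract <- tag_terms(c("BRAF", "LUNG"), "TIAB")

# Filters that already carry their own field tag
filters <- c("2000:2015[PDAT]", "Clinical Trial[PTYP]", "Human[MeSH]")

# Join everything with AND to get a single search term
query <- paste(c(title_abstract, filters), collapse = " AND ")
query

# The result can then be passed on as usual:
# entrez_search(db = "pubmed", term = query)
```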

gadepallivs commented 9 years ago

Thank you David. Is there a way to use the pubmed query builder from R ?

dwinter commented 9 years ago

Not that I know of. It's not a programmatic interface, and I'm guessing that trying to use rvest or something similar to set all the fields would be more trouble than it's worth.

It would be neat to have an R equivalent, something that could return a new query from a list of sub-queries and operators, but that's a big project and I doubt I'll get to it any time soon.
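
To give a rough idea of what "a new query from a list of sub-queries and operators" could look like, here is a very small sketch. The functions combine, q_and and q_or are hypothetical names invented for this example and are not part of rentrez; they only build the query string and know nothing about which fields a database supports.

```r
# Hypothetical helpers, not part of rentrez.
# Join sub-queries with one operator and wrap the result in parentheses
# so that it can be nested safely inside a larger query.
combine <- function(op, ...) {
  parts <- c(...)
  if (length(parts) == 1) {
    return(parts)
  }
  paste0("(", paste(parts, collapse = paste0(" ", op, " ")), ")")
}

q_and <- function(...) combine("AND", ...)
q_or  <- function(...) combine("OR", ...)

# Nested example: BRAF in title/abstract, lung as MeSH term or in title/abstract
lung  <- q_or("lung[MeSH Terms]", "lung[TIAB]")
query <- q_and("BRAF[TIAB]", lung, "2000:2015[PDAT]")
query
```

A real query builder would also need to validate field names (for example against entrez_db_searchable) and handle NOT, which this sketch leaves out.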

David Winter Postdoctoral Research Associate Center for Evolutionary Medicine and Informatics The Biodesign Institute Arizona State University

ph: +1 480 519 5113 w: www.david-winter.info lab: http://cartwrig.ht/lab/ blog: sciblogs.co.nz/the-atavism

gadepallivs commented 9 years ago

Hi David, is it possible to extract the ISSN of a journal with the rentrez package? I am trying to match the journal name against a data frame with journal names and impact factors. Unfortunately, the journal name is not consistently abbreviated, so the merge function results in NA. Hence, I thought the ISSN would probably be unique and so would help me match the exact journal and its impact factor. Thank you

dwinter commented 9 years ago

Hi Monty.

It is, but you will need to fetch an esummary record for each paper.

So, for instance, you could do

papers <- entrez_search(db="pubmed", term="cancer", retmax=10)
psumms  <- entrez_summary(db="pubmed", id=papers$ids)
extract_from_esummary(psumms, c("essn", "issn"))
     26327059    26326800    26326526    26325687    26325678    26325675
essn "1534-7362" "1534-7362" "1534-7362" "1932-6203" "1932-6203" "1932-6203"
issn ""          ""          ""          ""          ""          ""
     26325671    26325670    26325669    26325646
essn "1465-3931" "1465-3931" "1949-2553" ""
issn "0031-3025" "0031-3025" ""          "0363-6771"

I don't know what the difference between an essn and an issn is, but perhaps one or the other will match your tables?
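
Matching on either identifier could be sketched like this. The ISSNs are copied from three of the records in the output above; the journal_if table and its impact_factor values are made-up placeholders standing in for your own table. The sketch assumes essn holds the electronic ISSN and issn the print ISSN, and that your table may list either one.

```r
# ISSNs copied from three of the records in the output above
papers <- data.frame(
  id   = c("26325671", "26325669", "26325646"),
  essn = c("1465-3931", "1949-2553", ""),
  issn = c("0031-3025", "", "0363-6771"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: the impact_factor values are placeholders,
# not real figures
journal_if <- data.frame(
  issn_key      = c("0031-3025", "1949-2553"),
  impact_factor = c(1.0, 2.0),
  stringsAsFactors = FALSE
)

# Look up each identifier separately; empty strings simply fail to match
by_issn <- journal_if$impact_factor[match(papers$issn, journal_if$issn_key)]
by_essn <- journal_if$impact_factor[match(papers$essn, journal_if$issn_key)]

# Use the issn match when there is one, otherwise the essn match
papers$impact_factor <- ifelse(is.na(by_issn), by_essn, by_issn)
papers
```

Journals that appear under neither identifier end up with NA, which makes the unmatched records easy to spot.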

dwinter commented 9 years ago

Closing this for now @Monty9 -- but feel free to ask fresh questions!

gadepallivs commented 9 years ago

Thank you David. You gave me good leads as well as the exact solution I was looking for.