Open maryemmaj opened 1 year ago
When trying to get cross-database data (e.g. sequences from the nucleotide database corresponding to samples in the biosample database), it's necessary to use entrez_link(), which uses the IDs from the first database to get the corresponding IDs for records in the second. Identifiers are database-specific, so a biosample ID passed directly to the nucleotide database retrieves an unrelated record.
library(rentrez)
library(XML)
search <- rentrez::entrez_search(
  db = "biosample",
  term = "SAMN30954130[ACCN]",
  retmax = 9999,
  use_history = TRUE
)
nuc_id <- rentrez::entrez_link(
  dbfrom = "biosample",
  web_history = search$web_history,
  db = "nucleotide"
)
fetch_test <- rentrez::entrez_fetch(
  db = "nucleotide",
  id = nuc_id$links$biosample_nuccore,
  rettype = "xml"
)
fetch_list <- XML::xmlToList(fetch_test)
Created on 2023-01-27 by the reprex package (v2.0.1)
FYI, not all NCBI databases have links to one another. Checking what databases have links to biosample returns the following:
library(rentrez)
entrez_db_links("biosample")
#> Databases with linked records for database 'biosample'
#> [1] assembly biocollections bioproject dbvar gap
#> [6] gds nuccore omim pubmed snp
#> [11] sra taxonomy
Created on 2023-01-27 by the reprex package (v2.0.1)
Even though the 'nucleotide' database is not listed, 'nuccore' is, which is why this still works (see https://www.biostars.org/p/161430/ for more details).
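The naming can be checked on the link result itself: the element that holds the nucleotide UIDs is named after nuccore, not nucleotide. Below is a minimal sketch, assuming the result carries a `links` list as in the reprex above. `pick_link` is a hypothetical helper written for illustration and is not part of rentrez.

```r
# Hypothetical helper (not part of rentrez): return the UIDs stored under
# the "<dbfrom>_<target>" element of an entrez_link() result.
pick_link <- function(link_result, target = "nuccore") {
  link_names <- names(link_result$links)
  wanted <- link_names[endsWith(link_names, paste0("_", target))]
  if (length(wanted) == 0) {
    stop("No link set ending in '_", target, "'; available: ",
         paste(link_names, collapse = ", "))
  }
  link_result$links[[wanted[1]]]
}
```

With the objects from the reprex, `pick_link(nuc_id)` should return the same vector as `nuc_id$links$biosample_nuccore`, and the error message lists the link names that are actually present when the expected one is missing.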
Thank you! How would I scale this up to get the data from all the sequences that I need? Using this code, I can perform the search and link, but I can't seem to perform entrez_fetch using a list of linked IDs because the list is too long.
`search <- rentrez::entrez_search( db = "biosample", term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]", retmax = 9999, use_history = TRUE )
nuc_id <- rentrez::entrez_link( dbfrom = "biosample", web_history = search$web_history, db = "nucleotide" )
fetch_test <- rentrez::entrez_fetch( db = "nucleotide", id = nuc_id$links$biosample_nuccore, rettype = "xml" )
fetch_list <- XML::xmlToList(fetch_test)`
After some searching, I tried to change the link function to get a web_history and fetch that way, but this code provides an error (HTTP failure: 400):
`search <- rentrez::entrez_search( db = "biosample", term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]", retmax = 9999, use_history = TRUE )
nuc_id <- rentrez::entrez_link( dbfrom = "biosample", web_history = search$web_history, db = "nucleotide", cmd = "neighbor_history" )
fetch_test <- rentrez::entrez_fetch( db = "nucleotide", id = nuc_id$web_histories, rettype = "xml" )
fetch_list <- XML::xmlToList(fetch_test)`
rentrez does have a bug with the post method (see my comment in PR #163), but I don't think that should affect you if you're only using the web_history system.
It may be an issue with the number of records you're requesting at a time; see issue #178 for possible help.
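One way to keep each request small is to split the linked IDs into batches and fetch them one batch at a time. The sketch below assumes the IDs come from `nuc_id$links$biosample_nuccore` as in the earlier code. `split_ids` and `fetch_in_batches` are hypothetical helpers, and the batch size of 200 and the half-second pause are arbitrary choices, not values recommended by rentrez or NCBI.

```r
# Hypothetical helper: split a vector of UIDs into chunks of at most
# batch_size elements.
split_ids <- function(ids, batch_size = 200) {
  split(ids, ceiling(seq_along(ids) / batch_size))
}

# Hypothetical helper: fetch each chunk separately and return a list with
# one raw record string per chunk. With rettype = "xml", every element is
# its own XML document and has to be parsed separately.
fetch_in_batches <- function(ids, db = "nucleotide", rettype = "xml",
                             batch_size = 200, pause = 0.5) {
  lapply(split_ids(ids, batch_size), function(chunk) {
    Sys.sleep(pause)  # extra delay between requests (assumed value)
    rentrez::entrez_fetch(db = db, id = chunk, rettype = rettype)
  })
}

# Usage (requires network access, so it is left commented out):
# chunks <- fetch_in_batches(nuc_id$links$biosample_nuccore)
# fetch_lists <- lapply(chunks, XML::xmlToList)
```

If the web history route is preferred, note that `nuc_id$web_histories` is a list of history objects, one per link name, and that entrez_fetch() takes a history through its `web_history` argument rather than `id`. Passing `web_history = nuc_id$web_histories$biosample_nuccore`, optionally with `retstart` and `retmax` to page through the results, may avoid the HTTP 400 error, though that has not been verified against this particular query.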
I am trying to download sequence data from E. coli samples within the state of Washington - it's about 1283 sequences, which I know is a lot. The problem that I am running into is that entrez_search and/or entrez_fetch seem to be pulling the wrong data. For example, the following code does pull 1283 IDs, but when I use entrez_fetch on those IDs, the sequence data I get is from chickens and corn and things that are not E. coli:
search <- entrez_search(db = "biosample", term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]", retmax = 9999, use_history = T)
Similarly, I tried pulling the sequence from one sample manually as a test. When I search for the accession number SAMN30954130 on the NCBI website, I see metadata for an E. coli sample. When I use this code, I see metadata for a chicken:
search <- entrez_search(db = "biosample", term = "SAMN30954130[ACCN]", retmax = 9999, use_history = T)
fetch_test <- entrez_fetch(db = "nucleotide", id = search$ids, rettype = "xml")
fetch_list <- xmlToList(fetch_test)