ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez
Other
194 stars 38 forks source link

Search and fetch are providing the wrong data? #185

Open maryemmaj opened 1 year ago

maryemmaj commented 1 year ago

I am trying to download sequence data from E. coli samples within the state of Washington - it's about 1283 sequences, which I know is a lot. The problem that I am running into is that entrez_search and/or entrez_fetch seem to be pulling the wrong data. For example, the following code does pull 1283 IDs, but when I use entrez_fetch on those IDs, the sequence data I get is from chickens and corn and things that are not E. coli:

search <- entrez_search(db = "biosample", term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]", retmax = 9999, use_history = T)

Similarly, I tried pulling the sequence from one sample manually as a test. When I search for the accession number SAMN30954130 on the NCBI website, I see metadata for an E. coli sample. When I use this code, I see metadata for a chicken:

search <- entrez_search(db = "biosample", term = "SAMN30954130[ACCN]", retmax = 9999, use_history = T) fetch_test <- entrez_fetch(db = "nucleotide", id = search$ids, rettype = "xml") fetch_list <- xmlToList(fetch_test)

allenbaron commented 1 year ago

When trying to get cross database data (i.e. sequences from nucleotide database corresponding to samples from biosample database) its necessary to use entrez_link(), which uses the IDs from the first database to get the corresponding IDs for records in the second. Identifiers differ by database.

library(rentrez)
library(XML)

search <- rentrez::entrez_search(
    db = "biosample",
    term = "SAMN30954130[ACCN]",
    retmax = 9999,
    use_history = TRUE
)
nuc_id <- rentrez::entrez_link(
    dbfrom = "biosample",
    web_history = search$web_history,
    db = "nucleotide"
)
fetch_test <- rentrez::entrez_fetch(
    db = "nucleotide",
    id = nuc_id$links$biosample_nuccore,
    rettype = "xml"
)
fetch_list <- XML::xmlToList(fetch_test)

Created on 2023-01-27 by the reprex package (v2.0.1)

## Result Note that this particular record has no sequence data associated with it. Additional steps may be needed to get it. ``` r fetch_list #> $GBSeq #> $GBSeq$GBSeq_locus #> [1] "ABIKJS010000000" #> #> $GBSeq$GBSeq_length #> [1] "203" #> #> $GBSeq$GBSeq_strandedness #> [1] "double" #> #> $GBSeq$GBSeq_moltype #> [1] "DNA" #> #> $GBSeq$GBSeq_topology #> [1] "linear" #> #> $GBSeq$GBSeq_division #> [1] "BCT" #> #> $GBSeq$`GBSeq_update-date` #> [1] "22-SEP-2022" #> #> $GBSeq$`GBSeq_create-date` #> [1] "22-SEP-2022" #> #> $GBSeq$GBSeq_definition #> [1] "Escherichia coli strain WSDA203, whole genome shotgun sequencing project" #> #> $GBSeq$`GBSeq_primary-accession` #> [1] "ABIKJS000000000" #> #> $GBSeq$`GBSeq_accession-version` #> [1] "ABIKJS000000000.1" #> #> $GBSeq$`GBSeq_other-seqids` #> $GBSeq$`GBSeq_other-seqids`$GBSeqid #> [1] "gb|ABIKJS000000000.1|ABIKJS010000000" #> #> $GBSeq$`GBSeq_other-seqids`$GBSeqid #> [1] "gi|2307876014" #> #> #> $GBSeq$GBSeq_project #> [1] "PRJNA613831" #> #> $GBSeq$GBSeq_keywords #> $GBSeq$GBSeq_keywords$GBKeyword #> [1] "WGS" #> #> $GBSeq$GBSeq_keywords$GBKeyword #> [1] "GMI" #> #> #> $GBSeq$GBSeq_source #> [1] "Escherichia coli" #> #> $GBSeq$GBSeq_organism #> [1] "Escherichia coli" #> #> $GBSeq$GBSeq_taxonomy #> [1] "Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia" #> #> $GBSeq$GBSeq_references #> $GBSeq$GBSeq_references$GBReference #> $GBSeq$GBSeq_references$GBReference$GBReference_reference #> [1] "1" #> #> $GBSeq$GBSeq_references$GBReference$GBReference_position #> [1] "1..203" #> #> $GBSeq$GBSeq_references$GBReference$GBReference_consortium #> [1] "GenomeTrakr network: Whole genome sequencing for foodborne pathogen traceback" #> #> $GBSeq$GBSeq_references$GBReference$GBReference_title #> [1] "Direct Submission" #> #> $GBSeq$GBSeq_references$GBReference$GBReference_journal #> [1] "Submitted (22-SEP-2022) Center for Food Safety and Applied Nutrition, US Food and Drug Administration, 5100 Paint Branch Pkwy, College Park, MD, USA" #> #> #> #> $GBSeq$GBSeq_comment #> [1] "The Escherichia coli whole genome shotgun (WGS) project has the project accession ABIKJS000000000. This version of the project (01) has the accession number ABIKJS010000000, and consists of sequences ABIKJS010000001-ABIKJS010000203.; The annotation was added by the assembly submitters using the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Information about stand-alone PGAP can be found here: https://github.com/ncbi/pgap/; This draft WGS assembly was generated by running SKESA to generate a de-novo assembly. The de-novo assembly was then concatenated with contigs generated using a guided assembler using antimicrobial resistance genes as baits to comprehensively catalog the set of resistance genes in the isolate. Note, some parts of the contigs derived from the guided assembler may overlap de-novo contigs, and other guided assembler contigs. De-novo contigs can be differentiated from guided assembler contigs by their names , which include either 'denovo' or 'guided'.; ##Genome-Assembly-Data-START## ; Assembly Date :: 22-SEP-2022 ; Assembly Method :: SKESA v. 2.2 ; Assembly Name :: PDT001428147.1 ; Long Assembly Name :: NCBI Pathogen Detection Assembly PDT001428147.1 ; Genome Coverage :: 82x ; Sequencing Technology :: ILLUMINA ; ##Genome-Assembly-Data-END##; ##Genome-Annotation-Data-START## ; Annotation Date :: 09/22/2022 13:02:18 ; Annotation Method :: Best-placed reference protein set; GeneMarkS-2+ ; Annotation Pipeline :: NCBI Prokaryotic Genome Annotation Pipeline (PGAP) ; Annotation Provider :: NCBI ; Features Annotated :: Gene; CDS; rRNA; tRNA; ncRNA; repeat_region ; Annotation Software revision :: 2021-01-11.build5132 ; Genes (total) :: 5,379 ; CDSs (total) :: 5,281 ; Genes (coding) :: 5,074 ; CDSs (with protein) :: 5,074 ; Genes (RNA) :: 98 ; rRNAs :: 1, 2, 1 (5S, 16S, 23S) ; complete rRNAs :: 1 (5S) ; partial rRNAs :: 2, 1 (16S, 23S) ; tRNAs :: 86 ; ncRNAs :: 8 ; Pseudo Genes (total) :: 207 ; CDSs (without protein) :: 207 ; Pseudo Genes (ambiguous residues) :: 0 of 207 ; Pseudo Genes (frameshifted) :: 93 of 207 ; Pseudo Genes (incomplete) :: 85 of 207 ; Pseudo Genes (internal stop) :: 62 of 207 ; Pseudo Genes (multiple problems) :: 26 of 207 ; CRISPR Arrays :: 1 ; ##Genome-Annotation-Data-END##" #> #> $GBSeq$`GBSeq_feature-table` #> $GBSeq$`GBSeq_feature-table`$GBFeature #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_key #> [1] "source" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_location #> [1] "1..203" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_intervals #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_intervals$GBInterval #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_intervals$GBInterval$GBInterval_from #> [1] "1" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_intervals$GBInterval$GBInterval_to #> [1] "203" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_intervals$GBInterval$GBInterval_accession #> [1] "ABIKJS000000000.1" #> #> #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_name #> [1] "organism" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_value #> [1] "Escherichia coli" #> #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_name #> [1] "mol_type" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_value #> [1] "genomic DNA" #> #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_name #> [1] "strain" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_value #> [1] "WSDA203" #> #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_name #> [1] "isolation_source" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_value #> [1] "retail raw whole milk goat" #> #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_name #> [1] "db_xref" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_value #> [1] "taxon:562" #> #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_name #> [1] "country" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_value #> [1] "USA: WA" #> #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_name #> [1] "collection_date" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_value #> [1] "2022-09-06" #> #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_name #> [1] "collected_by" #> #> $GBSeq$`GBSeq_feature-table`$GBFeature$GBFeature_quals$GBQualifier$GBQualifier_value #> [1] "Washington State Department of Agriculture" #> #> #> #> #> #> $GBSeq$`GBSeq_alt-seq` #> $GBSeq$`GBSeq_alt-seq`$GBAltSeqData #> $GBSeq$`GBSeq_alt-seq`$GBAltSeqData$GBAltSeqData_name #> [1] "WGS" #> #> $GBSeq$`GBSeq_alt-seq`$GBAltSeqData$GBAltSeqData_items #> $GBSeq$`GBSeq_alt-seq`$GBAltSeqData$GBAltSeqData_items$GBAltSeqItem #> $GBSeq$`GBSeq_alt-seq`$GBAltSeqData$GBAltSeqData_items$GBAltSeqItem$`GBAltSeqItem_first-accn` #> [1] "ABIKJS010000001" #> #> $GBSeq$`GBSeq_alt-seq`$GBAltSeqData$GBAltSeqData_items$GBAltSeqItem$`GBAltSeqItem_last-accn` #> [1] "ABIKJS010000203" #> #> #> #> #> #> $GBSeq$GBSeq_xrefs #> $GBSeq$GBSeq_xrefs$GBXref #> $GBSeq$GBSeq_xrefs$GBXref$GBXref_dbname #> [1] "BioProject" #> #> $GBSeq$GBSeq_xrefs$GBXref$GBXref_id #> [1] "PRJNA613831" #> #> #> $GBSeq$GBSeq_xrefs$GBXref #> $GBSeq$GBSeq_xrefs$GBXref$GBXref_dbname #> [1] "BioSample" #> #> $GBSeq$GBSeq_xrefs$GBXref$GBXref_id #> [1] "SAMN30954130" #> #> #> $GBSeq$GBSeq_xrefs$GBXref #> $GBSeq$GBSeq_xrefs$GBXref$GBXref_dbname #> [1] "Sequence Read Archive" #> #> $GBSeq$GBSeq_xrefs$GBXref$GBXref_id #> [1] "SRR21667880" ```

FYI, not all NCBI databases have links to one another. Checking what databases have links to biosample returns the following:

library(rentrez)
entrez_db_links("biosample")
#> Databases with linked records for database 'biosample'
#>  [1] assembly       biocollections bioproject     dbvar          gap           
#>  [6] gds            nuccore        omim           pubmed         snp           
#> [11] sra            taxonomy

Created on 2023-01-27 by the reprex package (v2.0.1)

Even though the 'nucleotide' database is not listed, 'nuccore' is, which is why this still works (see https://www.biostars.org/p/161430/ for more details).

maryemmaj commented 1 year ago

Thank you! How would I scale this up to get the data from all the sequences that I need? Using this code, I can perform the search and link, but I can't seem to perform entrez_fetch using a list of linked IDs because the list is too long.

`search <- rentrez::entrez_search( db = "biosample", term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]", retmax = 9999, use_history = TRUE )

nuc_id <- rentrez::entrez_link( dbfrom = "biosample", web_history = search$web_history, db = "nucleotide" )

request is too large

fetch_test <- rentrez::entrez_fetch( db = "nucleotide", id = nuc_id$links$biosample_nuccore, rettype = "xml" )

fetch_list <- XML::xmlToList(fetch_test)`

After some searching, I tried to change the link function to get a web_history and fetch that way, but this code provides an error (HTTP failure: 400):

`search <- rentrez::entrez_search( db = "biosample", term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]", retmax = 9999, use_history = TRUE )

this seems to be working now

nuc_id <- rentrez::entrez_link( dbfrom = "biosample", web_history = search$web_history, db = "nucleotide", cmd = "neighbor_history" )

request is too large

fetch_test <- rentrez::entrez_fetch( db = "nucleotide", id = nuc_id$web_histories, rettype = "xml" )

fetch_list <- XML::xmlToList(fetch_test)`

allenbaron commented 1 year ago

rentrez does have a bug with the post method (see my comment in PR #163) but I don't think that should affect you if you're only using the web_history system.

It may be an issue with the number of records you're requesting at a time, see issue #178 for possible help.