Python REST API for Entrez E-Utilities, aiming to be easy to use and reliable.
Easy-entrez:
from easy_entrez import EntrezAPI
entrez_api = EntrezAPI(
'your-tool-name',
'e@mail.com',
# optional
return_type='json'
)
# find up to 10 000 results for cancer in human
result = entrez_api.search('cancer AND human[organism]', max_results=10_000)
# data will be populated with JSON or XML (depending on the `return_type` value)
result.data
See more in the Demo notebook and documentation.
For a real-world example (i.e. used for this publication) see notebooks in multi-omics-state-of-the-field repository.
Fetch the SNP record for rs6311
:
rs6311 = entrez_api.fetch(['rs6311'], max_results=1, database='snp').data[0]
rs6311
Display the result:
from easy_entrez.parsing import xml_to_string
print(xml_to_string(rs6311))
Find the gene names for rs6311
:
namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
genes = [
name.text
for name in rs6311.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
print(genes)
['HTR2A']
Fetch data for multiple variants at once:
result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')
gene_names = {
'rs' + document_summary.get('uid'): [
element.text
for element in document_summary.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
for document_summary in result.data
}
print(gene_names)
{'rs6311': ['HTR2A'], 'rs662138': ['SLC22A1']}
from pandas import DataFrame
result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')
variant_positions = DataFrame([
{
'id': 'rs' + document_summary.get('uid'),
'chromosome': chromosome,
'position': position
}
for document_summary in result.data
for chrom_and_position in document_summary.findall('.//ns0:CHRPOS', namespaces)
for chromosome, position in [chrom_and_position.text.split(':')]
])
variant_positions
id chromosome position 0 rs6311 13 46897343 1 rs662138 6 160143444
Parsing utilities can quickly extract the data to a VariantSet
object
holding pandas DataFrame
s with coordinates and alternative alleles frequencies:
from easy_entrez.parsing import parse_dbsnp_variants
variants = parse_dbsnp_variants(result)
variants
<VariantSet with 2 variants>
To get the coordinates:
variants.coordinates
rs_id ref alts chrom pos chrom_prev pos_prev consequence rs6311 C A,T 13 46897343 13 47471478 upstream_transcript_variant,intron_variant,genic_upstream_transcript_variant rs662138 C G 6 160143444 6 160564476 intron_variant
For frequencies:
variants.alt_frequencies.head(5) # using head to only display first 5 for brevity
rs_id allele source_frequency total_count study count 0 rs6311 T 0.44349 2221 1000Genomes 984.991 1 rs6311 T 0.411261 1585 ALSPAC 651.849 2 rs6311 T 0.331696 1486 Estonian 492.9 3 rs6311 T 0.35 14 GENOME_DK 4.9 4 rs6311 T 0.402529 56309 GnomAD 22666
You can use the query string directly:
results = entrez_api.search(
'13[CHROMOSOME] AND human[ORGANISM] AND 31873085[POSITION]',
database='snp',
max_results=10
)
print(results.data['esearchresult']['idlist'])
['59296319', '17076752', '7336701', '4']
Or pass a dictionary (no validation of arguments is performed, AND
conjunction is used):
results = entrez_api.search(
dict(chromosome=13, organism='human', position=31873085),
database='snp',
max_results=10
)
print(results.data['esearchresult']['idlist'])
['59296319', '17076752', '7336701', '4']
The base position should use the latest genome assembly (GRCh38 at the time of writing);
you can use the position in previous assembly coordinates by replacing POSITION
with POSITION_GRCH37
.
For more information of the arguments accepted by the SNP database see the entrez help page on NCBI website.
First we search for dbSNP rs identifiers for variants in given region:
dbsnp_ids = (
entrez_api
.search(
'12[CHROMOSOME] AND human[ORGANISM] AND 21178600:21178720[POSITION]',
database='snp',
max_results=100
)
.data
['esearchresult']
['idlist']
)
Then fetch the variant data for identifiers:
variant_data = entrez_api.fetch(
['rs' + rs_id for rs_id in dbsnp_ids],
max_results=10,
database='snp'
)
And parse the data, extracting the HGVS out of summary:
from easy_entrez.parsing import parse_dbsnp_variants
from pandas import Series
def select_protein_hgvs(items):
return [
[sequence, hgvs]
for entry in items
for sequence, hgvs in [entry.split(':')]
if hgvs.startswith('p.')
]
protein_hgvs = (
parse_dbsnp_variants(variant_data)
.summary
.HGVS
.apply(select_protein_hgvs)
.explode()
.dropna()
.apply(Series)
.rename(columns={0: 'sequence', 1: 'hgvs'})
)
protein_hgvs.head()
rs_id sequence hgvs rs1940853486 NP_006437.3 p.Gly203Ter rs1940853414 NP_006437.3 p.Glu202Gly rs1940853378 NP_006437.3 p.Glu202Lys rs1940853299 NP_006437.3 p.Lys201Thr rs1940852987 NP_006437.3 p.Asp198Glu
Use in_batches_of
method to fetch more than 10k entries (e.g. variant_ids
):
snps_result = (
entrez.api
.in_batches_of(1_000)
.fetch(variant_ids, max_results=5_000, database='snp')
)
The result is a dictionary with keys being identifiers used in each batch (because the Entrez API does not always return the indentifiers back) and values representing the result. You can use parse_dbsnp_variants
directly on this dictionary.
When searching GWAS catalog PMID is needed over DOI. You can covert one to the other using:
def doi_term(doi: str) -> str:
"""Clean a DOI string by removing URL prefix."""
doi = (
doi
.replace('http://', 'https://')
.replace('https://doi.org/', '')
)
return f'"{doi}"[Publisher ID]'
result = entrez_api.search(
doi_term('https://doi.org/10.3389/fcell.2021.626821'),
database='pubmed',
max_results=1
)
print(result.data['esearchresult']['idlist'])
['33834021']
Requires Python 3.6+ (though only 3.7+ is tested). Install with:
pip install easy-entrez
If you wish to enable (optional, tqdm
-based) progress bars use:
pip install easy-entrez[with_progress_bars]
If you wish to enable (optional, pandas
-based) parsing utilities use:
pip install easy-entrez[with_parsing_utils]
To build the documentation locally:
pip install -e .[docs]
sphinx-build docs docs/_build
open docs/_build/index.html
You might want to try: