ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

How to download large Clinvar tabular table #182

Open manburst opened 1 year ago

manburst commented 1 year ago

Dear Sir,

I am working on a project that requires looking up a lot of variant data in ClinVar. This is time consuming because ClinVar seems to limit each search to 20 queries. For example, to get the data for 2 specific variants I search like this:

(2[chr] AND 47168774[chrpos37] AND TTC7A[gene] AND p.Leu32Val[varname] AND c.94C>G[varname]) OR (3[chr] AND 9974774[chrpos37] AND IL17RC[gene] AND p.Leu625Val[varname] AND c.1873C>G[varname])

The results page has a Download option that can export the results as tabular data (a table file). Can the rentrez package download this tabular file, and could you please give an example of the commands to fetch the data? Also, how is the number of queries per search limited when searching from the command line?

Thanks in advance,
JK

allenbaron commented 1 year ago

The package maintainer isn't currently available to reply. You may be able to find what you need in the Entrez Utilities documentation. The ClinVar docs do say that ClinVar is available via E-utilities.

Apologies that I don't have more time to assist myself.

vestalgd commented 9 months ago

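library(rentrez)
library(tibble)    # as_tibble(), rownames_to_column()
library(magrittr)  # %>%

# the parentheses must be balanced so that each variant is its own group
q1 <- "(2[chr] AND 47168774[chrpos37] AND TTC7A[gene] AND p.Leu32Val[varname] AND c.94C>G[varname]) OR (3[chr] AND 9974774[chrpos37] AND IL17RC[gene] AND p.Leu625Val[varname] AND c.1873C>G[varname])"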

# search for gene or topic of interest

search <- entrez_search(db = "clinvar", term = q1, use_history = TRUE)

# with retmode = "xml", entrez_summary() returns at most 9999 ClinVar variants; if you have more than 9999, you need to split the request into chunks.

summary <- entrez_summary(db = "clinvar", web_history = search$web_history, retmode = "xml")
summary
summary_cv <- extract_from_esummary(summary, c("obj_type", "accession", "accession_version", "title", "variation_set", "trait_set", "supporting_submissions", "clinical_significance", "record_status", "gene_sort", "chr_sort", "location_sort", "variation_set_name", "variation_set_id", "genes", "protein_change", "fda_recognized_database"))
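For result sets above the limit mentioned in the comment above, one possible way to chunk the request is to page through the web history with the E-utilities `retstart` and `retmax` parameters, which `entrez_summary()` forwards through `...`. This is only a sketch and has not been run against the live service: the helper names `chunk_starts()` and `fetch_summaries_in_chunks()` and the page size of 500 are assumptions for illustration, not part of rentrez.

```r
# Zero-based starting offsets for pages of `page_size` records.
# (Hypothetical helper; E-utilities counts retstart from 0.)
chunk_starts <- function(total, page_size = 500L) {
  if (total <= 0) return(integer(0))
  seq.int(0L, total - 1L, by = page_size)
}

# Fetch summaries page by page from a search made with use_history = TRUE,
# extract the requested fields from each page, and bind the per-page
# matrices column-wise (fields are rows, records are columns).
# (Hypothetical helper; page_size = 500 is an arbitrary choice.)
fetch_summaries_in_chunks <- function(search, elements, page_size = 500L) {
  pieces <- lapply(chunk_starts(search$count, page_size), function(start) {
    page <- rentrez::entrez_summary(
      db = "clinvar",
      web_history = search$web_history,
      retmode = "xml",
      retstart = start,
      retmax = page_size,
      always_return_list = TRUE  # keep a list even if a page holds one record
    )
    rentrez::extract_from_esummary(page, elements)
  })
  do.call(cbind, pieces)
}

# Usage (requires network access), with `search` from entrez_search() above:
# summary_cv <- fetch_summaries_in_chunks(search, c("accession", "title", "genes"))
```

Extracting the fields page by page and binding the matrices keeps the result in the same shape as a single `extract_from_esummary()` call, so the later steps should not need to change.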

# the output is a matrix; this code transposes it with t() and turns the result into a tibble; from here you can use tidyr's unnest_wider() or unnest_longer() to pull out the list columns

cv_extract_final <- summary_cv %>%
  t() %>%
  as_tibble(rownames = NA) %>%
  rownames_to_column(var = "ID")
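Since the original question was about getting a tabular file, here is a base-R sketch of one way to flatten such a matrix and write it as a tab-delimited file, as an alternative to unnesting. The `toy` matrix is made-up stand-in data with the same shape as the `extract_from_esummary()` output (fields as rows, records as columns, list cells), and `flatten_cell()` and `summary_to_table()` are hypothetical helpers, not rentrez functions.

```r
# Made-up stand-in for the matrix returned by extract_from_esummary();
# the IDs and values are invented for illustration only.
toy <- matrix(
  list("ACC0001", "first variant", c("GENE1", "GENE2"),
       "ACC0002", "second variant", "GENE3"),
  nrow = 3,
  dimnames = list(c("accession", "title", "genes"), c("101", "102"))
)

# Collapse one cell (possibly a vector or nested list) to a single string.
flatten_cell <- function(x) {
  values <- unlist(x, use.names = FALSE)
  if (length(values) == 0) return(NA_character_)
  paste(values, collapse = "; ")
}

# Transpose so records are rows, flatten every cell, and add an ID column.
summary_to_table <- function(m) {
  tm <- t(m)
  flat <- matrix(vapply(tm, flatten_cell, character(1)),
                 nrow = nrow(tm), dimnames = dimnames(tm))
  out <- data.frame(ID = rownames(flat), flat,
                    stringsAsFactors = FALSE, check.names = FALSE)
  rownames(out) <- NULL
  out
}

tab <- summary_to_table(toy)

# Write a tab-delimited file; a temporary path is used here as a placeholder.
out_file <- tempfile(fileext = ".tsv")
write.table(tab, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Joining multi-valued cells with "; " loses the nested structure, so this only suits fields that are simple values or short vectors; deeply nested fields such as `variation_set` are probably better handled with the unnest functions.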