Closed: schifferl closed this issue 3 years ago
Update: it is better to use the omicidx API directly. For example, for Asnicar (BioProject accession PRJNA339914), the Study ID is SRP082656 (see https://www.ncbi.nlm.nih.gov/books/NBK56913/ for an explanation of accession ID types). Note: all BioProjects from SRA will have SRP, ERP, or DRP accessions. The information will be the same.
https://api.omicidx.cancerdatasci.org/sra/studies/SRP082656/runs (best viewed in Firefox)
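As a rough sketch of how the URL above could be built and queried from R: the function names omicidx_runs_url and fetch_study_runs are hypothetical, the endpoint pattern is copied from the URL in this comment, and nothing is assumed about the structure of the JSON that comes back (or about the service still being reachable).

```r
# Hypothetical helper: build the omicidx "runs" URL for a study accession.
# The URL pattern is taken from the example above; it is not a documented guarantee.
omicidx_runs_url <- function(study_accession) {
  stopifnot(is.character(study_accession), length(study_accession) == 1)
  # Studies from SRA carry SRP, ERP, or DRP accessions
  if (!grepl("^[SED]RP[0-9]+$", study_accession)) {
    stop("Expected an SRP, ERP, or DRP study accession, got: ", study_accession)
  }
  sprintf(
    "https://api.omicidx.cancerdatasci.org/sra/studies/%s/runs",
    study_accession
  )
}

# Hypothetical helper: download and parse the JSON response.
# Requires the jsonlite package and network access; the shape of the
# returned object is not assumed here and should be inspected with str().
fetch_study_runs <- function(study_accession) {
  jsonlite::fromJSON(omicidx_runs_url(study_accession))
}

# Usage (not run here because it needs network access):
# runs <- fetch_study_runs("SRP082656")
# str(runs, max.level = 1)
```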
Look at AsnicarF_2017 (example with two body sites) and BritoIL_2016 (example with multiple SRR accessions per sample). The "NCBI_accession" column provides the input for the curatedMetagenomicData_pipeline.sh function, e.g. as shown in curatedMetagenomicData_pipeline_allsamples.sh.
For a BioProject accession or Study accession, we would like:
An initial version of this function has been added in https://github.com/waldronlab/curatedMetagenomicDataCuration/commit/f7dffac9a8631bafd85c41f7eb0b852cb786c03c. Currently it uses an SRP and returns the following:
This is great @cmirzayi! A few code suggestions, but this already looks ready to use.
Add a roxygen2 tag such as
#' @importFrom jsonlite fromJSON
to specify which functions you import from which packages. This isn't strictly necessary since you use the :: operator (which I like), but then you can let roxygen2 update the NAMESPACE file for you. You'll need the packages jsonlite and curl added as "Imports" in the DESCRIPTION file.
Remove the library(jsonlite) and library(curl) commands from above the function (which actually won't even get run because they're not inside the function). Either the importFrom tag or the :: operator (or both) is adequate to make sure the functions are called from the right package, and may be more lightweight than loading the entire package namespace by using library inside the function.
These changes have been made in https://github.com/waldronlab/curatedMetagenomicDataCuration/commit/fc2a4d71b3a6fbf01099de7acc4482cb1239d831
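For illustration, a function documented this way might look like the sketch below. The function name parse_runs_json and its argument are hypothetical and are not taken from the linked commit; the point is only that the roxygen2 tag and the :: call replace any library() call.

```r
#' Parse a JSON listing of runs (hypothetical helper for illustration)
#'
#' @param json_text A JSON string, file path, or URL.
#' @return The parsed object as returned by jsonlite::fromJSON.
#' @importFrom jsonlite fromJSON
#' @export
parse_runs_json <- function(json_text) {
  # Explicit namespace: works without attaching jsonlite via library()
  jsonlite::fromJSON(json_text)
}
```

With the tag in place, running roxygen2 (for example via devtools::document()) should write the corresponding importFrom line to NAMESPACE, while jsonlite still has to be listed under Imports in DESCRIPTION by hand.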
Works well, closing.
Overview
The curatedMetagenomicData package derives all of its sequencing data from the NCBI Sequence Read Archive (SRA) and combines this information with curated study/sample metadata. Currently, sequencing data is obtained manually from SRA identifiers and we wish to automate the process. From a BioProject ID it is possible to obtain all SRA run identifiers from a study (i.e. SRR identifiers) that can then be used to download and process sequencing data. Using the SRAdbV2 package, we would like to produce a table of SRR identifiers and all required metadata fields. This table will then be used in downstream automation.
Process
Wrapping SRAdbV2
The first step will be to wrap the SRAdbV2 package such that from only a BioProject ID a table of SRR identifiers can be produced. The call should resemble the following.
The method should return a data.frame containing a single SRR identifier on each line, as well as the BioProject ID. The output might look like the following.
Required Metadata Fields
Once the table of SRR identifiers has been produced, it should then be augmented with all required fields from the metadata template file. These fields should be inserted as columns of blank values to be filled in during curation and produce output similar to the following.
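A minimal sketch of this augmentation step in base R follows. The SRR values are placeholders, the column names bioproject and run_accession are assumptions, and the field names in required_fields stand in for whatever the metadata template file actually lists; add_blank_fields is a hypothetical helper.

```r
# Placeholder table standing in for the output of the SRAdbV2 wrapper;
# the SRR identifiers below are made up for illustration.
srr_table <- data.frame(
  bioproject = "PRJNA339914",
  run_accession = c("SRR0000001", "SRR0000002", "SRR0000003"),
  stringsAsFactors = FALSE
)

# Placeholder field names; the real list would come from the template file.
required_fields <- c("sampleID", "subjectID", "body_site")

# Hypothetical helper: add each missing field as a column of blank (NA) values.
add_blank_fields <- function(tbl, fields) {
  for (field in setdiff(fields, names(tbl))) {
    tbl[[field]] <- rep(NA_character_, nrow(tbl))
  }
  tbl
}

curation_table <- add_blank_fields(srr_table, required_fields)
```

Using NA rather than empty strings keeps the blanks distinguishable from curated values that happen to be empty; they can still be written out as empty cells later.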
Tab Delimited Output
Finally, the data.frame output should be written to a TSV (tab separated value) file using the readr package. If the entire process was written as a series of steps using the magrittr package, it should look something like the following.
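One way such a pipeline might be sketched, assuming the magrittr and readr packages are installed: get_srr_table and add_blank_fields are hypothetical stand-ins (the first returns placeholder SRR identifiers instead of querying SRA), and the field names and output file name are illustrative only.

```r
library(magrittr)  # provides the %>% pipe

# Hypothetical stand-in for the SRAdbV2 wrapper; returns placeholder SRR identifiers.
get_srr_table <- function(bioproject) {
  data.frame(
    bioproject = bioproject,
    run_accession = c("SRR0000001", "SRR0000002"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: add each missing metadata field as a blank (NA) column.
add_blank_fields <- function(tbl, fields) {
  for (field in setdiff(fields, names(tbl))) {
    tbl[[field]] <- rep(NA_character_, nrow(tbl))
  }
  tbl
}

# Written to a temporary directory here; a real run would pick a project path.
out_file <- file.path(tempdir(), "PRJNA339914_runs.tsv")

"PRJNA339914" %>%
  get_srr_table() %>%
  add_blank_fields(c("sampleID", "body_site")) %>%
  readr::write_tsv(out_file, na = "")  # na = "" writes blanks as empty cells
```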