SRA Automation Specification

schifferl commented 6 years ago

Overview

The curatedMetagenomicData package derives all of its sequencing data from the NCBI Sequence Read Archive (SRA) and combines this information with curated study/sample metadata. Currently, sequencing data is downloaded manually using SRA identifiers, and we wish to automate the process. From a BioProject ID it is possible to obtain all SRA run identifiers for a study (i.e., SRR identifiers), which can then be used to download and process sequencing data. Using the SRAdbV2 package, we would like to produce a table of SRR identifiers and all required metadata fields. This table will then be used in downstream automation.

Process

Wrapping `SRAdbV2`

The first step will be to wrap the SRAdbV2 package so that a table of SRR identifiers can be produced from a BioProject ID alone. The call should resemble the following.

get_SRR(BioProject_ID = "339914")

The function should return a data.frame containing a single SRR identifier in each row, as well as the BioProject ID. The output might look like the following.

BioProject_ID	SRR_ID
339914	SRR4052021
339914	SRR4052022
339914	SRR4052033
339914	...

Required Metadata Fields

Once the table of SRR identifiers has been produced, it should then be augmented with all required fields from the metadata template file. These fields should be inserted as columns of blank values to be filled in during curation, producing output similar to the following.

BioProject_ID	SRR_ID	sampleID	subjectID	body_site	country	...
339914	SRR4052021	...	...	...	...	...
339914	SRR4052022	...	...	...	...	...
339914	SRR4052033	...	...	...	...	...
339914	...	...	...	...	...	...
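The augmentation step described above could be sketched in base R as follows. This is only an illustration under stated assumptions: the function body is hypothetical, the `required_fields` default lists only the four fields visible in the example table (the real list would come from the metadata template file), and blank values are represented as `NA`.

```r
# Hypothetical helper: append required metadata fields as blank (NA) columns.
# `required_fields` stands in for the column names of the metadata template
# file; only the four fields shown in the example table are listed here.
add_required_metadata <- function(srr_table,
                                  required_fields = c("sampleID", "subjectID",
                                                      "body_site", "country")) {
  stopifnot(is.data.frame(srr_table))
  # Skip fields that already exist so existing values are not overwritten
  missing_fields <- setdiff(required_fields, names(srr_table))
  for (field in missing_fields) {
    srr_table[[field]] <- rep(NA_character_, nrow(srr_table))
  }
  srr_table
}

# Example input shaped like the SRR table above
srr_table <- data.frame(
  BioProject_ID = "339914",
  SRR_ID = c("SRR4052021", "SRR4052022", "SRR4052033"),
  stringsAsFactors = FALSE
)
template <- add_required_metadata(srr_table)
```

If the blank cells should be written as empty strings rather than the text "NA", the `na` argument of `readr::write_tsv()` can be set to `""` when the table is saved.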

Tab Delimited Output

Finally, the data.frame output should be written to a TSV (tab-separated values) file using the readr package. If the entire process were written as a series of piped steps using the magrittr package, it should look something like the following.

library(magrittr)  # provides %>%
library(readr)     # provides write_tsv()

get_SRR(BioProject_ID = "339914") %>%
  add_required_metadata() %>%
  write_tsv(file = "inst/curated/AsnicarF_2017/AsnicarF_2017_metadata.tsv")

lwaldron commented 4 years ago

Update: it is better to use the omicidx API directly. For example, for Asnicar (BioProject accession PRJNA339914), the Study ID is SRP082656 (see https://www.ncbi.nlm.nih.gov/books/NBK56913/ for an explanation of accession ID types). Note: all BioProjects from SRA will have SRP, ERP, or DRP accessions. The information will be the same.

https://api.omicidx.cancerdatasci.org/sra/studies/SRP082656/runs (best viewed in Firefox)
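As a rough sketch, the URL for a study's runs could be built from a study accession like this. The endpoint pattern is inferred from the single example URL above, and `omicidx_runs_url` is a hypothetical name. The layout of the JSON response is not described in this thread, so the fetch is left commented out.

```r
# Assumption: the endpoint pattern is inferred from the one example URL in
# this thread; `omicidx_runs_url` is a hypothetical helper name.
omicidx_runs_url <- function(study_accession) {
  stopifnot(is.character(study_accession), length(study_accession) == 1L)
  # Study accessions start with SRP, ERP, or DRP followed by digits
  if (!grepl("^[SED]RP[0-9]+$", study_accession)) {
    stop("Expected an SRP, ERP, or DRP study accession, got: ", study_accession)
  }
  sprintf("https://api.omicidx.cancerdatasci.org/sra/studies/%s/runs",
          study_accession)
}

url <- omicidx_runs_url("SRP082656")

# Fetching needs network access and the jsonlite package. Inspect the
# structure of the response before relying on any particular field.
# runs <- jsonlite::fromJSON(url)
# str(runs, max.level = 1)
```

Rejecting anything that is not an SRP, ERP, or DRP accession catches the easy mistake of passing the BioProject accession (PRJNA339914) where the study accession is expected.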

lwaldron commented 4 years ago

Look at AsnicarF_2017 (example with two body sites) and BritoIL_2016 (example with multiple SRR accessions per sample). The "NCBI_accession" column provides the input for the curatedMetagenomicData_pipeline.sh function, e.g. as shown in curatedMetagenomicData_pipeline_allsamples.sh.

For a BioProject accession or Study accession, we would like:

- associated sample IDs
- associated run IDs (make semicolon-delimited if there are multiple run IDs for a sample)
- sequencing platform
- DNA extraction kit?
- PMID?
- number_reads = spots * 2 (if PAIRED) per sample
- average read length per sample
- any other sample metadata
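To make the per-sample items concrete, here is a minimal base R sketch of collapsing a run-level table to one row per sample. The table, its column names, and the accessions are made up for illustration and are not the API's actual fields. Summing reads across runs is one reading of "per sample", not a settled decision.

```r
# Made-up run-level table; column names and accessions are illustrative only.
runs <- data.frame(
  sample_id = c("S1", "S1", "S2"),
  run_id = c("SRR0000001", "SRR0000002", "SRR0000003"),
  spots = c(100, 50, 200),
  layout = c("PAIRED", "PAIRED", "SINGLE"),
  stringsAsFactors = FALSE
)

# number_reads = spots * 2 for paired-end runs, spots otherwise
runs$number_reads <- runs$spots * ifelse(runs$layout == "PAIRED", 2, 1)

# Collapse to one row per sample: run IDs joined with semicolons,
# reads summed across the sample's runs (an assumption, see above)
run_ids <- aggregate(run_id ~ sample_id, data = runs,
                     FUN = paste, collapse = ";")
reads <- aggregate(number_reads ~ sample_id, data = runs, FUN = sum)
per_sample <- merge(run_ids, reads, by = "sample_id")
```

Average read length per sample is not covered here. It would need a decision on whether to weight each run by its number of reads.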

cmirzayi commented 4 years ago

An initial version of this function has been added in https://github.com/waldronlab/curatedMetagenomicDataCuration/commit/f7dffac9a8631bafd85c41f7eb0b852cb786c03c. Currently it uses an SRP and returns the following:

- Associated sample IDs
- Associated run IDs (as separate rows, not semicolon-delimited yet, because number_reads varies across runs. Which value do I use? An average?)
- Sequencing platform
- Number of reads
- Average read length

lwaldron commented 4 years ago

This is great @cmirzayi! A few code suggestions, but this already looks ready to use.

Add roxygen2 markup to R/get_metadata.R. RStudio has a helper for creating the template for you. With roxygen2 you can:
- use directives like #' @importFrom jsonlite fromJSON to specify which functions you import from which packages. This isn't strictly necessary since you use :: (which I like), but then you can let roxygen2 update the NAMESPACE file for you. You'll need the packages jsonlite and curl added as "Imports" in the DESCRIPTION file.
- remove the library(jsonlite) and library(curl) commands from above the function (which actually won't even get run because they're not inside the function). Either @importFrom or :: (or both) is adequate to make sure the functions are called from the right package, and may be more lightweight than loading the entire package namespace by using library inside the function.
- add your title, etc. to the function's markup
Add a unit test in the tests/testthat directory that just applies the function to an example study and checks that you get the result you expect.
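For illustration, the suggestions above might look roughly like the sketch below. The signature and body of `get_metadata()` here are assumptions, not the function as committed, and the test is shown as comments because it needs testthat and network access. `skip_if_offline()` is a testthat helper that skips the test when no connection is available.

```r
# R/get_metadata.R (hypothetical sketch; the real signature may differ)

#' Retrieve run-level metadata for an SRA study
#'
#' @param study_accession Character scalar, an SRA study accession such as
#'   "SRP082656".
#' @return The parsed API response, as returned by `jsonlite::fromJSON()`.
#' @importFrom jsonlite fromJSON
#' @export
get_metadata <- function(study_accession) {
  stopifnot(is.character(study_accession), length(study_accession) == 1L)
  url <- paste0("https://api.omicidx.cancerdatasci.org/sra/studies/",
                study_accession, "/runs")
  # :: keeps the call explicit even without a library() call
  jsonlite::fromJSON(url)
}

# tests/testthat/test-get_metadata.R (sketch, left commented out here)
# test_that("get_metadata() returns a non-empty result for a known study", {
#   skip_if_offline()
#   res <- get_metadata("SRP082656")
#   expect_gt(length(res), 0)
# })
```

Running `devtools::document()` would then regenerate the NAMESPACE file from the `@importFrom` and `@export` directives.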

cmirzayi commented 4 years ago

These changes have been made in https://github.com/waldronlab/curatedMetagenomicDataCuration/commit/fc2a4d71b3a6fbf01099de7acc4482cb1239d831

lwaldron commented 3 years ago

Works well, closing.
