ropensci / onekp

Access sequences from the 1000 Plant Initiative (1KP)
https://docs.ropensci.org/onekp

onekp

The 1000 Plants initiative (1KP) provides transcriptome sequences for over 1000 plants from diverse lineages. onekp allows researchers in plant genomics and transcriptomics to access this dataset through a simple R interface. The metadata for each transcriptome project is scraped from the 1KP project website and includes the species, tissue, and research group for each sequence sample. onekp uses taxizedb, a local-database version of the taxize package, to filter the metadata by taxonomic group (entered as either a taxon name or an NCBI taxon ID). The raw nucleotide or translated peptide sequences can then be downloaded for the full, or filtered, table of transcriptome projects.

Alternatives to onekp

The data may also be accessed directly through CyVerse (previously iPlant). CyVerse distributes data efficiently using the iRODS data system. This approach is preferable for high-throughput use cases or where iRODS is already in use. Further, accessing data straight from the source at CyVerse is more stable than scraping it from the project website. However, the onekp R package is generally easier to use (it requires neither iRODS nor the CyVerse API) and offers filtering by sample code, species, and clade.

Contact info

1KP staff

R package maintainer

Installation

onekp is on CRAN, but the CRAN release is currently a little out of date, so for now it is better to install from GitHub.

# install.packages('devtools')  # run first if devtools is not installed
library(devtools)
install_github('ropensci/onekp')

Examples

Retrieve the protein and gene transcript FASTA files for two 1KP transcriptomes:

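library(onekp)
# scrape the 1KP metadata, then keep two transcriptomes by their 1KP codes
onekp <- retrieve_onekp()
seqs <- filter_by_code(onekp, c('URDJ', 'ROAP'))
# download the peptide and nucleotide FASTA files
download_peptides(seqs, 'oneKP/pep')
download_nucleotides(seqs, 'oneKP/nuc')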

This will create the following directory structure:

oneKP
 ├── nuc 
 │   ├── ROAP.fna
 │   └── URDJ.fna
 └── pep
     ├── ROAP.faa
     └── URDJ.faa
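
As a quick sanity check after downloading, the following base-R sketch counts the records in each FASTA file by counting header lines (lines that start with '>'). The helper name count_fasta_records is hypothetical and not part of onekp. It assumes the directory layout shown above and the .faa/.fna file extensions.

```r
# Hypothetical helper (not part of onekp): count FASTA records per file
# by counting header lines, which start with '>'.
count_fasta_records <- function(dir, pattern = "\\.(faa|fna)$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  counts <- vapply(
    files,
    function(f) sum(startsWith(readLines(f, warn = FALSE), ">")),
    integer(1)
  )
  names(counts) <- basename(files)
  counts
}

# Returns a named integer vector, one entry per FASTA file found
# (empty if the directory does not exist or holds no matching files).
count_fasta_records('oneKP/pep')
```

A count of zero for a file suggests that its download was incomplete or that the file is not in FASTA format.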

onekp can also filter by species name, NCBI taxon ID, or clade.

# filter by species name
filter_by_species(onekp, 'Pinus radiata')

# filter by species NCBI taxon ID
filter_by_species(onekp, 3347)

# filter by clade scientific name (get all data for the Brassicaceae family)
filter_by_clade(onekp, 'Brassicaceae')

# filter by clade NCBI taxon ID
filter_by_clade(onekp, 3700)

So to get the protein and nucleotide sequences for all species in Brassicaceae:

onekp <- retrieve_onekp()
seqs <- filter_by_clade(onekp, 'Brassicaceae')
download_peptides(seqs, 'oneKP/pep')
download_nucleotides(seqs, 'oneKP/nuc')

Funding

Development of this R package was supported by the National Science Foundation under Grant No. IOS 1546858.

Contributing

We welcome any contributions!
