The 1000 Plants initiative (1KP) provides transcriptome sequences for over
1000 plants from diverse lineages. onekp allows researchers in plant genomics
and transcriptomics to access this dataset through a simple R interface. The
metadata for each transcriptome project is scraped from the 1KP project
website. This metadata includes the species, tissue, and research group for
each sequence sample. onekp leverages the taxonomy program taxizedb, a local
database version of the taxize package, to allow filtering of the metadata by
taxonomic group (entered as either a taxon name or NCBI ID). The raw nucleotide
or translated peptide sequences can then be downloaded for the full, or
filtered, table of transcriptome projects.
The data may also be accessed directly through CyVerse (previously iPlant).
CyVerse efficiently distributes data using the iRODS data system. This approach
is preferable for high-throughput cases or where iRODS is already in use.
Further, accessing data straight from the source at CyVerse is more stable than
scraping it from the project website. However, the onekp R package is generally
easier to use (no iRODS dependency or CyVerse API) and offers powerful
filtering solutions.
1KP staff
Gane Ka-Shu Wong - Principal investigator
Michael Deyholos - Alberta co-investigator
Yong Zhang - Shenzhen co-investigator
Eric Carpenter - Database manager
R package maintainer
onekp is on CRAN, but the CRAN version is currently a little out of date, so
for now it is better to install from GitHub.
library(devtools)
install_github('ropensci/onekp')
Retrieve the protein and gene transcript FASTA files for two 1KP transcriptomes:
onekp <- retrieve_onekp()
seqs <- filter_by_code(onekp, c('URDJ', 'ROAP'))
download_peptides(seqs, 'oneKP/pep')
download_nucleotides(seqs, 'oneKP/nuc')
This will create the following directory:
oneKP
├── nuc
│   ├── ROAP.fna
│   └── URDJ.fna
└── pep
    ├── ROAP.faa
    └── URDJ.faa
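The layout above suggests that each file is named after its 1KP sample code, with a `.faa` extension for peptides and `.fna` for nucleotides. As a minimal sketch under that assumption, the base R helpers below build the paths you would expect and report which of them are present on disk. Both `expected_fasta_paths` and `check_downloads` are hypothetical names used for illustration; they are not part of onekp.

```r
# Hypothetical helper (not part of onekp): build the file paths expected
# for a set of 1KP sample codes, assuming files are named <code>.<ext>.
expected_fasta_paths <- function(codes, dir, ext) {
  file.path(dir, paste0(codes, ".", ext))
}

# Hypothetical helper: report which of the expected files exist on disk.
check_downloads <- function(codes, dir, ext) {
  paths <- expected_fasta_paths(codes, dir, ext)
  data.frame(code = codes, path = paths, exists = file.exists(paths),
             stringsAsFactors = FALSE)
}

codes <- c("URDJ", "ROAP")
print(check_downloads(codes, file.path("oneKP", "pep"), "faa"))
print(check_downloads(codes, file.path("oneKP", "nuc"), "fna"))
```

A row with `exists = FALSE` would point to a download that did not complete or to a file naming scheme that differs from the one assumed here.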
onekp can also filter by species name, NCBI taxon ID, or clade.
# filter by species name
filter_by_species(onekp, 'Pinus radiata')
# filter by species NCBI taxon ID
filter_by_species(onekp, 3347)
# filter by clade scientific name (get all data for the Brassicaceae family)
filter_by_clade(onekp, 'Brassicaceae')
# filter by clade NCBI taxon ID
filter_by_clade(onekp, 3700)
So to get the protein and nucleotide sequences for all species in Brassicaceae:
onekp <- retrieve_onekp()
seqs <- filter_by_clade(onekp, 'Brassicaceae')
download_peptides(seqs, 'oneKP/pep')
download_nucleotides(seqs, 'oneKP/nuc')
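After a clade-level download it can be useful to confirm that each file actually contains sequences. The following is a minimal sketch using base R only. `count_fasta_records` and `count_dir_records` are hypothetical helpers, not onekp functions, and the sketch assumes the downloaded files are plain-text FASTA in which every record starts with a `>` header line.

```r
# Hypothetical helper (not part of onekp): count records in a FASTA file
# by counting header lines, which begin with ">".
count_fasta_records <- function(path) {
  lines <- readLines(path, warn = FALSE)
  sum(startsWith(lines, ">"))
}

# Hypothetical helper: count records in every file under `dir` whose name
# matches `pattern` (peptide files by default).
count_dir_records <- function(dir, pattern = "\\.faa$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  vapply(files, count_fasta_records, integer(1))
}
```

For example, `count_dir_records('oneKP/pep')` would return a named integer vector with one count per peptide file, and `count_dir_records('oneKP/nuc', '\\.fna$')` would do the same for nucleotide files. Because `readLines` loads each file fully into memory, this approach is best suited to quick sanity checks rather than very large transcriptomes.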
Development of this R package was supported by the National Science Foundation under Grant No. IOS 1546858.
We welcome any contributions!