statonlab / tripal_dev_seed

A minified bioinformatics dataset for seeding Tripal sites
GNU General Public License v3.0

Create annotated set for 1, or 2, more species #48

Closed bradfordcondon closed 6 years ago

bradfordcondon commented 6 years ago

C sativus

mkdir -p src_data/C_sativus

curl ftp://ftp.ncbi.nih.gov/genomes/Cucumis_sativus/GFF/ref_ASM407v2_scaffolds.gff3.gz > src_data/C_sativus/gff.gff.gz

curl ftp://ftp.ncbi.nih.gov/genomes/Cucumis_sativus/protein/protein.fa.gz > src_data/C_sativus/prot.fasta.gz

curl ftp://ftp.ncbi.nih.gov/genomes/Cucumis_sativus/RNA/rna.fa.gz > src_data/C_sativus/mRNA.fasta.gz

gunzip src_data/C_sativus/mRNA.fasta.gz
gunzip src_data/C_sativus/prot.fasta.gz
gunzip src_data/C_sativus/gff.gff.gz
./minify.sh src_data/C_sativus/mRNA.fasta src_data/C_sativus/prot.fasta '(.*)' src_data/C_sativus/gff.gff 100 /db

Hebr

mkdir -p src_data/Hebr
curl https://treegenesdb.org/FTP/Genomes/Hebr/v1.0/annotation/Hebr.1_0.cds.fa.gz > src_data/Hebr/Hebr_1.0_mrna.fasta.gz
curl https://treegenesdb.org/FTP/Genomes/Hebr/v1.0/annotation/Hebr.1_0.gff.gz > src_data/Hebr/Hebr_1.0_gff.gff.gz
curl https://treegenesdb.org/FTP/Genomes/Hebr/v1.0/annotation/Hebr.1_0.peptides.fa.gz > src_data/Hebr/Hebr_1.0_prot.fasta.gz

gunzip src_data/Hebr/Hebr_1.0_mrna.fasta.gz
gunzip src_data/Hebr/Hebr_1.0_prot.fasta.gz
gunzip src_data/Hebr/Hebr_1.0_gff.gff.gz

TransDecoder.LongOrfs -t src_data/Hebr/Hebr_1.0_mrna.fasta
mv src_data/Hebr/Hebr_1.0_mrna.fasta.transdecoder_dir/longest_orfs.pep src_data/Hebr/Hebr_1.0_prot.fasta

We regenerate the peptides with TransDecoder because otherwise we might not be able to recover the mRNA name from the polypeptide name. That is a problem when loading annotations, which must be linked to their parent feature via a regular expression.
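As a rough illustration of that linking step, the sketch below applies the same pattern passed to minify.sh, '(.*?)\.p', to a few polypeptide IDs and prints the captured parent mRNA name. The IDs are hypothetical, written in the `<transcript>.p<N>` style that TransDecoder output usually follows, and the helper name `extract_parent` is made up for this example. How minify.sh applies the pattern internally is not shown in this thread, so this only demonstrates what the regular expression itself captures. It assumes perl is on the PATH, which is likely since TransDecoder needs it.

```shell
# Hypothetical helper: print the capture group of '(.*?)\.p' for each ID on stdin.
extract_parent() {
  perl -ne 'print "$1\n" if /(.*?)\.p/'
}

# Hypothetical polypeptide IDs, not taken from the real Hebr files.
parents=$(printf '%s\n' \
  'Hebr_T000001.p1' \
  'Hebr_T000001.p2' \
  'Hebr_T000002.1.p1' | extract_parent)

echo "$parents"
```

Because the match is non-greedy and stops at the first `.p`, a transcript name that itself contains `.p` would be cut short, so it is worth spot-checking a few real IDs before loading.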

./minify.sh \
  src_data/Hebr/Hebr_1.0_mrna.fasta \
  src_data/Hebr/Hebr_1.0_prot.fasta \
  '(.*?)\.p' \
  src_data/Hebr/Hebr_1.0_gff.gff \
  Name \
  200 \
  /db

mv out Hebr_mini
./annotate.sh \
Hebr_mini/sequences/mrna_mini.fasta \
Hebr_mini/sequences/polypeptide_mini.fasta \
/fake/db/path \
Hebr

mv out/* Hebr_mini/
rm -r out

bradfordcondon commented 6 years ago

Hi all: if anyone wants to contribute another minified dataset, I'll gladly host it. For now I'm very happy with the one, so closing.