statonlab / hardwoods_site

Hardwoods Genomics bugs, data loading, and general issues
GNU General Public License v3.0

Automated Annotation - January 2019 #483

Open MattHuff opened 5 years ago

MattHuff commented 5 years ago

I've copied the mRNA and polypeptide files generated by the current Automated Annotation protocol to the ACF; they can be found in /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/fasta_012219. I have split the mRNA CDS file into 3400 separate files, and the protein file into 1000 files.
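
For reference, here is a minimal sketch of one way to split a FASTA file into numbered chunks. It is not necessarily the tool that produced the splits above; the function name `split_fasta` and its arguments are hypothetical, and it assumes a well-formed FASTA file whose first line is a header. Because the chunk size is rounded up, it writes at most the requested number of files and never creates an empty one.

```shell
#!/bin/bash
# Hypothetical helper: split FASTA into at most n_chunks files named
# <out_prefix>.1, <out_prefix>.2, ... with whole sequences in each file.
split_fasta() {
    local fasta=$1 n_chunks=$2 out_prefix=$3
    local total per_chunk
    # Count sequences by counting header lines
    total=$(grep -c '^>' "$fasta")
    if [ "$total" -eq 0 ]; then
        echo "no sequences found in $fasta" >&2
        return 1
    fi
    # Ceiling division: sequences per chunk
    per_chunk=$(( (total + n_chunks - 1) / n_chunks ))
    awk -v per="$per_chunk" -v prefix="$out_prefix" '
        /^>/ {
            # Start a new output file every "per" sequences
            if (count % per == 0) {
                if (out != "") close(out)
                out = prefix "." (count / per + 1)
            }
            count++
        }
        { print > out }
    ' "$fasta"
}
```

Called as `split_fasta mRNA.fasta 3400 splits_mRNA/mRNA.fasta`, this would produce files in the `mRNA.fasta.<index>` form that the array jobs read via `$PBS_ARRAYID`.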

Going forward, I believe the best option is for each of us to run BLAST on 680 of the 3400 mRNA files. I chose to split my share into two array jobs, running the first 340 files as one job and the second 340 as its own job. For IPS, use a similar strategy.
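
To make the division concrete, here is a small sketch that prints the `#PBS -t` range for each job. The function name `print_ranges` and its parameters are hypothetical, and it assumes the file count divides evenly (3400 files, 5 people, 2 jobs per person gives 340 files per job).

```shell
#!/bin/bash
# Hypothetical helper: print the PBS array range for every person and job.
# Assumes total_files divides evenly by people * jobs_per_person.
print_ranges() {
    local total_files=$1 people=$2 jobs_per_person=$3
    local per_person=$(( total_files / people ))
    local per_job=$(( per_person / jobs_per_person ))
    local p j start end
    for (( p = 0; p < people; p++ )); do
        for (( j = 0; j < jobs_per_person; j++ )); do
            start=$(( p * per_person + j * per_job + 1 ))
            end=$(( start + per_job - 1 ))
            echo "person $(( p + 1 )) job $(( j + 1 )): #PBS -t ${start}-${end}"
        done
    done
}

print_ranges 3400 5 2
```

The first line printed is `person 1 job 1: #PBS -t 1-340`, which matches the range in the sample script below.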

Here is my sample code for running swissprot BLAST:

#PBS -N matt_swissprot_1
#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 1-340
#PBS -l nodes=1:ppn=2
#PBS -l walltime=08:00:00

cd $PBS_O_WORKDIR

module load blast

blastx \
 -query /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/fasta_012219/splits_mRNA/mRNA.fasta.$PBS_ARRAYID \
 -db /lustre/haven/gamma/staton/library/uniprot/uniprot_sprot.fasta \
 -out /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/fasta_012219/blast/sprot/mRNA_sprot.$PBS_ARRAYID.xml \
 -evalue 1e-5 \
 -outfmt 5

patricksis commented 5 years ago

I'll do files 681-1360

RaymondS1 commented 5 years ago

I'll do 1361-2040 (1361-1700 & 1701-2040)

CaseyRichards92 commented 5 years ago

I've got 2041-2720 for cds Swissprot

#PBS -N casey_swissprot_1
#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 2041-2720
#PBS -l nodes=1:ppn=2
#PBS -l walltime=08:00:00

cd $PBS_O_WORKDIR

module load blast

blastx \
 -query /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/Matt_fasta_012219/splits_mRNA/mRNA.fasta.$PBS_ARRAYID \
 -db /lustre/haven/gamma/staton/library/uniprot/uniprot_sprot.fasta \
 -out /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/swissprot/mRNA_sprot.$PBS_ARRAYID.xml \
 -evalue 1e-5 \
 -outfmt 5

TrEMBL for 2041-2420

#PBS -N casey_trembl_1
#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 2041-2420
#PBS -l nodes=1:ppn=2
#PBS -l walltime=15:00:00

cd $PBS_O_WORKDIR

module load blast

blastx \
 -query /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/Matt_fasta_012219/splits_mRNA/mRNA.fasta.$PBS_ARRAYID \
 -db /lustre/haven/gamma/staton/library/uniprot/uniprot_trembl_plants_July_2018.fasta \
 -out /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/trembl/mRNA_sprot.$PBS_ARRAYID.xml \
 -evalue 1e-5 \
 -outfmt 5

TrEMBL for 2421-2720

#PBS -N casey_trembl_2
#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 2421-2720
#PBS -l nodes=1:ppn=2
#PBS -l walltime=15:00:00

cd $PBS_O_WORKDIR

module load blast

blastx \
 -query /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/Matt_fasta_012219/splits_mRNA/mRNA.fasta.$PBS_ARRAYID \
 -db /lustre/haven/gamma/staton/library/uniprot/uniprot_trembl_plants_July_2018.fasta \
 -out /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/trembl/mRNA_sprot.$PBS_ARRAYID.xml \
 -evalue 1e-5 \
 -outfmt 5

MattHuff commented 5 years ago

For IPS, I am doing files 1-200. The code I used is as follows:

#PBS -N autoanno_matt_ips
#PBS -A ACF-UTK0011
#PBS -S /bin/bash
#PBS -t 1-200
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l walltime=3:30:00

cd $PBS_O_WORKDIR

/lustre/haven/gamma/staton/software/interproscan-5.28-67.0/interproscan.sh \
 -i /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/Matt_fasta_012219/splits_polypeptide/polypeptide.fasta.$PBS_ARRAYID \
 -f XML \
 -d /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/Matt_fasta_012219/IPS/xmls \
 --disable-precalc \
 --iprlookup \
 --goterms \
 --pathways \
 --tempdir /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/Matt_fasta_012219/IPS/TMP \
 > /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/Matt_fasta_012219/IPS/TMP/$PBS_ARRAYID.out

CaseyRichards92 commented 5 years ago

I will take IPS 201-400

#PBS -N casey_ips
#PBS -A ACF-UTK0011
#PBS -S /bin/bash
#PBS -t 201-400
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l walltime=3:30:00

cd $PBS_O_WORKDIR

/lustre/haven/gamma/staton/software/interproscan-5.28-67.0/interproscan.sh \
 -i /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/Matt_fasta_012219/splits_polypeptide/polypeptide.fasta.$PBS_ARRAYID \
 -f XML \
 -d /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/ips/xmls \
 --disable-precalc \
 --iprlookup \
 --goterms \
 --pathways \
 --tempdir /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/ips/tmp \
 > /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/ips/tmp/$PBS_ARRAYID.out

patricksis commented 5 years ago

I'll do 401-600

CaseyRichards92 commented 5 years ago

I'll do IPS 601-800. Actually, I'll go ahead and finish up 801-999 as well.

#PBS -N casey_ips
#PBS -A ACF-UTK0011
#PBS -S /bin/bash
#PBS -t 601-800
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l walltime=3:30:00

cd $PBS_O_WORKDIR

/lustre/haven/gamma/staton/software/interproscan-5.28-67.0/interproscan.sh \
 -i /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/Matt_fasta_012219/splits_polypeptide/polypeptide.fasta.$PBS_ARRAYID \
 -f XML \
 -d /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/ips/xmls \
 --disable-precalc \
 --iprlookup \
 --goterms \
 --pathways \
 --tempdir /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/ips/tmp \
 > /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/ips/tmp/$PBS_ARRAYID.out

#PBS -N casey_ips
#PBS -A ACF-UTK0011
#PBS -S /bin/bash
#PBS -t 801-999
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l walltime=3:30:00

cd $PBS_O_WORKDIR

/lustre/haven/gamma/staton/software/interproscan-5.28-67.0/interproscan.sh \
 -i /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/Matt_fasta_012219/splits_polypeptide/polypeptide.fasta.$PBS_ARRAYID \
 -f XML \
 -d /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/ips/xmls \
 --disable-precalc \
 --iprlookup \
 --goterms \
 --pathways \
 --tempdir /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/ips/tmp \
 > /lustre/haven/gamma/staton/projects/undergrads/automated_annotation/cricha59/blast/ips/tmp/$PBS_ARRAYID.out

MattHuff commented 5 years ago

IPS is done; what is the status on sprot and trembl? I want to have this done by the end of the week. Has anyone run BLAST on the last 680 files?
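
A quick way to answer this per range is to list the array indices whose BLAST output is missing or empty. The sketch below assumes the `<prefix>.<index>.xml` naming used by the BLAST scripts in this thread; the function name `missing_outputs` is hypothetical, and the output directory is whatever each person passed to `-out`.

```shell
#!/bin/bash
# Hypothetical helper: print every index in [start, end] whose output file
# <dir>/<prefix>.<index>.xml is missing or has zero size.
missing_outputs() {
    local dir=$1 prefix=$2 start=$3 end=$4 i
    for (( i = start; i <= end; i++ )); do
        # -s is true only if the file exists and is larger than zero bytes
        [ -s "$dir/$prefix.$i.xml" ] || echo "$i"
    done
}

# Example call (directory is a placeholder for your own output location):
# missing_outputs /path/to/blast/sprot mRNA_sprot 1 340
```

An empty result means every index in the range produced a non-empty XML file; it does not confirm that BLAST finished writing each file, so the job logs are still worth a look.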

patricksis commented 5 years ago

I'll do the last 680 (2721-3400)

patricksis commented 5 years ago

I believe all the files are done. A few files (3383-3400) have no data associated with them for both TrEMBL and Swiss-Prot. This may be due to the issue related to Juglans cathayensis.

patricksis commented 5 years ago

It looks like we split the files too many times, so the last 17 files are empty regardless of any errors.
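
For future runs, a check like the following could find the last split file that actually contains data, so the `#PBS -t` range can stop there. This is only a sketch; the function name `last_nonempty_index` is hypothetical, and it assumes the splits are named `<prefix>.<index>` as in the scripts above.

```shell
#!/bin/bash
# Hypothetical helper: print the highest index in [1, max] for which
# <prefix>.<index> exists and is non-empty. Returns 1 if none is found.
last_nonempty_index() {
    local prefix=$1 max=$2 i
    for (( i = max; i >= 1; i-- )); do
        if [ -s "$prefix.$i" ]; then
            echo "$i"
            return 0
        fi
    done
    return 1
}

# Example call (path is a placeholder for the split directory):
# last_nonempty_index /path/to/splits_mRNA/mRNA.fasta 3400
```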

almasaeed2010 commented 5 years ago

Any progress on this?

MattHuff commented 5 years ago

I'm currently copying all remaining files to the dev server. It's taking a while because there are so many files, and the TrEMBL XML outputs in particular take forever to finish. Do you think memory will be an issue for continuing this? I know one of our servers recently hit its memory limit, and I wasn't able to continue loading the XMLs until it was resolved.

I'll update this post once all files are finished loading.

almasaeed2010 commented 5 years ago

When the XMLs are ready, we can upgrade our storage to handle them. However, for the time being, let's keep everything on staton servers if possible.