Organism created https://www.hardwoodgenomics.org/organism/Pistacia/Vera
Publication https://www.hardwoodgenomics.org/Publication/3472001
Transcriptome assembly (w/ Downloads) https://www.hardwoodgenomics.org/Transcriptome-assembly/3472002?tripal_pane=group_summary_tripalpane
Swissprot Annotation https://www.hardwoodgenomics.org/BLAST-annotation/3472003
Trembl Annotation https://www.hardwoodgenomics.org/BLAST-annotation/3472004
IPS Annotation https://www.hardwoodgenomics.org/InterProScan-annotation/3472005
KEGG Annotation https://www.hardwoodgenomics.org/KEGGresults/3472006
CDS FASTA Loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/793452
PEPTIDE FASTA Loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/793465
KEGG FASTA Loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/793456
Swissprot xml loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/800741
Published mRNA-polypeptide https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/793466
@almasaeed2010 as discussed in the meeting, here is the swissprot error I am receiving: https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/793726
looking into it
@almasaeed2010 you might want to wait for @mestato to confirm whether Ahmed used the right transcriptome. We might have to restart it.
This is your error
Database query failed when searching for feature [error]
'TRINITY_DN48859_c0_g1::TRINITY_DN48859_c0_g1_i2::g.62581::m.62581
TRINITY_DN48859_c0_g1::TRINITY_DN48859_c0_g1_i2::g.62581 ORF
type:3prime_partial len:286 (-) TRINITY_DN48859_c0_g1_i2:2-856(-)'.
This basically means the names in your IPS files don't match the feature names in the database. It sounds like you didn't clean up the peptide files before running IPS.
We can wait for @mestato to confirm the files before rerunning, though.
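If clean-up is needed, a minimal sketch of the kind of header fix meant here (file names are hypothetical, not the actual project files): strip everything after the first space so the FASTA IDs match the feature names loaded into Chado.
# illustrative only: keep just the ID portion of each TransDecoder-style header
sed '/^>/ s/ .*//' peptides.fasta > peptides.clean.fasta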
@almasaeed2010 I checked with Meg and this transcriptome is good, so we will be using this data.
/remind me to check Blast files today at 2PM EST
@almasaeed2010 set a reminder for Jul 17th 2019
Looks like the BLAST loader allows for a regular expression to extract the name. This makes our job slightly easier. Can you try loading swissprot with the following expression?
(.*?)
Blast re-ran with reg-ex https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/800051
Re-ran BLAST w/ reg-exp ^(.*?) T
https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/800700
:wave: @almasaeed2010, check Blast files
@almasaeed2010 the Reg-exp did not work for swissprot
try this expression ^(.*?) T
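Purely for illustration, with GNU grep standing in for the loader's regex engine, the first capture group of ^(.*?) T applied to the header from the error above would grab just the leading ID (everything before the space that precedes the second TRINITY token):
echo 'TRINITY_DN48859_c0_g1::TRINITY_DN48859_c0_g1_i2::g.62581::m.62581 TRINITY_DN48859_c0_g1::TRINITY_DN48859_c0_g1_i2::g.62581 ORF type:3prime_partial len:286 (-) TRINITY_DN48859_c0_g1_i2:2-856(-)' \
| grep -oP '^.*?(?= T)'
# prints: TRINITY_DN48859_c0_g1::TRINITY_DN48859_c0_g1_i2::g.62581::m.62581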
@cricha59 I edited the files directly. There are now two sets of files: ones with a .old extension and ones with .xml. If you rerun the job without any regular expressions, it should work.
Looks like the organism doesn't have features loaded. Try using the .clean CDS file to load the sequences, then try reloading BLAST.
New CDS FASTA loader w/ .clean CDS https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809705
New Blast Job with new xmls https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809706
New Peptide loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809707
We had to cancel the jobs for the CDS, BLAST, and peptides because they deadlocked.
Technical note: it looks like Tripal was unable to handle the large CDS file, possibly because the loader runs inside a transaction. We might have to look into this more before resubmitting the job.
Find a better Trinity assembly and collapse it to a smaller size
Emailed the author for their Trinity assembly
Moving forward with CAP3 or cd-hit-est.
Use Trinity assembly for cd-hit-est
Followed the cd-hit-est user guide to generate a filtered FASTA: http://www.bioinformatics.org/cd-hit/cd-hit-user-guide.pdf
/staton/software/cd-hit-v4.6.6-2016-0711/cd-hit-est -i Trinity.fasta -o Trinity_cd.fasta -c 0.95 -n 8
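Here -c 0.95 clusters at 95% sequence identity and -n 8 sets the word size. Not one of the commands actually run here, but a quick way to see how much the clustering collapsed the assembly is to compare sequence counts before and after:
grep -c '^>' Trinity.fasta Trinity_cd.fasta   # transcript counts before vs. after cd-hit-est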
Transdecoder
/staton/software/TransDecoder-3.0.0/TransDecoder.LongOrfs -t /staton/projects/aelshaar/assembly/4_Trinity/Trinity_cd.fasta
Run BLAST, then run max_target_seqs.
https://gist.github.com/almasaeed2010/702a7a79021be496bd6a5d79954395ea
Blast refinement
blastp \
-query Trinity.fasta.transdecoder_dir/longest_orfs.pep \
-db /staton/libraries/uniprot/uniprot_sprot.fasta \
-max_target_seqs 1 \
-outfmt 6 \
-evalue 1e-5 \
-num_threads 10 > blastp.outfmt6
Transdecoder rd 2
/staton/software/TransDecoder-3.0.0/TransDecoder.Predict -t Trinity_cd.fasta --retain_blastp_hits Trinity_cd.fasta.transdecoder_dir/blastp.outfmt6
Received error:
Approximated maximum memory consumption: 112M
writing new database
writing clustering information
program completed !
Total CPU time 1030.17
CMD: /staton/software/TransDecoder-3.0.0/util/get_top_longest_fasta_entries.pl Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.top_longest_5000.nr80 500 > Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.top_500_longest
CMD: touch Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.top_500_longest.ok
CMD: /staton/software/TransDecoder-3.0.0/util/seq_n_baseprobs_to_logliklihood_vals.pl Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.top_500_longest Trinity_cd.fasta.transdecoder_dir/base_freqs.dat > Trinity_cd.fasta.transdecoder_dir/hexamer.scores
CMD: touch Trinity_cd.fasta.transdecoder_dir/hexamer.scores.ok
CMD: /staton/software/TransDecoder-3.0.0/util/score_CDS_liklihood_all_6_frames.pl Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds Trinity_cd.fasta.transdecoder_dir/hexamer.scores > Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.scores
CMD: touch Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.scores.ok
Error, cannot find file blastp.outfmt6 at /staton/software/TransDecoder-3.0.0/TransDecoder.Predict line 382.
@almasaeed2010 any idea why TransDecoder rd 2 didn't create the cds and pep files?
@almasaeed2010 disregard, issue fixed.
UPDATED TRANSDECODER RD 2
/staton/software/TransDecoder-3.0.0/TransDecoder.Predict -t Trinity_cd.fasta --retain_blastp_hits Trinity_cd.fasta.transdecoder_dir/blastp.outfmt6
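To confirm this round actually produced output, it is enough to check for TransDecoder's usual <input>.transdecoder.* files in the working directory (these checks are illustrative, not commands recorded in this issue):
ls -lh Trinity_cd.fasta.transdecoder.pep Trinity_cd.fasta.transdecoder.cds \
       Trinity_cd.fasta.transdecoder.gff3 Trinity_cd.fasta.transdecoder.bed
grep -c '^>' Trinity_cd.fasta.transdecoder.pep   # number of predicted peptides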
BLAST
#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 1-200
#PBS -l nodes=1:ppn=2
#PBS -l walltime=06:00:00
cd $PBS_O_WORKDIR
module load blast
blastx \
-query /lustre/haven/gamma/staton/projects/undergrads/pistacio/cds_splits/pistacio.cds.$PBS_ARRAYID \
-db /lustre/haven/gamma/staton/library/uniprot/uniprot_sprot.fasta \
-out /lustre/haven/gamma/staton/projects/undergrads/pistacio/blast/swissprot/pistacio_swissprot.$PBS_ARRAYID.xml \
-evalue 1e-5 \
-outfmt 5
#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 201-400
#PBS -l nodes=1:ppn=2
#PBS -l walltime=06:00:00
cd $PBS_O_WORKDIR
module load blast
blastx \
-query /lustre/haven/gamma/staton/projects/undergrads/pistacio/cds_splits/pistacio.cds.$PBS_ARRAYID \
-db /lustre/haven/gamma/staton/library/uniprot/uniprot_sprot.fasta \
-out /lustre/haven/gamma/staton/projects/undergrads/pistacio/blast/swissprot/pistacio_swissprot.$PBS_ARRAYID.xml \
-evalue 1e-5 \
-outfmt 5
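Both array scripts assume the CDS FASTA was already split into numbered chunks (pistacio.cds.1 through pistacio.cds.400). The actual splitting command isn't recorded in this issue; a minimal awk sketch that would produce that kind of numbered split (input name and chunk size are illustrative) is:
awk -v size=250 '/^>/ { if (n % size == 0) { close(out); out = sprintf("pistacio.cds.%d", ++part) } n++ } { print > out }' pistacio.cds
# writes pistacio.cds.1, pistacio.cds.2, ... with up to 250 sequences each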
Trembl
#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 1-200
#PBS -l nodes=1:ppn=2
#PBS -l walltime=15:00:00
cd $PBS_O_WORKDIR
module load blast
blastx \
-query /lustre/haven/gamma/staton/projects/undergrads/pistacio/cds_splits/pistacio.cds.$PBS_ARRAYID \
-db /lustre/haven/gamma/staton/library/uniprot/uniprot_trembl_plants_July_2018.fasta \
-out /lustre/haven/gamma/staton/projects/undergrads/pistacio/blast/trembl/pistacio_trembl.$PBS_ARRAYID.xml \
-evalue 1e-5 \
-outfmt 5
#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 201-400
#PBS -l nodes=1:ppn=2
#PBS -l walltime=15:00:00
cd $PBS_O_WORKDIR
module load blast
blastx \
-query /lustre/haven/gamma/staton/projects/undergrads/pistacio/cds_splits/pistacio.cds.$PBS_ARRAYID \
-db /lustre/haven/gamma/staton/library/uniprot/uniprot_trembl_plants_July_2018.fasta \
-out /lustre/haven/gamma/staton/projects/undergrads/pistacio/blast/trembl/pistacio_trembl.$PBS_ARRAYID.xml \
-evalue 1e-5 \
-outfmt 5
Ran max target seqs for swissprot
https://gist.github.com/almasaeed2010/702a7a79021be496bd6a5d79954395ea
Ran max target seqs for trembl. Same code as above for swissprot.
nohup python max_target_seqs.py *.xml &
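The filtering script itself lives in the gist above and isn't reproduced here. Purely as an illustration of the idea, keeping only the first <Hit> per query in a BLAST XML (outfmt 5) file could be done with something like:
awk '/<\/Iteration_hits>/ { nhit = 0 }    # reset the hit counter when a query ends
     /<Hit>/              { nhit++ }      # count hits for the current query
     nhit <= 1            { print }       # drop every hit after the first
' pistacio_swissprot.1.xml > pistacio_swissprot.1.top1.xml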
IPS
#PBS -N pistacio_ips
#PBS -A ACF-UTK0011
#PBS -S /bin/bash
#PBS -t 1-200
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l walltime=3:30:00
cd $PBS_O_WORKDIR
module load python3
/lustre/haven/gamma/staton/software/interproscan-5.34-73.0/interproscan.sh \
-i /lustre/haven/gamma/staton/projects/undergrads/pistacio/pep_splits/pistacio_noAst.pep.$PBS_ARRAYID \
-f XML \
-d /lustre/haven/gamma/staton/projects/undergrads/pistacio/ips/xmls \
--disable-precalc \
--iprlookup \
--goterms \
--pathways \
--tempdir /lustre/haven/gamma/staton/projects/undergrads/pistacio/ips/TMP \
> /lustre/haven/gamma/staton/projects/undergrads/pistacio/ips/TMP/$PBS_ARRAYID.out
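With a 200-task array, it's worth confirming that every chunk finished and wrote an XML before loading into Tripal. A quick check against the output directory used above (the expected count comes from the array range, not from anything verified in this issue):
ls /lustre/haven/gamma/staton/projects/undergrads/pistacio/ips/xmls/*.xml | wc -l   # expect 200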
New CDS FASTA LOADER https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/828377
New PEP FASTA LOADER https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/828379
New published polypeptides https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/828380
New BLAST Loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/828385
@mestato I want to make sure I am accurately describing our methods for pistacio. Also, should I change "we" to be more specific to us, or is it obvious? I have provided hyperlinks in the descriptions (refer to link).
"Due to size of Trinity Assembly we further processed the assembly using CD-HIT-EST tool following the authors methods in their paper. Our methods slightly differed from theirs as we processed our assembly at 95% identity level compared to 99% to reduce redundancy and remove identical fragments. The processed assembly we created did not need to be exposed to the CAP3 program, we instead ran max target seqs after running our BLAST and TrEMBL analysis."
Publication and Data Information
Refer here for the Pistacia transcriptome assembly; this issue pertains only to loading it onto the live site. https://github.com/mestato/statonlabprivate/wiki/Pistacia-vera-transcriptom-assembly-(Hardwood)
Additional Information
Sequence Read Archive (SRA) accession number: SRX1880621
The link is https://www.ncbi.nlm.nih.gov/sra/?term=SRX1880621
Checklist
See New Genome Documentation for detailed instructions.