statonlab / hardwoods_site

Hardwoods Genomics bugs, data loading, and general issues
GNU General Public License v3.0

Pistacia vera transcriptome assembly #528

Open CaseyRichards92 opened 5 years ago

CaseyRichards92 commented 5 years ago

Publication and Data Information

Refer here for Pistacia transcriptome assembly. This issue will pertain only to the loading to the live site. https://github.com/mestato/statonlabprivate/wiki/Pistacia-vera-transcriptom-assembly-(Hardwood)

Additional Information

The Sequence Read Archive (SRA) accession number is SRX1880621.

The link is https://www.ncbi.nlm.nih.gov/sra/?term=SRX1880621

Checklist

See New Genome Documentation for detailed instructions.

CaseyRichards92 commented 5 years ago

Live Site

Organism created https://www.hardwoodgenomics.org/organism/Pistacia/Vera
Publication https://www.hardwoodgenomics.org/Publication/3472001
Transcriptome assembly https://www.hardwoodgenomics.org/Transcriptome-assembly/3472002?tripal_pane=group_summary_tripalpane w/ Downloads
Swissprot Annotation https://www.hardwoodgenomics.org/BLAST-annotation/3472003
Trembl Annotation https://www.hardwoodgenomics.org/BLAST-annotation/3472004
IPS Annotation https://www.hardwoodgenomics.org/InterProScan-annotation/3472005
KEGG Annotation https://www.hardwoodgenomics.org/KEGGresults/3472006
CDS FASTA Loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/793452
PEPTIDE FASTA Loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/793465
KEGG FASTA Loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/793456
Swissprot xml loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/800741
Published mRNA-polypeptide https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/793466

CaseyRichards92 commented 5 years ago

@almasaeed2010 as discussed in the meeting here is the swissprot error I am receiving https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/793726

almasaeed2010 commented 5 years ago

looking into it

CaseyRichards92 commented 5 years ago

@almasaeed2010 you might want to wait for @mestato to reply if Ahmed did the right transcriptome. We might have to restart it.

almasaeed2010 commented 5 years ago

This is your error

Database query failed when searching for feature     [error]
'TRINITY_DN48859_c0_g1::TRINITY_DN48859_c0_g1_i2::g.62581::m.62581
TRINITY_DN48859_c0_g1::TRINITY_DN48859_c0_g1_i2::g.62581 ORF
type:3prime_partial len:286 (-) TRINITY_DN48859_c0_g1_i2:2-856(-)'.

This basically means that the names in your IPS files don't match the ones in the DB. It sounds like you didn't clean up the peptide files before running IPS.

We can wait for @mestato to confirm the files before rerunning though.
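
As a rough illustration of the kind of cleanup meant here, the sketch below trims each FASTA header to its first whitespace-delimited token. It assumes the features in the DB are stored under that first token and that each header is a single line; neither is confirmed in this thread. The function names `clean_header` and `clean_fasta` are made up for this example.

```python
def clean_header(header: str) -> str:
    """Keep only the first whitespace-delimited token of a FASTA header line."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: " + header)
    fields = header[1:].split()
    return ">" + fields[0] if fields else ">"


def clean_fasta(lines):
    """Yield FASTA lines with each header cut down to its sequence ID."""
    for line in lines:
        line = line.rstrip("\n")
        yield clean_header(line) if line.startswith(">") else line


# Header taken from the error message above; the sequence is a placeholder.
example = [
    ">TRINITY_DN48859_c0_g1::TRINITY_DN48859_c0_g1_i2::g.62581::m.62581 "
    "TRINITY_DN48859_c0_g1::TRINITY_DN48859_c0_g1_i2::g.62581 ORF "
    "type:3prime_partial len:286 (-) TRINITY_DN48859_c0_g1_i2:2-856(-)",
    "MKV",
]
cleaned = list(clean_fasta(example))
print(cleaned[0])
```

Running the cleaned peptides through the annotation tools should then give result files whose names are short IDs rather than full TransDecoder description lines.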

CaseyRichards92 commented 5 years ago

@almasaeed2010 I checked with Meg and this transcriptome is good so we will be using this data

almasaeed2010 commented 5 years ago

/remind me to check Blast files today at 2PM EST

reminders[bot] commented 5 years ago

@almasaeed2010 set a reminder for Jul 17th 2019

almasaeed2010 commented 5 years ago

Looks like the BLAST loader allows for a regular expression to extract the name. This makes our job slightly easier. Can you try loading swissprot with the following expression?

(.*?)

CaseyRichards92 commented 5 years ago

Blast re-ran with reg-ex https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/800051
Re-ran BLAST w/ reg-exp ^(.*?) T https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/800700

reminders[bot] commented 5 years ago

:wave: @almasaeed2010, check Blast files

CaseyRichards92 commented 5 years ago

@almasaeed2010 the Reg-exp did not work for swissprot

almasaeed2010 commented 5 years ago

try this expression ^(.*?) T
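
To see why the anchored expression behaves differently from the bare `(.*?)`, here is a small check using Python's `re` module on the header from the error message. Tripal itself runs in PHP, so this is only an illustration under the assumption that the two regex engines treat these simple patterns the same way.

```python
import re

# Header text as it appears in the loader error (joined onto one line).
header = (
    "TRINITY_DN48859_c0_g1::TRINITY_DN48859_c0_g1_i2::g.62581::m.62581 "
    "TRINITY_DN48859_c0_g1::TRINITY_DN48859_c0_g1_i2::g.62581 ORF "
    "type:3prime_partial len:286 (-) TRINITY_DN48859_c0_g1_i2:2-856(-)"
)

# A lazy group with nothing after it is satisfied by the empty string.
loose = re.search(r"(.*?)", header).group(1)

# Anchoring at the start and requiring " T" forces the group to grow
# until the first space that is followed by "T" (the repeated TRINITY id).
anchored = re.search(r"^(.*?) T", header).group(1)

print(repr(loose))
print(anchored)
```

The lazy group on its own captures nothing, while the anchored form captures everything before the first space, which is the short feature name.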

almasaeed2010 commented 5 years ago

@cricha59 I edited the files directly. There are now two sets of files: ones with a .old extension and ones with a .xml extension. If you rerun the job without any regular expressions, it should work.

almasaeed2010 commented 5 years ago

Looks like the organism doesn't have features loaded. Try using the .clean CDS file to load the sequences then try reloading blast.

CaseyRichards92 commented 5 years ago

New CDS FASTA w/ c.clean cds https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809705

CaseyRichards92 commented 5 years ago

New Blast Job with new xmls https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809706

CaseyRichards92 commented 5 years ago

New Peptide loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809707

almasaeed2010 commented 5 years ago

We had to cancel the jobs for the CDS, Blast and Peptides because they deadlocked.

Technical note: It looks like Tripal was unable to handle the large CDS file, possibly because the load runs in a transaction. We might have to look into this further before resubmitting the job.

CaseyRichards92 commented 5 years ago

Find a better Trinity assembly and collapse into a smaller size

CaseyRichards92 commented 5 years ago

Author emailed for their Trinity assembly

CaseyRichards92 commented 5 years ago

Moving forward with CAP3 or cd-hit-est.

CaseyRichards92 commented 5 years ago

Use Trinity assembly for cd-hit-est

CaseyRichards92 commented 5 years ago

Followed the cd-hit-est guide to generate a filtered FASTA. http://www.bioinformatics.org/cd-hit/cd-hit-user-guide.pdf

/staton/software/cd-hit-v4.6.6-2016-0711/cd-hit-est -i Trinity.fasta -o Trinity_cd.fasta -c 0.95 -n 8
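
One quick way to check how much the clustering shrank the assembly is to count records in the input and output FASTA files. The sketch below is a minimal helper written for this note (the function names are not from any tool used here); it only assumes standard FASTA with `>` header lines.

```python
def fasta_stats(lines):
    """Return (number of records, total residues) for FASTA-formatted lines."""
    records = 0
    residues = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records += 1
        else:
            residues += len(line)
    return records, residues


def fasta_stats_from_file(path):
    """Convenience wrapper that reads the FASTA file at the given path."""
    with open(path) as handle:
        return fasta_stats(handle)


# Tiny in-memory example; real use would call fasta_stats_from_file()
# on Trinity.fasta and Trinity_cd.fasta and compare the two results.
demo = [">a", "ACGT", "AC", ">b", "GG"]
print(fasta_stats(demo))
```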

CaseyRichards92 commented 5 years ago

Transdecoder

/staton/software/TransDecoder-3.0.0/TransDecoder.LongOrfs -t /staton/projects/aelshaar/assembly/4_Trinity/Trinity_cd.fasta

CaseyRichards92 commented 5 years ago

Run blast then run max_target_seqs

https://gist.github.com/almasaeed2010/702a7a79021be496bd6a5d79954395ea

CaseyRichards92 commented 5 years ago

Blast refinement

blastp \
 -query Trinity.fasta.transdecoder_dir/longest_orfs.pep \
 -db /staton/libraries/uniprot/uniprot_sprot.fasta \
 -max_target_seqs 1 \
 -outfmt 6 \
 -evalue 1e-5 \
 -num_threads 10 > blastp.outfmt6
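
For inspecting the resulting table, the sketch below reads the default 12-column `-outfmt 6` layout and keeps the highest-bitscore row per query. Even with `-max_target_seqs 1`, a query can have more than one row (for example several HSPs against the same subject), so a reduction like this can be handy. The sample IDs and scores are invented for the example.

```python
import csv
import io

# Default column order for BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Map each query ID to its highest-bitscore row from a tabular BLAST file."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up rows standing in for blastp.outfmt6.
sample = io.StringIO(
    "orf1\tsp|P1|A\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t150\n"
    "orf1\tsp|P2|B\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-40\t180\n"
    "orf2\tsp|P3|C\t70.0\t50\t15\t0\t1\t50\t1\t50\t1e-10\t60\n"
)
hits = best_hits(sample)
print(hits["orf1"]["sseqid"])
```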

Transdecoder rd 2

/staton/software/TransDecoder-3.0.0/TransDecoder.Predict -t Trinity_cd.fasta --retain_blastp_hits Trinity_cd.fasta.transdecoder_dir/blastp.outfmt6

CaseyRichards92 commented 5 years ago

Received error:

Apprixmated maximum memory consumption: 112M
writing new database
writing clustering information
program completed !

Total CPU time 1030.17
CMD: /staton/software/TransDecoder-3.0.0/util/get_top_longest_fasta_entries.pl Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.top_longest_5000.nr80 500 > Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.top_500_longest
CMD: touch Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.top_500_longest.ok
CMD: /staton/software/TransDecoder-3.0.0/util/seq_n_baseprobs_to_logliklihood_vals.pl Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.top_500_longest Trinity_cd.fasta.transdecoder_dir/base_freqs.dat > Trinity_cd.fasta.transdecoder_dir/hexamer.scores
CMD: touch Trinity_cd.fasta.transdecoder_dir/hexamer.scores.ok
CMD: /staton/software/TransDecoder-3.0.0/util/score_CDS_liklihood_all_6_frames.pl Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds Trinity_cd.fasta.transdecoder_dir/hexamer.scores > Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.scores
CMD: touch Trinity_cd.fasta.transdecoder_dir/longest_orfs.cds.scores.ok
Error, cannot find file blastp.outfmt6 at /staton/software/TransDecoder-3.0.0/TransDecoder.Predict line 382.

CaseyRichards92 commented 5 years ago

@almasaeed2010 any idea why transdecoder rd 2 didn't create the cds and pep files?

CaseyRichards92 commented 5 years ago

@almasaeed2010 disregard, the issue is fixed.

UPDATED TRANSDECODER RD 2

/staton/software/TransDecoder-3.0.0/TransDecoder.Predict -t Trinity_cd.fasta --retain_blastp_hits Trinity_cd.fasta.transdecoder_dir/blastp.outfmt6

CaseyRichards92 commented 5 years ago

BLAST

#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 1-200
#PBS -l nodes=1:ppn=2
#PBS -l walltime=06:00:00

cd $PBS_O_WORKDIR

module load blast

blastx \
 -query /lustre/haven/gamma/staton/projects/undergrads/pistacio/cds_splits/pistacio.cds.$PBS_ARRAYID \
 -db /lustre/haven/gamma/staton/library/uniprot/uniprot_sprot.fasta \
 -out /lustre/haven/gamma/staton/projects/undergrads/pistacio/blast/swissprot/pistacio_swissprot.$PBS_ARRAYID.xml \
 -evalue 1e-5 \
 -outfmt 5

#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 201-400
#PBS -l nodes=1:ppn=2
#PBS -l walltime=06:00:00

cd $PBS_O_WORKDIR

module load blast

blastx \
 -query /lustre/haven/gamma/staton/projects/undergrads/pistacio/cds_splits/pistacio.cds.$PBS_ARRAYID \
 -db /lustre/haven/gamma/staton/library/uniprot/uniprot_sprot.fasta \
 -out /lustre/haven/gamma/staton/projects/undergrads/pistacio/blast/swissprot/pistacio_swissprot.$PBS_ARRAYID.xml \
 -evalue 1e-5 \
 -outfmt 5
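
The array jobs above read inputs named `pistacio.cds.$PBS_ARRAYID`, with IDs running from 1 to 400. How those split files were produced is not shown in this thread; the following is one possible sketch that distributes FASTA records round-robin into numbered chunk files. The helper names and the round-robin choice are assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(records, n_chunks):
    """Distribute records round-robin into n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


def write_chunks(chunks, prefix):
    """Write chunk k (1-based, matching the array IDs) to '<prefix>.<k>'."""
    for k, chunk in enumerate(chunks, start=1):
        with open(f"{prefix}.{k}", "w") as out:
            for header, seq in chunk:
                out.write(f"{header}\n{seq}\n")


# In-memory demonstration; real use would pass an open FASTA file and
# call write_chunks(chunks, "pistacio.cds") with n_chunks=400.
demo = [">a", "AC", "GT", ">b", "GG", ">c", "TT"]
chunks = split_records(read_fasta(demo), 2)
print(chunks)
```

Round-robin assignment keeps the chunks close in record count, which helps the array tasks finish within a similar walltime.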

CaseyRichards92 commented 5 years ago

Trembl

#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 1-200
#PBS -l nodes=1:ppn=2
#PBS -l walltime=15:00:00

cd $PBS_O_WORKDIR

module load blast

blastx \
 -query /lustre/haven/gamma/staton/projects/undergrads/pistacio/cds_splits/pistacio.cds.$PBS_ARRAYID \
 -db /lustre/haven/gamma/staton/library/uniprot/uniprot_trembl_plants_July_2018.fasta \
 -out /lustre/haven/gamma/staton/projects/undergrads/pistacio/blast/trembl/pistacio_trembl.$PBS_ARRAYID.xml \
 -evalue 1e-5 \
 -outfmt 5

#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 201-400
#PBS -l nodes=1:ppn=2
#PBS -l walltime=15:00:00

cd $PBS_O_WORKDIR

module load blast

blastx \
 -query /lustre/haven/gamma/staton/projects/undergrads/pistacio/cds_splits/pistacio.cds.$PBS_ARRAYID \
 -db /lustre/haven/gamma/staton/library/uniprot/uniprot_trembl_plants_July_2018.fasta \
 -out /lustre/haven/gamma/staton/projects/undergrads/pistacio/blast/trembl/pistacio_trembl.$PBS_ARRAYID.xml \
 -evalue 1e-5 \
 -outfmt 5

CaseyRichards92 commented 5 years ago

Ran max target seqs for swissprot

https://gist.github.com/almasaeed2010/702a7a79021be496bd6a5d79954395ea

CaseyRichards92 commented 5 years ago

Ran max target seqs for trembl. Same code as above for swissprot.

nohup python max_target_seqs.py *.xml &
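
The gist is not reproduced here, so the following is only a guess at the general idea behind a "max target seqs" pass over BLAST XML (`-outfmt 5`): keep the first few `<Hit>` elements of every `<Iteration>` and drop the rest. It assumes hits are listed best-first within each iteration, and `keep_top_hits` is a name invented for this sketch, not a function from `max_target_seqs.py`.

```python
import xml.etree.ElementTree as ET


def keep_top_hits(root, max_hits=1):
    """Remove all but the first max_hits <Hit> elements in every <Iteration>."""
    removed = 0
    # Materialise the list first so the tree is not modified mid-traversal.
    for hits in list(root.iter("Iteration_hits")):
        for hit in list(hits)[max_hits:]:
            hits.remove(hit)
            removed += 1
    return removed


# Minimal stand-in for a BLAST XML file with one query and three hits.
xml_text = """<BlastOutput><BlastOutput_iterations>
<Iteration><Iteration_hits>
<Hit><Hit_num>1</Hit_num></Hit>
<Hit><Hit_num>2</Hit_num></Hit>
<Hit><Hit_num>3</Hit_num></Hit>
</Iteration_hits></Iteration>
</BlastOutput_iterations></BlastOutput>"""

root = ET.fromstring(xml_text)
removed = keep_top_hits(root, max_hits=1)
remaining = [h.findtext("Hit_num") for h in root.iter("Hit")]
print(removed, remaining)
```

Trimming the XML this way would shrink the files before they are handed to the loader, which matters when the full result set is large.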

CaseyRichards92 commented 5 years ago

IPS

#PBS -N pistacio_ips
#PBS -A ACF-UTK0011
#PBS -S /bin/bash
#PBS -t 1-200
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l walltime=3:30:00

cd $PBS_O_WORKDIR

module load python3

/lustre/haven/gamma/staton/software/interproscan-5.34-73.0/interproscan.sh \
 -i /lustre/haven/gamma/staton/projects/undergrads/pistacio/pep_splits/pistacio_noAst.pep.$PBS_ARRAYID \
 -f XML \
 -d /lustre/haven/gamma/staton/projects/undergrads/pistacio/ips/xmls \
 --disable-precalc \
 --iprlookup \
 --goterms \
 --pathways \
 --tempdir /lustre/haven/gamma/staton/projects/undergrads/pistacio/ips/TMP \
 > /lustre/haven/gamma/staton/projects/undergrads/pistacio/ips/TMP/$PBS_ARRAYID.out
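
The input name `pistacio_noAst.pep` suggests the peptide files had asterisks removed before the run; that is an assumption, since the thread does not say how the file was made. TransDecoder marks stop codons with `*`, and InterProScan is commonly reported to reject sequences containing that character. A minimal sketch of such a cleanup step, with a made-up function name, could look like this:

```python
def strip_stop_asterisks(lines):
    """Yield FASTA lines with '*' removed from sequence lines only."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # leave headers untouched
        else:
            yield line.replace("*", "")


# Placeholder peptides, not real pistachio sequences.
peptides = [">orf1 len:3 (+)", "MKV*", ">orf2", "MA*L*"]
cleaned_pep = list(strip_stop_asterisks(peptides))
print(cleaned_pep)
```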

CaseyRichards92 commented 5 years ago

New CDS FASTA LOADER https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/828377
New PEP FASTA LOADER https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/828379
New published polypeptides https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/828380
New BLAST Loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/828385

CaseyRichards92 commented 5 years ago

New Trembl job https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/828672

CaseyRichards92 commented 5 years ago

IPS Loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/828676

CaseyRichards92 commented 5 years ago

KEGG Loader https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/828677

CaseyRichards92 commented 5 years ago

BLAST DB CDS https://www.hardwoodgenomics.org/content/pistacia-vera-transcripts
BLAST DB PEP https://www.hardwoodgenomics.org/content/pistacia-vera-peptides

CaseyRichards92 commented 5 years ago

@mestato I want to make sure I am accurately describing our methods for pistacio. Also, should I change "we" to be more specific to us, or is it obvious? I have provided hyperlinks in the descriptions (refer to the link).

https://www.hardwoodgenomics.org/Transcriptome-assembly/3472002?tripal_pane=group_description_download

"Due to the size of the Trinity assembly, we further processed it with the CD-HIT-EST tool, following the authors' methods in their paper. Our methods differed slightly from theirs: we processed our assembly at a 95% identity level rather than 99% to reduce redundancy and remove identical fragments. The processed assembly did not need to be run through the CAP3 program; instead, we ran max target seqs after running our BLAST and TrEMBL analysis."