statonlab / hardwoods_site

Hardwoods Genomics bugs, data loading, and general issues
GNU General Public License v3.0

Hazelnut (Corylus avellana) genome #480

Closed mestato closed 4 years ago

mestato commented 5 years ago

Publication and Data Information

https://www.biorxiv.org/content/10.1101/469015v1

Additional Information

Checklist

See New Genome Documentation for detailed instructions.

patricksis commented 5 years ago

Couldn't find any peptide files at http://hazelnut.data.mocklerlab.org/

bradfordcondon commented 5 years ago

@patricksis http://hazelnut.data.mocklerlab.org/C.avellana_transcriptome_Jefferson_ORFs_final.fasta?__sw_csrfToken=m0gHTdSFmHjjw669LSmonvbomgoq7Lqg

patricksis commented 5 years ago

Dev Site
Organism page: https://hardwoods.ag.utk.edu/organism/Corylus/avellana
Publication: https://hardwoods.ag.utk.edu/Publication/2803350
Reference Genome: https://hardwoods.ag.utk.edu/Genome-assembly/2803351
InterProScan annotation: https://hardwoods.ag.utk.edu/InterProScan-annotation/2803352
SwissProt annotation: https://hardwoods.ag.utk.edu/BLAST-annotation/2803353
Trembl annotation: https://hardwoods.ag.utk.edu/BLAST-annotation/2803354
Loader job for CDS: https://hardwoods.ag.utk.edu/admin/tripal/tripal_jobs/view/534625

patricksis commented 5 years ago

Most likely need a regular expression for the peptide file: its headers carry extra fields, e.g. >Corav.1 -3 834 2144, whereas the CDS headers have nothing after the ID (>Corav.1).

patricksis commented 5 years ago

Peptides loader job: https://hardwoods.ag.utk.edu/admin/tripal/tripal_jobs/view/505954

Regex used >(Corav\.\d+)?
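
As a rough illustration of what that pattern does to the two header styles, here is a minimal Python sketch. It assumes the loader applies the pattern to the full header line (including the leading ">") and uses capture group 1 as the feature name; that behaviour is an assumption, not something confirmed in this thread. The trailing "?" is left out because it would only make the group optional. The `feature_id` helper is hypothetical.

```python
import re
from typing import Optional

# Assumption: the pattern is matched against the whole header line and
# capture group 1 becomes the feature's unique name.
HEADER_ID = re.compile(r">(Corav\.\d+)")

def feature_id(header: str) -> Optional[str]:
    """Return the Corav ID from a FASTA header line, or None if absent."""
    match = HEADER_ID.match(header)
    return match.group(1) if match else None

# Headers quoted earlier in this thread.
print(feature_id(">Corav.1 -3 834 2144"))  # peptide-style header
print(feature_id(">Corav.1"))              # CDS-style header
```

Both calls yield the same ID, which is what lets the peptide records be matched to their CDS parents by name.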

patricksis commented 5 years ago

Publish tripal content: https://hardwoods.ag.utk.edu/admin/tripal/tripal_jobs/view/505955

patricksis commented 5 years ago

The webpage currently hosting the GFF file is down for some reason. Will check back later.

patricksis commented 5 years ago

ACF trembl


#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 1-200
#PBS -l nodes=1:ppn=2
#PBS -l walltime=12:00:00

cd $PBS_O_WORKDIR

module load blast

blastx \
 -query /lustre/haven/gamma/staton/projects/undergrads/hazelnut/raw_data/BLAST_split/C_avellana_cds.fasta.$PBS_ARRAYID \
 -db /lustre/haven/gamma/staton/library/uniprot/uniprot_trembl_plants_July_2018.fasta \
 -out /lustre/haven/gamma/staton/projects/undergrads/hazelnut/BLAST/trembl/C_avellana_trembl_$PBS_ARRAYID.xml \
 -evalue 1e-5 \
 -outfmt 5

patricksis commented 5 years ago

Swissprot

#PBS -S /bin/bash
#PBS -j oe
#PBS -A ACF-UTK0011
#PBS -t 1-200
#PBS -l nodes=1:ppn=2
#PBS -l walltime=05:00:00

cd $PBS_O_WORKDIR

module load blast

blastx \
 -query /lustre/haven/gamma/staton/projects/undergrads/hazelnut/raw_data/BLAST_split/C_avellana_cds.fasta.$PBS_ARRAYID \
 -db /lustre/haven/gamma/staton/library/uniprot/uniprot_sprot.fasta \
 -out /lustre/haven/gamma/staton/projects/undergrads/hazelnut/BLAST/swissprot/C_avellana_sprot.$PBS_ARRAYID.xml \
 -evalue 1e-5 \
 -outfmt 5
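
Both BLAST scripts read numbered chunks such as C_avellana_cds.fasta.$PBS_ARRAYID, one per array task (1-200). The thread does not say how those chunks were produced, so the following is only a sketch of one way to do it: records are dealt round-robin into N files named after the source file plus a numeric suffix. The `split_fasta` function and its arguments are hypothetical.

```python
from pathlib import Path
from typing import List

def split_fasta(src: str, out_dir: str, n_chunks: int = 200) -> List[Path]:
    """Deal FASTA records round-robin into files <src name>.1 .. <src name>.N."""
    source = Path(src)
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = [target / f"{source.name}.{i}" for i in range(1, n_chunks + 1)]
    handles = [p.open("w") for p in paths]
    try:
        record = -1  # index of the current record; -1 until the first header
        with source.open() as fasta:
            for line in fasta:
                if line.startswith(">"):
                    record += 1
                if record >= 0:  # ignore anything before the first header
                    handles[record % n_chunks].write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths

# Example (hypothetical local paths):
# split_fasta("C_avellana_cds.fasta", "BLAST_split", n_chunks=200)
```

Round-robin assignment keeps the chunks close to equal in record count, which helps each array task finish within the same walltime.
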
patricksis commented 5 years ago

IPS (InterProScan)

#PBS -A ACF-UTK0011
#PBS -S /bin/bash
#PBS -t 1-200
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l walltime=4:30:00

cd $PBS_O_WORKDIR

/lustre/haven/gamma/staton/software/interproscan-5.28-67.0/interproscan.sh \
 -i /lustre/haven/gamma/staton/projects/undergrads/hazelnut/raw_data/ips_split/C_avellana_peptides.fasta.$PBS_ARRAYID \
 -f XML \
 -d /lustre/haven/gamma/staton/projects/undergrads/hazelnut/ips/xmls \
 --disable-precalc \
 --iprlookup \
 --goterms \
 --pathways \
 --tempdir /lustre/haven/gamma/staton/projects/undergrads/hazelnut/ips/TMP \
 > /lustre/haven/gamma/staton/projects/undergrads/hazelnut/ips/TMP/$PBS_ARRAYID.out
patricksis commented 5 years ago

Trembl xml import job: https://hardwoods.ag.utk.edu/admin/tripal/tripal_jobs/view/507088

patricksis commented 5 years ago

Swissprot xml import job: https://hardwoods.ag.utk.edu/admin/tripal/tripal_jobs/view/507089

patricksis commented 5 years ago

IPS xml import: https://hardwoods.ag.utk.edu/admin/tripal/tripal_jobs/view/522388

patricksis commented 5 years ago

@almasaeed2010 I can't seem to get the IPS results to show up. Could you take a look at it for me when you have the time?

almasaeed2010 commented 5 years ago

@patricksis I cleaned up the polypeptides file for you so I think it should work now if you rerun IPS on ACF. Here is the file you should use on the dev server:

/var/www/html/sites/default/files/sequences/hazelnut/raw_files/C_avellana_peptides_matt.clean.fasta 
patricksis commented 5 years ago

@almasaeed2010 thank you

almasaeed2010 commented 5 years ago

The deletion job is running here: https://hardwoods.ag.utk.edu/admin/tripal/tripal_jobs/view/534619

Once the job is done, you can start re-submitting the data

patricksis commented 5 years ago

@mestato The webpage hosting the files has been down for weeks. I need access to the original polypeptide file as well as the GFF file. I attempted to contact them, but I haven't received a response. Website: https://www.cavellanagenomeportal.com/

patricksis commented 5 years ago

@almasaeed2010 The page hosting the peptide file is back up! The original peptide file is located here:

/var/www/html/sites/default/files/sequences/hazelnut/raw_files/C.avellana_transcriptome_Jefferson_ORFs_final.fasta

patricksis commented 5 years ago

Re-upload:
cds fasta: https://hardwoods.ag.utk.edu/admin/tripal/tripal_jobs/view/542817
peptide: https://hardwoods.ag.utk.edu/admin/tripal/tripal_jobs/view/542818

almasaeed2010 commented 5 years ago

Here is the error we get when loading the peptides:

Cannot find a unique feature for the parent 'Corav.471' of type 'mRNA' for the feature.
[site http://default] [TRIPAL ERROR] [TRIPAL_JOB] Cannot find a unique feature for the parent 'Corav.471' of type 'mRNA' for the feature.
Cannot find a unique feature for the parent 'Corav.471' of type 'mRNA' for the feature.
[site http://default] [TRIPAL ERROR] [TRIPAL_JOB] Cannot find a unique feature for the parent 'Corav.471' of type 'mRNA' for the feature.
Cannot find a unique feature for the parent 'Corav.1774' of type 'mRNA' for the feature.
[site http://default] [TRIPAL ERROR] [TRIPAL_JOB] Cannot find a unique feature for the parent 'Corav.1774' of type 'mRNA' for the feature.

I think the CDS file is malformed.

Here is what shows up when we grep 471 in the CDS file.

$ cat C_avellana_cds.fasta | grep "\.471"
caccagctctgcaagaacccaaggcc.471
>Corav.4710
>Corav.4711
>Corav.4712
>Corav.4713
>Corav.4714
>Corav.4715
>Corav.4716
>Corav.4717
>Corav.4718
>Corav.4719

Notice the first line where a sequence ends with the numbers!
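
A quick way to catch lines like that across the whole file is to flag any sequence line containing characters that are not nucleotide codes. This is a minimal sketch, assuming the CDS file should hold only IUPAC nucleotide letters on its sequence lines; the `malformed_lines` helper is hypothetical, and the "ATGGCC" sequence in the sample is a placeholder.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: CDS sequence lines contain only IUPAC nucleotide codes.
VALID_SEQ = re.compile(r"^[ACGTUNRYKMSWBDHV]*$", re.IGNORECASE)

def malformed_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, text) for sequence lines with unexpected characters."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            continue  # header lines are not checked here
        if not VALID_SEQ.match(line):
            bad.append((number, line))
    return bad

sample = [
    "caccagctctgcaagaacccaaggcc.471\n",  # broken line from the grep output
    ">Corav.4710\n",
    "ATGGCC\n",                          # placeholder sequence
]
print(malformed_lines(sample))  # [(1, 'caccagctctgcaagaacccaaggcc.471')]
```

Passing an open file handle instead of `sample` scans the real file line by line.
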

Trying the same on the peptides file:

$ cat C.avellana_transcriptome_Jefferson_ORFs_final.fasta | grep "\.471"
>Corav.471  -1  148 762
>Corav.4710 -1  1   87
>Corav.4711 -2  2   175
>Corav.4712 -3  39  431
>Corav.4713 +1  295 1287
>Corav.4714 -1  13  1230
>Corav.4715 +2  149 1225
>Corav.4716 -1  1   261
>Corav.4717 -2  407 568
>Corav.4718 -1  10  243
>Corav.4719 -3  117 1013

In the peptides file, we find the feature as expected.

patricksis commented 5 years ago

I think I might have found a working file @almasaeed2010

$ cat C.avellana_transcriptome_Jefferson_CDS.fasta | grep "\.471"
>Corav.471
>Corav.4710
>Corav.4711
>Corav.4712
>Corav.4713
>Corav.4714
>Corav.4715
>Corav.4716
>Corav.4717
>Corav.4718
>Corav.4719

I think this may have been the original file we used, but we may have edited it.
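
Before reloading, it may be worth confirming that every peptide ID has a matching CDS header, since a missing one is what triggers the "Cannot find a unique feature for the parent" error. The sketch below assumes the ID is the first whitespace-delimited token of each header, which matches the headers shown in this thread; `fasta_ids` and `missing_parents` are hypothetical helper names.

```python
from typing import Iterable, Set

def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect the first whitespace-delimited token of each FASTA header."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids

def missing_parents(peptide_lines: Iterable[str], cds_lines: Iterable[str]) -> Set[str]:
    """Return peptide IDs that have no header in the CDS file."""
    return fasta_ids(peptide_lines) - fasta_ids(cds_lines)

# Example with the file names mentioned in this thread:
# with open("C.avellana_transcriptome_Jefferson_ORFs_final.fasta") as pep, \
#      open("C.avellana_transcriptome_Jefferson_CDS.fasta") as cds:
#     print(sorted(missing_parents(pep, cds)))
```

An empty result means every peptide has a candidate parent by name. Unlike grep "\.471", the comparison is exact, so Corav.471 is not confused with Corav.4710.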

almasaeed2010 commented 5 years ago

looks fixed to me 👍

Try reloading it.

patricksis commented 5 years ago

Going to try and load this organism to the live site using a different cds file.

Organism: https://www.hardwoodgenomics.org/organism/Corylus/avellana
Publication: https://www.hardwoodgenomics.org/Publication/3472009
Reference Genome: https://www.hardwoodgenomics.org/Genome-assembly/3472010
InterProScan annotation: https://www.hardwoodgenomics.org/InterProScan-annotation/3472011
SwissProt annotation: https://www.hardwoodgenomics.org/BLAST-annotation/3472012
Trembl annotation: https://www.hardwoodgenomics.org/BLAST-annotation/3472013
Chado cds loader: https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/800435
Chado peptide loader: https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/800439
Publish tripal content: https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/800446
KEGG annotation: https://www.hardwoodgenomics.org/KEGGresults/3500181
KEGG loader: https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/800696
Trembl loader: https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/806726
Swissprot loader: https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/806730
IPS loader: https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/806731

patricksis commented 5 years ago

blast db (cds): https://www.hardwoodgenomics.org/content/corylus-avellana-transcripts
blast db (peptides): https://www.hardwoodgenomics.org/content/corylus-avellana-peptides

patricksis commented 5 years ago

No Gene Ontology/KEGG browser on organism page

patricksis commented 5 years ago

@almasaeed2010 Any reason the Gene Ontology/KEGG browser isn't showing up?

almasaeed2010 commented 5 years ago

Probably needs reindexing. I'll run it now. May take a few hours though.

patricksis commented 5 years ago

Thanks. It didn't look like a problem with the CDS/peptide files to me, so I wasn't too sure.

patricksis commented 5 years ago

JBrowse instance: https://www.hardwoodgenomics.org/admin/tripal/tripal_jobs/view/807913

patricksis commented 5 years ago

JBrowse link has also been added. This organism should be just about done.

patricksis commented 5 years ago

blast db (scaffolds): https://www.hardwoodgenomics.org/content/corylus-avellana-scaffolds

patricksis commented 5 years ago

@cricha59 @RaymondS1 Can either of you go over this so I can close the issue?

CaseyRichards92 commented 4 years ago

@patricksis only took me 6 months but everything is there. You can close.

patricksis commented 4 years ago

Lol thanks @cricha59. Closing.