statonlab / hardwoods_site

Hardwoods Genomics bugs, data loading, and general issues
GNU General Public License v3.0

Fagus sylvatica transcriptome #50

Open mestato opened 6 years ago

mestato commented 6 years ago

There are actually two public transcriptomes.

  1. "De novo transcriptome assembly and analysis of differential gene expression in response to drought in European beech". Raw sequence reads were submitted to the Sequence Read Archive (SRA) of NCBI under the accession number SRP100976. The de novo transcriptome assembly is available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.8k3q2.

  2. "A unigene set for European beech (Fagus sylvatica L.) and its use to decipher the molecular mechanisms involved in dormancy regulation". Sanger sequences were deposited in the NCBI database (NCBI Accession: LIBEST_026606 WZOAFSAA and LIBEST_026417 WZOAFSCA). Roche 454 raw sequence reads have also been submitted to the Sequence Read Archive of NCBI (Accession no. SRX105749, SRX105748, SRX105747, SRX105746 and SRX105663). Assembled contigs (n = 21 057) can be downloaded from: http://ngspipelines.toulouse.inra.fr:9000/.

Both have quite interesting associated samples for expression data. But our database isn't really built to support more than one transcript set for an organism, so I think we have to pick one... need to discuss.

bradfordcondon commented 6 years ago

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0184167

mestato commented 6 years ago

Genome just published! So there is no need to load these transcriptomes as independent analyses. Instead, after loading the genome (#298), these can be reanalyzed as expression data against the gene models.

CaseyRichards92 commented 5 years ago

There wasn't a KEGG annotation or upload. KEGG annotation: https://www.hardwoodgenomics.org/KEGGresults/2843415

RaymondS1 commented 5 years ago

https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP100976 This can tell us which sample from "De novo transcriptome assembly and analysis of differential gene expression in response to drought in European beech" was from which day.

The plant ID determines which samples are controls and which are treated. The following samples are control samples: 14/16/21/36/13

These samples are the stressed ones: 43/45/46/48/51
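
The control/stressed split above could be captured in a small lookup so that later scripts label samples consistently. This is only a sketch: the function name `condition_for_plant` is hypothetical, and it assumes the plant ID has already been pulled out of the SRA run metadata.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a plant ID to its treatment group,
# using the ID lists given in the comment above.
condition_for_plant() {
    case "$1" in
        13|14|16|21|36) echo "control" ;;
        43|45|46|48|51) echo "stressed" ;;
        *)              echo "unknown" ;;
    esac
}

# Example: label a few plant IDs (99 is not in either list).
for id in 14 45 99
do
    echo "plant $id: $(condition_for_plant "$id")"
done
```

Any ID outside the two lists falls through to "unknown" rather than being silently assigned to a group.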

RaymondS1 commented 5 years ago

FastQC

# Run FastQC on every raw FASTQ file, one output directory per sample.
for f in ../../raw_data/*.fastq
do
    filename=$(basename "$f")
    base="${filename%%.fastq*}"
    echo "filename $filename base $base"
    mkdir -p "$base.fastQC"

    /staton/software/FastQC-v0.11.5/FastQC/fastqc -o "$base.fastQC" "$f" >& "$base.fastQC.out"
done

Skewer

for R1 in /staton/projects/undergrads/fagus_sylvatica/raw_data/*_1.fastq
do
 # Derive the mate file and the sample name from the R1 path.
 R2="${R1%_1.fastq}_2.fastq"
 BASE=$(basename "$R1" _1.fastq)
 echo "R1 $R1"
 echo "R2 $R2"
 echo "BASE $BASE"

 # Trim adapters; each sample runs as a background job.
 /staton/software/skewer/skewer \
 -x /staton/software/Trimmomatic-0.38/adapters/all.fa \
 -l 30 \
 "$R1" \
 "$R2" \
 -o "$BASE" \
 >& "$BASE.trim_output" &

done

Indexing

/staton/software/STAR-2.6.1a/bin/Linux_x86_64/STAR \
 --runMode genomeGenerate \
 --genomeDir genomeDir \
 --genomeFastaFiles /staton/projects/undergrads/fagus_sylvatica/assembly/star/genomeDir/Fagus_sylvatica_genome.fasta \
 --sjdbGTFfile  /staton/projects/undergrads/fagus_sylvatica/assembly/star/genomeDir/Fagus_sylvatica_cds_v1.3.gff3 \
 --sjdbGTFtagExonParentTranscript Parent \
 --sjdbOverhang 100 &

Alignment (a variation of this)

cat /staton/projects/undergrads/fagus_sylvatica/assembly/star/file_list1.txt | while read -r line
do
 # Sample name is the listed file name without its .log extension.
 BASE=$(basename "$line" .log)
 echo "BASE $BASE"

 /staton/software/STAR-2.6.1a/STARlong \
  --genomeDir genomeDir \
  --readFilesIn "../skewer/$BASE-pair1.fastq" "../skewer/$BASE-pair2.fastq" \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix "$BASE." &

done

HTSeq

cat /staton/projects/undergrads/fagus_sylvatica/assembly/star/drought/.bam/bam_list1.sh | while read -r bam
do
 # Sample name is the BAM file name without its .sorted.bam suffix.
 base=$(basename "$bam" .sorted.bam)

 echo "bam $bam"
 echo "base $base"
 echo "--"

 # Count reads per gene; counts go to stdout, messages to stderr.
 /staton/software/htseq-count \
  --format=bam \
  --order=pos \
  --stranded=no \
  --type=gene \
  --idattr=ID \
  "/staton/projects/undergrads/fagus_sylvatica/assembly/star/drought/.bam/$bam" \
  /staton/projects/undergrads/fagus_sylvatica/assembly/star/genomeDir/Fagus_sylvatica_genome.fasta.gff3 \
  > "$base.counts.txt" \
  2> "$base.out" &

 echo "-------"

done

RaymondS1 commented 5 years ago

Biosample Upload https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809141

Biosample Published https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809142

Uploaded Biomaterial data sample https://hardwoodgenomics.org/biologicalsample/3500217?tripal_pane=group_summary_tripalpane

RaymondS1 commented 5 years ago

Biosamples published https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809134

Fagus sylvatica Expression analysis https://hardwoodgenomics.org/Analysis/3500240

Normalization Upload https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809922

RaymondS1 commented 5 years ago

@almasaeed2010 The expression heatmap gives the error "There is no expression data available for the features you entered." I followed this wiki https://tripal-devseed.readthedocs.io/en/latest/loading_expression_data.html to upload it and used transcript for the sequence type.

The link to the normalization upload is up above.

almasaeed2010 commented 5 years ago

This is the error that caused the failure. Unfortunately, Tripal didn't report that an error happened 😞

from European Beech for Fagus sylvatica Expression ERROR (TRIPAL_ANALYSIS_EXPRESSION): 

Could not copy /var/www/html/sites/default/files/sequences/fagus_sylvatica_transcriptome/tpm to public://expression/5d4c7313ac449_tpm.

[site http://default] [TRIPAL ERROR] [TRIPAL_ANALYSIS_EXPRESSION] 
Could not copy /var/www/html/sites/default/files/sequences/fagus_sylvatica_transcriptome/tpm to public://expression/5d4c7313ac449_tpm.

ERROR: Failed to cache file /var/www/html/sites/default/files/sequences/fagus_sylvatica_transcriptome/tpm

RaymondS1 commented 5 years ago

https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/812733 Normalization rerun

almasaeed2010 commented 5 years ago

The files seem to have a different transcript name than the fasta files.

[TRIPAL WARNING] [TRIPAL_ANALYSIS_EXPRESSION] The feature, scaffold4171_size32609, found in the expression file was not found in the Chado database.       Please ensure that the feature has been loaded into the database and that the feature name is both unique and correct.
[site http://default] [TRIPAL WARNING] [TRIPAL_ANALYSIS_EXPRESSION] Feature scaffold4171_size32609 not found

MattHuff commented 5 years ago

We figured out what was going on; STAR produces a GFF file containing only the information on where each scaffold begins and ends, and that was accidentally used instead of the correct GFF3 file. Ray is rerunning HTSeq using the correct GFF3 file, which contains the gene IDs.
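
A quick sanity check before running htseq-count could catch this kind of mix-up. The sketch below counts the records in a GFF file whose third column is `gene`, the feature type passed as `--type=gene` in the HTSeq step. The function name `count_gene_features` and the demo file are made up for illustration; they are not part of the pipeline used here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count GFF records whose feature type (column 3) is "gene".
count_gene_features() {
    awk -F'\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }' "$1"
}

# Demo on a tiny made-up GFF3 file (not real Fagus sylvatica data).
demo_gff=$(mktemp)
printf '##gff-version 3\n' > "$demo_gff"
printf 'scaffold1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1\n' >> "$demo_gff"
printf 'scaffold1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=gene1.1;Parent=gene1\n' >> "$demo_gff"

n=$(count_gene_features "$demo_gff")
if [ "$n" -eq 0 ]
then
    echo "no gene features found; htseq-count --type=gene would have nothing to count" >&2
else
    echo "found $n gene feature(s)"
fi
rm -f "$demo_gff"
```

A file that only lists scaffold start and end positions would report zero gene features here, which is the signal to stop and look for the right annotation file.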

RaymondS1 commented 5 years ago

https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/813904 Normalization re-upload

RaymondS1 commented 5 years ago

Write a regex that adds 01 at the end.

almasaeed2010 commented 5 years ago

I have a solution that will add the 01 to each gene name using the matrix.php script. @RaymondS1 remind me to fix the script the next time you are working please.
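
For reference, the same renaming could also be sketched outside of matrix.php with awk. This is an illustration under stated assumptions, not the fix that was applied: it assumes a tab-separated matrix with gene names in the first column and a header row that should stay unchanged, and the helper name `append_suffix` is hypothetical. The suffix is passed as an argument because the thread only says "01", without specifying any separator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a suffix to the first column of every
# data row in a tab-separated matrix, leaving the header row unchanged.
append_suffix() {
    awk -v suffix="$2" 'BEGIN { FS = OFS = "\t" }
        NR == 1 { print; next }   # header row: sample names
        { $1 = $1 suffix; print }' "$1"
}

# Demo on a made-up two-gene matrix (gene names and values are invented).
demo=$(mktemp)
printf 'gene\tsampleA\tsampleB\n' > "$demo"
printf 'geneX\t1.5\t2.0\n' >> "$demo"
printf 'geneY\t0.0\t3.2\n' >> "$demo"
append_suffix "$demo" "01"
rm -f "$demo"
```

The output goes to stdout, so the original matrix is never modified in place.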

RaymondS1 commented 5 years ago

https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/820426 Normalization re-run

RaymondS1 commented 5 years ago

Expression Uploaded https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/825751

Edit: Upload error

Failed to cache file /var/www/html/sites/default/files/sequences/fagus_sylvatica_transcriptome/tpm/tpm/results.tsv

@almasaeed2010

almasaeed2010 commented 5 years ago

This is probably caused by a permissions issue. @jwest60 since you developed the caching feature for expression files, could you take a look at the code to see what might be causing this? The error only shows the file that is supposed to be cached but doesn't say where it is being copied to. Try to find out where that folder is and let me know. I can attempt to fix the permissions from there.

jwest60 commented 5 years ago

The files should be getting cached in sites/default/files/expression.
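
A rough way to test the permissions theory is to check whether that directory exists and is writable. The sketch below assumes the Drupal root is /var/www/html, as in the paths from the error messages earlier in this thread. The helper name `check_writable_dir` is hypothetical, and the result is only meaningful when the script runs as the same user as the web server or the Tripal job runner.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a directory exists and is writable
# by the user running this script.
check_writable_dir() {
    if [ ! -d "$1" ]
    then
        echo "missing"
    elif [ -w "$1" ]
    then
        echo "writable"
    else
        echo "not writable"
    fi
}

# Assumed location: the /var/www/html prefix is taken from the error
# messages above; adjust it if the site lives elsewhere.
cache_dir="/var/www/html/sites/default/files/expression"
echo "$cache_dir: $(check_writable_dir "$cache_dir")"
```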

almasaeed2010 commented 5 years ago

Expression shows up now.

https://hardwoodgenomics.org/bio_data/2476482?tripal_pane=group_expression

RaymondS1 commented 5 years ago

https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/830308 Biosample reupload

https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/831063 Expression reupload

RaymondS1 commented 5 years ago

@cricha59 Can you look over this organism so I can go ahead and close it?

RaymondS1 commented 4 years ago

P-values upload https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/846219