Open mestato opened 6 years ago
The genome was just published! So there is no need to load these transcriptomes as independent analyses. Instead, after loading the genome (#298), they can be reanalyzed as expression data against the gene models.
There wasn't a KEGG annotation or upload. KEGG annotation: https://www.hardwoodgenomics.org/KEGGresults/2843415
https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP100976 This can tell us which day each sample from "De novo transcriptome assembly and analysis of differential gene expression in response to drought in European beech" came from.
The plant ID determines which samples are controls and which are treated. The following samples are control samples: 14/16/21/36/13
These samples are the stressed ones: 43/45/46/48/51
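If it helps when building the sample sheet, the two lists above can be written down as a small lookup. This is only a sketch: the function name `condition_for_plant` is made up here, and how plant IDs map to SRA runs is not covered.

```shell
# Map a plant ID to its treatment group, using the two lists above.
# condition_for_plant is a hypothetical helper name, not part of any pipeline.
condition_for_plant() {
    case "$1" in
        13|14|16|21|36) echo "control" ;;
        43|45|46|48|51) echo "stressed" ;;
        *) echo "unknown" ;;   # any plant ID not named in either list
    esac
}
```

For example, `condition_for_plant 45` prints `stressed`.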
Fastqc
for f in ../../raw_data/*.fastq
do
filename=$(basename "$f")
base="${filename%%.fastq*}"
echo "filename $filename base $base"
mkdir -p "$base.fastQC"
/staton/software/FastQC-v0.11.5/FastQC/fastqc -o "$base.fastQC" "$f" >& "$base.fastQC.out"
done
Skewer
for R1 in /staton/projects/undergrads/fagus_sylvatica/raw_data/*_1.fastq
do
# derive the mate file and the sample prefix from the R1 file name
R2="${R1%_1.fastq}_2.fastq"
BASE=$(basename "$R1" _1.fastq)
echo "R1 $R1"
echo "R2 $R2"
echo "BASE $BASE"
/staton/software/skewer/skewer \
-x /staton/software/Trimmomatic-0.38/adapters/all.fa \
-l 30 \
$R1 \
$R2 \
-o $BASE \
>& $BASE.trim_output &
done
Indexing
/staton/software/STAR-2.6.1a/bin/Linux_x86_64/STAR \
--runMode genomeGenerate \
--genomeDir genomeDir \
--genomeFastaFiles /staton/projects/undergrads/fagus_sylvatica/assembly/star/genomeDir/Fagus_sylvatica_genome.fasta \
--sjdbGTFfile /staton/projects/undergrads/fagus_sylvatica/assembly/star/genomeDir/Fagus_sylvatica_cds_v1.3.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 100 &
Alignment (A variation of)
cat /staton/projects/undergrads/fagus_sylvatica/assembly/star/file_list1.txt | while read -r line
do
BASE=$( basename "$line" | sed 's/\.log$//')
echo "BASE $BASE"
/staton/software/STAR-2.6.1a/STARlong \
--genomeDir genomeDir \
--readFilesIn ../skewer/$BASE-pair1.fastq ../skewer/$BASE-pair2.fastq \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix $BASE. &
done
HTSeq
cat /staton/projects/undergrads/fagus_sylvatica/assembly/star/drought/.bam/bam_list1.sh | while read -r bam
do
base=$( basename "$bam" | sed 's/\.sorted\.bam$//')
echo "bam $bam"
echo "base $base"
echo "--"
/staton/software/htseq-count \
--format=bam \
--order=pos \
--stranded=no \
--type=gene \
--idattr=ID \
/staton/projects/undergrads/fagus_sylvatica/assembly/star/drought/.bam/$bam \
/staton/projects/undergrads/fagus_sylvatica/assembly/star/genomeDir/Fagus_sylvatica_genome.fasta.gff3 \
>$base.counts.txt \
2> $base.out &
echo "-------"
done
Biosample Upload https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809141
Biosample Published https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809142
Uploaded Biomaterial data sample https://hardwoodgenomics.org/biologicalsample/3500217?tripal_pane=group_summary_tripalpane
Biosamples published https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809134
Fagus sylvatica Expression analysis https://hardwoodgenomics.org/Analysis/3500240
Normalization Upload https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/809922
@almasaeed2010 The expression heatmap gives the error "There is no expression data available for the features you entered."
I followed this wiki, https://tripal-devseed.readthedocs.io/en/latest/loading_expression_data.html, to upload it and used transcript for the sequence type.
The link to the normalization upload is above.
This is the error that caused the failure. Unfortunately, tripal didn't report that an error happened 😞
from European Beech for Fagus sylvatica Expression ERROR (TRIPAL_ANALYSIS_EXPRESSION):
Could not copy /var/www/html/sites/default/files/sequences/fagus_sylvatica_transcriptome/tpm to public://expression/5d4c7313ac449_tpm.
[site http://default] [TRIPAL ERROR] [TRIPAL_ANALYSIS_EXPRESSION]
Could not copy /var/www/html/sites/default/files/sequences/fagus_sylvatica_transcriptome/tpm to public://expression/5d4c7313ac449_tpm.
ERROR: Failed to cache file /var/www/html/sites/default/files/sequences/fagus_sylvatica_transcriptome/tpm
https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/812733 Normalization rerun
The files seem to have different transcript names than the FASTA files.
[TRIPAL WARNING] [TRIPAL_ANALYSIS_EXPRESSION] The feature, scaffold4171_size32609, found in the expression file was not found in the Chado database. Please ensure that the feature has been loaded into the database and that the feature name is both unique and correct.
[site http://default] [TRIPAL WARNING] [TRIPAL_ANALYSIS_EXPRESSION] Feature scaffold4171_size32609 not found
[site http://default] [TRIPAL WARNING] [TRIPAL_ANALYSIS_EXPRESSION] The feature, scaffold4171_size32609, found in the expression file was not found in the Chado database. Please ensure that the feature has been loaded into the database and that the feature name is both unique and correct.
[site http://default] [TRIPAL WARNING] [TRIPAL_ANALYSIS_EXPRESSION] Feature scaffold4171_size32609 not found
We figured out what was going on; STAR produces a GFF file containing only the information on where each scaffold begins and ends, and that was accidentally used instead of the correct GFF3 file. Ray is rerunning HTSeq using the correct GFF3 file, which contains the gene IDs.
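A quick sanity check before rerunning could catch this kind of mix-up. The htseq-count call uses --type=gene and --idattr=ID, so the annotation needs rows whose third column is "gene" and whose ninth column has an ID= attribute. The sketch below counts those rows; `count_features_with_id` and `annotation.gff3` are hypothetical names used only for illustration.

```shell
# Count GFF3 rows of a given feature type (column 3) that carry an ID= attribute (column 9).
# Comment and directive lines (starting with #) are skipped.
count_features_with_id() {
    local gff="$1" type="$2"
    awk -F '\t' -v t="$type" '
        !/^#/ && $3 == t && $9 ~ /(^|;)ID=/ { n++ }
        END { print n + 0 }
    ' "$gff"
}
```

Running `count_features_with_id annotation.gff3 gene` on a file that only describes scaffold boundaries should print 0, which is a sign that htseq-count would have no gene IDs to count against.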
https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/813904 Normalization re-upload
Write a regex that adds 01 at the end.
I have a solution that will add the 01 to each gene name using the matrix.php script. @RaymondS1 please remind me to fix the script the next time you are working.
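For reference, the same renaming can be sketched outside of matrix.php. This is not that script, just a shell stand-in, and it assumes things the thread does not state: that the matrix is tab-separated, that the first row is a header of sample names, that gene names are in the first column, and that the suffix is appended literally with no separator. `add_suffix` is a hypothetical name.

```shell
# Append a suffix to the gene name in column 1 of a tab-separated matrix.
# Reads stdin, writes stdout; the header row (line 1) is passed through unchanged.
add_suffix() {
    local suffix="$1"
    awk -v sfx="$suffix" '
        # "&" in the replacement stands for the matched gene name
        NR > 1 { sub(/^[^\t]+/, "&" sfx) }
        { print }
    '
}
```

Usage would look like `add_suffix 01 < results.tsv > results_renamed.tsv`, where the output file name is a placeholder.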
https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/820426 Normalization re-run
Expression Uploaded https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/825751
Edit: Upload error
Failed to cache file /var/www/html/sites/default/files/sequences/fagus_sylvatica_transcriptome/tpm/tpm/results.tsv
@almasaeed2010
This is probably caused by a permissions issue. @jwest60 since you developed the caching feature for expression files, could you take a look at the code to see what might be causing this? The error only shows the file that is supposed to be cached but doesn't say where it is getting copied to. Try to find out where that folder is and let me know. I can attempt to fix the permissions from there.
The files should be getting cached in sites/default/files/expression.
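To narrow down the permissions question, something like the following could be run as the web server user. It is a rough sketch: `check_cache_dir` is a made-up helper, and the full path in the usage note assumes the site root is /var/www/html, as the error messages suggest.

```shell
# Report whether a directory exists and is writable by the user running the check.
# Returns 0 if writable, 1 if the directory is missing, 2 if it is not writable.
check_cache_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir"
        return 2
    fi
    echo "ok: $dir"
}
```

For example: `check_cache_dir /var/www/html/sites/default/files/expression`. Note that root passes the writability test regardless of the directory mode, so the result only means something when run as the account the Tripal jobs use.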
Expression shows up now.
https://hardwoodgenomics.org/bio_data/2476482?tripal_pane=group_expression
https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/830308 Biosample reupload
https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/831063 Expression reupload
@cricha59 Can you look over this organism so I can go ahead and close it?
There are actually two public transcriptomes.
"De novo transcriptome assembly and analysis of differential gene expression in response to drought in European beech": Raw sequence reads were submitted to the Sequence Read Archive (SRA) of NCBI under the accession number SRP100976. The de novo transcriptome assembly is available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.8k3q2.
"A unigene set for European beech (Fagus sylvatica L.) and its use to decipher the molecular mechanisms involved in dormancy regulation": Sanger sequences were deposited in the NCBI database (NCBI Accession: LIBEST_026606 WZOAFSAA and LIBEST_026417 WZOAFSCA). Roche 454 raw sequence reads have also been submitted to the Sequence Read Archive of NCBI (Accession no. SRX105749, SRX105748, SRX105747, SRX105746 and SRX105663). Assembled contigs (n = 21 057) can be downloaded from: http://ngspipelines.toulouse.inra.fr:9000/.
Both have quite interesting associated samples for expression data. But our database isn't really built to support more than one transcript set for an organism, so I think we have to pick one... need to discuss.