This is an extension module for the Tripal project.
Please note this module requires Tripal 3 or greater. The Tripal 2 functional module is available for download but is no longer supported.
Tripal Analysis: Expression is a Drupal module built to extend the functionality of the Tripal toolset. The purpose of the module is to visually represent gene expression for Tripal features. This module requires the following Tripal modules:
git clone <URL>
on command line.drush en tripal_analysis_expression
(for more instructions, read the Drupal documentation page).Three loaders are provided by this module, a biosample loader, an expression loader, and a p-value loader. The biosample loader has the ability to load data from a flat file or from an xml file downloaded from NCBI. The expression loader can load expression data in column or matrix format. The p-value loader takes a csv or tsv file.
Once expression data is loaded, a display field will be shown on each feature page that has corresponding biosamples and expression values.
This module provides a search and results block to search for and select features to display in a heatmap.
The following bundles are defined by Tripal Protocol
The following fields are defined by Tripal Protocol
, Tripal Biomaterial
, and Tripal Analysis Expression
Expression Data
Biosample Browser
Biosamples may be loaded from a flat file or from a BioSample xml file downloaded from NCBI. The steps for loading biosamples are as follows (detailed instructions can be found further below):
To obtain a xml BioSample file from ncbi go the NCBI BioSample database. Search for and select the BioSamples you would like to download.
Click the "Send to:" link. Then select "File" and select "Full XML (text)" as the format. Then click "Create File".
Click here to see an example XML BioSample file from NCBI.
To upload the file into Chado/Tripal, navigate to:
Tripal->loaders->chado_biosample_loader
First, provide the path on the server to the biosample file, or use the file uploader. You must select an Organism to associate the biosamples with. You may also associate the imported biosamples with an analysis, but this is not required.
Press the Check Biosamples button to preview your biosample properties. To take advantage of a controlled vocabulary (CV), you must manually assign each property to a CVterm. The uploader will list all CV terms matching each property, and provide the CV, database (DB) and accession for the match. If a match does not exist for your term, use the CVterm browser to identify an appropriate CVterm in your Tripal site, and rename the property in your input file to match the term. If no term exists in your database, you should use the EBI ontology lookup service to identify an appropriate term and insert it manually, or, load the corresponding CV.
Pressing the 'Check Biosamples' button allows you to assign CVterms to every biosample property in your upload. If there isn't a suitable CVterm, you should rename it in your upload file to match a CVterm in the database and/or insert new CVterms.
After clicking "Submit job", the page should reload with the job status and Drush command to run the job. Copy and paste the Drush command and run it on command line. Upon running the Drush command, any warning/error/success/status message should be displayed.
Alternatively biosamples may be loaded from a flat file (CSV or TSV). The flat file loader is designed to upload files that are in the NCBI BioSample submission format which can be downloaded here. Download the TSV version of the file. The file must have a header that specifies the type of data in the column. There must be one column labeled "sample_name". The loader will begin to collect data from the line that follows the line containing "sample_name" which is assumed to be the header line. Columns are not required to be in any order. Other columns will be either attributes or accessions. Available NCBI attributes can be found here. Available accession headers are bioproject_accession, sra_accession, biosample_accession. All other columns will be uploaded as properties. To upload other accessions use the bulk loader provided with this module labeled, "Biomaterial Accession Term Loader". This loader will load a flat file with 3 columns (sample name, database name, accession term). A Tripal database must be created with the same name as the database name in the upload file.
Click here to see an example of a CSV file and a TSV file.
The Biosample loader can accept a server path, or, you can use the Tripal file uploader to directly upload files to the server.
If there are multiple sequencing runs associated with the same biosample that should be treated as independent biosamples, the XML loader will not work (it creates only one record). To get around this, follow these instructions:
1) Go to NCBI (https://www.ncbi.nlm.nih.gov/) and search for a specific organism ( make sure SRA is selected)
2) Narrow down search by selecting: a) Proper source, DNA or RNA (on the left side) b) Proper organism on the left side c) Platform, most likely illumina d) Any others, as needed
3) Select: Send to, then Run Selector
4) Click metadata to download the csv file
After loading, biosamples must be published to create entities for each biosample content type. As an administrator or user with correct permissions, navigate to Content->Tripal Content->Publish Tripal Content. Select the biological sample type to publish, apply any optional filtering, and press Publish.
Biosamples may also be loaded one at a time. As an administrator or a user with permission to create Tripal content, go to: Content->Tripal Content -> Add Tripal Content -> Biological Sample. Available biosamples fields include the following.
Properties - The properties describing this biosample, such as "age" or "geographic location". Each property type utilizes a CVterm.
Properties inserted into the database using the biosample bulk loader will be made available as new fields. They can be found by going to admin->structure->Tripal Content Types -> Biological Sample and pressing the + Check for New Fields button in the upper left hand of the screen.
Checking for new fields in the Structure-> Tripal Content Type admin area allows you to add existing properties to a Tripal content type. This allows you to manually enter values during content creation, as well as configure the display.
If you would like to create new properties, you may do so in the structure menu. Using the Add New Field row, enter the label and select Chado Property for the field type. After pressing Save, you must assign a CVterm to this property in the Controlled Vocabulary Term section. If an appropriate CVterm does not exist, you must insert it before you can create the field. To do so, navigate to tripal/loaders/chado_cvterms
and press the *Add Term** button.
If a desired property field does not exist, you can create it manually in the Structure-> Tripal Content Type admin area by setting the field type to 'Chado Property'
The steps for loading expression data are as follows (detailed instructions can be found further below):
Before loading data, describe the experimental setup used to collect the data. As an administrator or a user with permission to create content, navigate to content -> tripal content -> Analysis.
Note that program name, program version, and source name must be unique as a whole for analysis to be inserted correctly (click here to read more about the data structure for analysis).
The Chado Expression Data Loader provides a way for the user to load expression data associated with the experiment. The loader can load data from two types of formats, matrix and column. The matrix format expects a row of data containing biosample names. The first column should be unique feature names. Features must already be loaded into the database. Biosamples will be added if not present. Expression values will map to a biosample library in the column and a feature in the row. Only one matrix file may be loaded at a time. The column format expects the first column to contain features and the second column to be expression values.
For an example column file, click here. For an example matrix file, click here.
The biosample name will be taken as the name of the file minus the file extension. Features must already be loaded into the database. Biosamples will be added if not present. Multiple column format files may be loaded at the same time by uploading multiple files or, if providing a server path, if all files are in the same folder with the same file extensions. Either format may have header or footer information. Regex can be used in the form to only record data after the header and before the footer. The data loader fields are the following:
File Upload - You may upload a file using the loader, or provide a path on the server. The path may also be set to a directory, in which case all column files with the "File Type Suffix" specified above will be loaded. When loading multiple files, a file suffix (extension) must be specified.
Analysis - The analysis to associate the expression data with.
Organism (required) - The organism.
Source File Type - This can be either "Column Format" or "Matrix Format".
Name Match Type - Will the data be associated with feature names or unique names?
File Type Suffix - The suffix of the files to load. This is used to submit multiple column format files in the same directory. A suffix is not required for a matrix file.
Regex for Start of Data - If the expression file has a header, use this field to capture the line that occurs before the start of expression data. This line of text and any text preceding this line will be ignored.
Regex for End of Data - If the expression file has a footer, use this field to capture the line that occurs after the end of expression data. This line of text and all text following will be ignored.
The "Experimental Design" fields allow a complete description of the experimental design by implementing the Chado MAGE design schema. The Chado MAGE module uses the arraydesign, assay, quantification, and acquisition tables to describe an experiment. The Tripal Analysis Expression creates generic instances of all these for you.
The expression loader accepts parameters describing your file, as well as some experimental design parameters.
The steps for loading p-value data are as follows (detailed instructions can be found further below):
The Evidence & Conclusion Ontology contains evidence codes that will get associated with pvalue data uploaded through this module. To install this ontology:
Your raw data will likely come in a large spreadsheet that looks something like this, but probably with more columns and a lot more rows:
baseMean | log2FoldChange | lfcSE | stat | pvalue | |
---|---|---|---|---|---|
Fraxinus_pennsylvanica_120313_comp62618_c0_seq4 | 2818.699043 | 6.225005757 | 0.341692178 | 18.21816874 | 3.70E-74 |
Fraxinus_pennsylvanica_120313_comp57222_c0_seq2 | 1864.947509 | 6.034774942 | 0.346657723 | 17.40845374 | 7.12E-68 |
Fraxinus_pennsylvanica_120313_comp41195_c0_seq1 | 791.6553987 | 5.720052743 | 0.33026481 | 17.31959495 | 3.35E-67 |
The pvalue data loader only cares about two columns (order matters):
Copy the values from these columns (and only the values, we don't care about the headers) into a new sheet. The resulting spreadsheet should contain exactly two columns (feature name first, then pvalue) with no headers. Save the sheet as either a csv or a tsv. The resulting file should be in this format (csv), or alternatively use tab delimiters (tsv format):
FRAEX38873_v2_000000010.1,3.70E-74
FRAEX38873_v2_000000010.2,0.27
FRAEX38873_v2_000000020.1,0.39
FRAEX38873_v2_000000030.1,0.43
FRAEX38873_v2_000000040.1,0.85
FRAEX38873_v2_000000050.1,0.24
FRAEX38873_v2_000000060.1,0.13
FRAEX38873_v2_000000070.1,0.76
FRAEX38873_v2_000000080.1,0.88
FRAEX38873_v2_000000090.1,0.91
FRAEX38873_v2_000000100.1,0.99
...
You might also want to record the name of the sheet while you have it up; this will be needed for the experimental factor and evidence code needed in the form.
Your experimental factor and your evidence code will be directly tied to the name of the sheet. Here are a few examples of how sheet name ties into experimental factor and evidence code:
Sheet Name | Experimental Factor | Expression Relationship |
---|---|---|
Cold Up | response to cold | up-regulated |
EAB RvS Post-feeding down | response to insect | down-regulated |
Ozone Response | response to ozone | related to |
The analysis does not need to be unique for each set of p-value data. You could skip this step depending on how you organize your site.
As an administrator or a user with permission to create content, navigate to Content -> Tripal Content -> Analysis.
Note that program name, program version, and source name must be unique as a whole for analysis to be inserted correctly (click here to read more about the data structure for analysis).
The P-Value Data Loader provides a way for the user to load p-value data for a feature in relation to a keyword. The loader can load data from a two-column csv or tsv file, where the first column contains the unique name of the feature and the second column contains the p-value. Features must already be loaded into the database.
For an example p-value file, click here.
Loaded expression data can be viewed and downloaded by users in three places. Feature pages will gain access to the Expression field (data__gene_expression_data
). You can configure the appearance of this field by navigating to Structure -> Tripal Content Type -> [Feature type (ie, mRNA)]
. If the expression field is not listed, press the Check for New Fields button in the upper left. Once the field is attached, navigate to the Manage Display tab, enable the field display, and place it to your liking.
In this example, we have placed the Expression field in an a Tripal pane all of its own.
Data downloads are provided for individual features, analyses, and for feature sets selected in the heatmap. For data downloading to be functional, you must populate the materialized views associated with this module. This can be done by navigating to Tripal -> Data Storage -> Chado -> Materialized Views. Press the populate link for the expression_feature and *expression_feature_all** materialized views and run the submitted job. Materialized views must be manually repopulated when you add new data.
The expression field allows users to view all expression data available for a feature. Because a database might have multiple experiments involving a single feature, data is first organized by Analysis. Users can select analyses using the "Select an Expression Analysis" box which lists all analyses with expression data available. The plot can be further customized based on the biosample properties. The "Select a property to group and sort biosamples" select box will allow users to pick a property to organize samples along the X axis. Users may select Sample Name to elect not to group samples by property. Values may be colored by their expression value (default), or by selecting a different property in the "Select a property to color biosamples " box.
The Feature Expression Data field allows users to plot expression data according to biosample properties.
Once plotting parameters are set, users can click and drag to re-arrange both the legend and the individual groups. The "Only Non-Zero Values" button removes samples with expression values of 0, tidying the plot.
Data can be downloaded in matrix format by pressing the "Download expression dataset for this feature" link.
As with fields attached to feature, you must add the new Analysis fields and configure their display by navigating to Structure -> Tripal Content Type -> Analysis
.
The Biomaterial browser will list biosamples and their properties. The Expression Data field will not visualize expression data, but allow the user to download all expression data associated with this analysis in matrix form (rows: features, columns: biosamples).
This module creates two blocks: one for features input and the other displaying a heatmap for the input features.
Go to Structure->blocks and find these two blocks: tripal_analysis_expression features form for heatmap and tripal_elasticsearch block for search form: blast_merged_transcripts. Configure these two blocks to display at specific region and page(s). The tripal_analysis_expression features form for heatmap will display a form that allow you to input feature IDs.
After you enter feature IDs, you click the "Display Expression Heatmap" button to generate a heatmap for the features.
The Heatmap settings administrative page allows you to configure the heatmap search form. Here you can configure the placeholder text that appears, as well as the example feature ID's that the user can populate with the "try an example" button. There is also a checkbox which enables elasticsearch integration. Elasticsearch integration requires the tripal_elasticsearch
module and a configured elasticsearch instance. See the Tripal Elasticsearch module for more details and instructions.
The units associated with your expression data are stored in Chado associated with the quantification. Units can be stored even if you did not specify a quantification (a generic quantification was used in this case).
You can use the quantification units administrative page to add or edit units on your quantification by Navigating to Tripal -> Extensions -> Protocol -> Quantification Units. All quantification instances appear in the table at the bottom of the admin page. Click 'Edit' to change the units for an individual quantification.
You can also assign quantification units to all unitless quantifications using the Assign Units box.
Acquisition, Quantification, Array Design, and Assays all utilize protocols to describe them. Think of protocols as the experimental design, and Acquisitions, Quantifications, Array Designs, and Assays as the experiment following that experimental design.
Note that these content types are provided by Tripal Core.
There is currently no support for inputting, or displaying, acquisitions, quantifications, or assays. The Expression module creates generic instances of these entities.
Protocol Description - The protocol content types can be created by navigating to Add Tripal Content->Protocol. A protocol can be used to add extra detail to an experimental design. A protocol can be used to describe the assay, acquisition, and quantification steps of the experiment design. A protocol can also be used to further describe the array design content type. The fields of a protocol are:
This module shares some controlled vocabulary terms and database references with the tripal-eutils module. To provide compatibility between these two modules, as of tripal_biomaterial
update 7303 the following changes have been made:
drush updatedb
or by navigating to /update.php
on your site.SELECT DISTINCT CVT.cvterm_id, CV.name AS old_cv_name, CVT.name AS term_name, CVT2.cvterm_id, CV2.name AS new_cv_name FROM chado.biomaterialprop BP LEFT JOIN chado.cvterm CVT ON BP.type_id=CVT.cvterm_id LEFT JOIN chado.cv CV ON CVT.cv_id=CV.cv_id LEFT JOIN chado.dbxref X on CVT.dbxref_id=X.dbxref_id LEFT JOIN chado.db DB on X.db_id=DB.db_id LEFT JOIN chado.cvterm CVT2 ON REPLACE(CVT.name,' ','_')=REPLACE(CVT2.name,' ', '_') LEFT JOIN chado.cv CV2 ON CVT2.cv_id=CV2.cv_id WHERE CV.name IN ('biomaterial_property', 'ncbi_properties') AND CV2.name='NCBI BioSample Attributes';
For example:
cvterm_id | old_cv_name | term_name | cvterm_id | new_cv_name
-----------+----------------------+----------------------+-----------+---------------------------
3019 | biomaterial_property | sample_name | 65128 | NCBI BioSample Attributes
3114 | biomaterial_property | description | 64867 | NCBI BioSample Attributes
61429 | ncbi_properties | age | 64780 | NCBI BioSample Attributes
61456 | ncbi_properties | biomaterial provider | 64808 | NCBI BioSample Attributes
...
WITH LOOKUP AS (SELECT BP1.biomaterialprop_id, CVT2.cvterm_id FROM chado.biomaterialprop BP1 JOIN chado.cvterm CVT1 ON BP1.type_id=CVT1.cvterm_id JOIN chado.cv CV1 ON CVT1.cv_id=CV1.cv_id JOIN chado.cvterm CVT2 ON REPLACE(CVT1.name,' ','_')=REPLACE(CVT2.name,' ','_') JOIN chado.cv CV2 ON CVT2.cv_id=CV2.cv_id WHERE CV1.name IN ('biomaterial_property', 'ncbi_properties') AND CV2.name = 'NCBI BioSample Attributes') UPDATE chado.biomaterialprop BP3 SET type_id=LOOKUP.cvterm_id FROM LOOKUP WHERE BP3.biomaterialprop_id=LOOKUP.biomaterialprop_id;
Thanks goes to these wonderful people (emoji key):
Meg Staton π π€ | Bradford Condon π» | Abdullah Almsaeed π» | Joe West π» | Ming Chen π» | mboudet π | marcsilvaitqb π |
This project follows the all-contributors specification. Contributions of any kind welcome!