jdwinkler / resistome_generator

Development repo for the Resistome database and its tooling.

E. coli Resistome Database Repository

Welcome to the Resistome database repository! The code here is meant, at a minimum, to generate the Resistome database when the required inputs are present, and to run basic analyses that you might want to use in your own research. It is not really meant to be used as a library, but it might be useful nonetheless.

The public Resistome website is here if you want to quickly search for high-level data concerning genotype-phenotype relationships.

About

The Resistome contains standardized representations of E. coli mutants resistant or sensitized to over 500 types of inhibition (solvents, environmental stresses, antibiotics, and many others). This unique data source is intended to help synthetic biologists and evolutionary engineers identify loci likely to affect their phenotype of interest for forward genetic engineering. Ultimately, we hope to design explainable machine learning approaches that use these curated data to predict the genotypes required to confer desired phenotypes, or to improve the effectiveness of (targeted) adaptive laboratory evolution (ALE). Our curated dataset contains information on >10,000 E. coli mutants, including the specific variants detected in strains from library or selection experiments carried out in more than 440 studies (as of December 2020).

This database does not include information concerning phenotypes endowed by foreign genetic elements (either libraries or mobilizable elements/plasmids), as other databases such as CARD will cover those in more detail. A related database focused solely on ALE was recently published by the Palsson laboratory at aledb.org, in case it is also useful for your research.

Quick Setup

Here is the quick setup procedure:

  1. Set up a Python >=3.8 (or equivalent) virtual environment.
  2. Install the Python requirements using pip install -r requirements.txt.
  3. Install and set up Postgres (if needed).
  4. Download the latest database dump from the public Resistome website (dump, summary page).
  5. Restore the custom format SQL dump using pg_restore into your target database.
  6. Adjust the credentials in /inputs/db_credentials/credentials.txt to match your target database.
  7. Resist away!

See below for more details concerning the expected inputs and how to build the database manually. This repo will usually have a more up-to-date database by virtue of being easier to deploy.
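As a rough illustration of step 5, the sketch below assembles a pg_restore invocation from Python. It is only an example under stated assumptions: the dump file name resistome.dump and the postgres username are placeholders, not names taken from this repo, and the function itself is hypothetical. The flags used (--host, --port, --username, --dbname, --no-owner) are standard pg_restore options.

```python
import shlex


def build_pg_restore_command(dump_path, dbname="resistome", host="localhost",
                             port=5432, username="postgres"):
    """Return the argument list for restoring a custom-format dump with pg_restore."""
    return [
        "pg_restore",
        "--host", host,
        "--port", str(port),
        "--username", username,   # placeholder role; use your own
        "--dbname", dbname,
        "--no-owner",             # do not try to restore the original object ownership
        dump_path,
    ]


if __name__ == "__main__":
    # "resistome.dump" is a hypothetical file name for the downloaded dump.
    command = build_pg_restore_command("resistome.dump")
    # Print the command for review; execute it with subprocess.run(command, check=True).
    print(shlex.join(command))
```

The target database must already exist before the restore is run.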

Required Inputs/Infrastructure

Database Server

The Resistome assumes a Postgres server is running on localhost at port 5432. The default DB name is resistome, with the username and password defined here. These constants can be changed if desired.
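If you want to connect to the same server from your own tools, the defaults above can be combined into a standard libpq connection URI. This is a minimal sketch, not the repo's own connection code: build_postgres_uri is a hypothetical helper, and the credentials shown in the usage note are placeholders.

```python
from urllib.parse import quote


def build_postgres_uri(user, password, host="localhost", port=5432, dbname="resistome"):
    """Build a libpq-style connection URI from the default server settings."""
    # Percent-encode credentials so characters such as '@' or ':' stay unambiguous.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{dbname}"
    )
```

For example, build_postgres_uri("alice", "s3cret") gives postgresql://alice:s3cret@localhost:5432/resistome, which most Postgres clients accept as a connection string.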

Database Schema

The annotation, Resistome, and "extra" schemas are stored under inputs/sql. An ER diagram can be generated using most available database management tools, such as DBeaver or pgAdmin. The supporting annotation databases are structured to maximize referential integrity, but there are no cross-schema (public-resistome) constraints.

Curated Data

The curated study input files are stored under inputs/database_store. The data are stored in a custom format that is hard to read/modify, but it should work well enough for analysis. Ideally, we will switch over to a combination of XML typed records with a validating schema after repeating the resequencing analysis for the relevant studies, but Resistome development is currently unfunded.

These files are generated using a parser like resistome/examples/record_generator.py. Every study essentially requires a custom parser, so there is little value in code re-use. If you use a workflow management system, you should try to automate the read preparation => variant calling => output generation steps against a common reference, and emit a more standard format such as VCF or GenomeDiff.
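To show why a standard format helps, here is a minimal sketch of reading the fixed leading columns of a VCF data line (CHROM, POS, ID, REF, ALT) into a small record. It is not part of this repo and does not produce the Resistome's custom format; the Variant type and parse_vcf_line function are hypothetical names used for illustration.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int          # 1-based position, as defined by the VCF specification
    ref: str
    alts: List[str]   # one or more alternate alleles


def parse_vcf_line(line):
    """Parse one VCF data line; return None for header lines and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # Multiple alternate alleles are comma-separated in the ALT column.
    return Variant(chrom, int(pos), ref, alt.split(","))
```

Because every study would share this layout, one parser like this could replace many study-specific ones.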

NCBI/Uniprot Inputs

NCBI and Uniprot annotations are used to populate the annotation tables that the Resistome uses to disambiguate curated datasets. These strains are:

The Genbank feature files (GBFF), RNA (rna_from_genomic), protein, CDS, genome, and feature tables are required for each strain. See this readme file for more information. The Uniprot table defining the K-12 proteome is also used for K-12 derivative strains to help disambiguate gene names further.

We previously used Biocyc databases to provide this same information for E. coli MG1655, REL606, and W strains.

Standardization

Gene names, phenotypes, and compound names are standardized to enable qualitative cross-comparison between experiments. The files defining these mappings are included under inputs/standardization and inputs/settings. If you are adding more data that mention new phenotypes or compounds, you will need to update the files listed in the standardization README. An error will be thrown if you attempt to add a paper to the database that contains unknown phenotypes.

Gene name mappings can be generated using bidirectional best hits between E. coli strains if mappings are not publicly available. For E. coli strains, it is generally possible to determine the mapping from pre-computed annotations as strains will be tagged with the equivalent MG1655 gene (if present) by PGAP or other public annotation pipelines. A future update will examine adding a pipeline to handle de novo mapping generation using BLAST.
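The bidirectional best hit idea can be sketched in a few lines. This is an illustration only, not the pipeline used by the Resistome: it assumes you have already run a similarity search in both directions and collected the scores into nested dictionaries, and both function names are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` has the shape {query: {subject: bit_score}}.
    Ties are broken by dictionary order (the first maximum wins).
    """
    return {
        query: max(subjects, key=subjects.get)
        for query, subjects in scores.items()
        if subjects
    }


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return {gene_a: gene_b} for pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {a: b for a, b in best_ab.items() if best_ba.get(b) == a}
```

A gene in strain A is mapped only when its best match in strain B points back to it, which filters out one-sided matches such as paralogs.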

Protein-Protein Interactions

We use the EcoliNet functional gene-gene interaction network to add some additional predictive power for understanding how genotypes translate into phenotypes. However, this approach has not been extensively tested so you may wish to look at more detailed datasets.

Protein Change Effect Prediction Inputs

SNAP2, INPS, and DeMaSk are used to predict the effect of amino acid substitutions on protein function. These inputs were provided by external collaborators but are too large to distribute in this repository. You can download the original SNAP2 datasets generated in 2016 here; a future update will re-run these analyses for the strains currently represented in the Resistome. DeMaSk predictions for all strains can be found here. See the README for more details.

Once (if) AlphaFold2 becomes generally available, it may be possible to include assessments of structural impacts directly.

Citations:

RegulonDB Extraction

Currently, regulatory interactions are extracted from the provided NCBI databases and RegulonDB. See the README for more information. Genome annotations for promoters, operons, terminators, and DNA binding sites are also extracted from the database tables provided by the maintainers.

Metabolic Models

Both iJO1366 and iML1515 are included for use in simulating the metabolic impacts of mutations.

Full Setup

Assuming you have all the required data and have set up your Postgres server, you are now ready to build the database locally. From the repository root, you should be able to run the following command:

python3 install_db.py

to start the database construction process. This script will first construct the NCBI support tables, followed by the Resistome tables. See resistome/sql/ncbi_data_parser.py and resistome/sql/resistome_builder.py for the actual work of constructing the public and Resistome tables, respectively. Name mappings are constructed using the resistome/utils/name_helper.py script to help disambiguate gene names in curated data. A basic validation of uploaded data is performed by the resistome/sql/validator.py script.

You can build each database (support/Resistome) separately, but if you rebuild the NCBI tables, you should rebuild the Resistome as well. The build process should require no more than 10-15 minutes on a modern laptop. The design of the database tries to minimize inconsistencies arising from curation and reporting errors, but if you notice any problems or oddities in the data, please contact us.

Curation Accuracy

As of December 2020, ~95.5% of variant calls in the Resistome pass validation (e.g. WT bases or residues are correct, genomic locations are valid, genes are matched to accessions, etc). See resistome/sql/validator.py for more information. If genes involved in large (deletion, amplification, inversion) mutations are included, then >99% of annotations appear to be correct.
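As an example of the kind of check described above, the sketch below tests whether a reported wild-type base agrees with a reference sequence. It is not the code in resistome/sql/validator.py; the function name and the 1-based coordinate convention are assumptions made for illustration.

```python
def wild_type_base_matches(genome, position, expected_base):
    """Check that a reported wild-type base agrees with the reference sequence.

    `position` is assumed to be 1-based; out-of-range positions fail validation.
    """
    if not 1 <= position <= len(genome):
        return False
    # Compare case-insensitively because sequence files may use soft-masked bases.
    return genome[position - 1].upper() == expected_base.upper()
```

A variant call whose wild-type base does not match usually points to a different reference strain or a transcription error in the source study.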

Adding New Data

It is usually straightforward to add data to the Resistome by using the resistome/examples/record_generator.py script. You will need to alter this script to extract genetic or transcriptional data from the study or studies of interest, depending on how they format their data. Mutation types are limited to the entries in inputs/settings/FK_InternalFields.txt that carry a "mutation" tag.

Unfortunately, the annotation style can get pretty complicated, and there is no formal definition of the annotation grammar. In the meantime, I suggest searching through the existing database_store files to see the format used for each mutation type. Some common ones:

The resistome_builder.py script will do as much QA checking as possible to make sure your entries match the expected format, but this checking will not be foolproof. You should also try to use absolute genomic coordinates if possible to simplify data analysis; at the moment, the positions in the following strains are checked: BW25113, BL21, BL21(DE3), MG1655, REL606, W, W3110, and MDS42. Other strains will be added as they become used for ALE or library experiments in the future.

Note: large mutations are automatically expanded into the complete mutation set: all genes between gene 1...gene N are included if they can be found in the supporting databases. Otherwise, only the first/last genes are associated with the mutations.
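The expansion rule in the note can be sketched as follows. This is a simplified illustration, not the logic in resistome_builder.py: it assumes the genes are available as a list in genome order, it ignores wraparound on the circular chromosome, and the function name is hypothetical.

```python
def expand_large_mutation(ordered_genes, first_gene, last_gene):
    """Return every gene from first_gene through last_gene in genome order.

    Falls back to just the two endpoints when either gene is not found.
    """
    try:
        start = ordered_genes.index(first_gene)
        end = ordered_genes.index(last_gene)
    except ValueError:
        # At least one endpoint is missing from the supporting annotation.
        return [first_gene, last_gene]
    if start > end:
        # Accept endpoints given in either order.
        start, end = end, start
    return ordered_genes[start:end + 1]
```

With an ordered list such as ["thrL", "thrA", "thrB", "thrC"], a deletion reported as thrA to thrC expands to thrA, thrB, and thrC.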

Web Interface

The code for the web interface hosted on the public Resistome website can be found on GitHub. It will work with the database dump present in the repo above, but it may require updates to work with this repo directly as the database schema changes over time.

Basic Analysis

You can run the run_analysis.py script to generate many of the figures that have previously appeared in Resistome publications. All database accesses go through the resistome/sql/sql_interface.py module, which is a good source of ideas on how to access the data from Python without writing raw SQL.

Help/Collaborations

Please file an issue describing your problem, including any error output you are getting. The Resistome is very much a research product, so unfortunately you should expect rough edges. You can also email James Winkler if you need assistance or would like to collaborate on a research project.

Resistome Publications

  1. Winkler, JD et al. "The Resistome: A Comprehensive Database of Escherichia coli Resistance Phenotypes", ACS Synthetic Biology (2016) DOI.
  2. Erickson, KE, Winkler, JD et al. "The Tolerome: A Database of Transcriptome-Level Contributions to Diverse Escherichia coli Resistance and Tolerance Phenotypes", ACS Synthetic Biology (2017) DOI.
  3. Winkler, JD. "The Resistome: updating a standardized resource for analyzing resistance phenotypes", bioRxiv (2018) DOI.

License

This work is licensed under CC BY-NC 4.0: see this webpage for additional information. RegulonDB has its own separate license and can only be used for academic/non-commercial research. Please see here for the full license terms.