jodyphelan / tbdb

Standard database for the TBProfiler tool
GNU Lesser General Public License v3.0

TBDB: A repository for the TBProfiler library

This repository contains the scripts and data to generate all files required to run tb-profiler.

Why is there a separate GitHub repository?

With analysis pipelines pretty much standardised, it is evident that accuracy of prediction is affected mostly by the underlying library of mutations. As new evidence for the inclusion or exclusion of mutations is generated, there is a constant need to update and re-evaluate the mutation library. Moreover, it is important for control of the library to be put in the hands of the end-users. Hosting the library in a separate repository (rather than burying it deep in the profiling tool code) makes it easier to find out exactly which mutations are present. Additionally, GitHub has a number of useful features which can be utilised.

tl;dr - Hosting it here makes it easier to update the library.

Want to contribute?

If you think a mutation should be removed or added, please raise an issue here. If you want to help curate the library, leave a comment here.

Adding/removing mutations

Mutations can be added by submitting a pull request from a branch containing a modified tbdb.csv file. If that sentence made no sense to you, you can suggest a change using an issue and we will try to help. On submitting a pull request, the tbdb_bot will automatically calculate the confidence of the mutations in question and submit the results as a comment on the pull request (like this). All tbdb_bot checks should pass and at least two reviews should be requested; once reviewed, the pull request can be merged into the master branch.

How does it work?

The mutations are listed in tbdb.csv. This file is parsed by parse_db.py to generate the JSON-formatted database used by TBProfiler, along with a few more files. Mutations can be added to or removed from tbdb.csv, and a new library can then be built using parse_db.py.
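The CSV-to-JSON flow described above can be pictured with a small sketch. This is not parse_db.py: the function name build_library and the nested layout (gene, then mutation, then a list of annotation records) are assumptions made for illustration, and the JSON schema actually used by TBProfiler may well differ.

```python
import csv
import io
import json
from collections import defaultdict


def build_library(csv_text):
    """Group CSV rows into a nested dict: gene -> mutation -> list of annotations.

    Hypothetical layout for illustration only; not the schema written by parse_db.py.
    """
    library = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        gene = row.pop("Gene")
        mutation = row.pop("Mutation")
        # Everything else (Drug, Literature, any extra columns) is kept as-is.
        library[gene][mutation].append(row)
    # Convert the defaultdicts to plain dicts so they serialise predictably.
    return {gene: dict(mutations) for gene, mutations in library.items()}


example = (
    "Gene,Mutation,Drug,Literature\n"
    "katG,frameshift_variant,isoniazid,\n"
)
print(json.dumps(build_library(example), indent=2))
```

Keeping every non-key column in the record mirrors the idea that extra columns are carried through to the library, but the real build script should be treated as the reference.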

tbdb.csv

This is a CSV file which must contain the following column headings:

  1. Gene - This can be the gene name (e.g. rpoB) or the locus tag (e.g. Rv0667).
  2. Mutation - This must follow the HGVS nomenclature. See the "Mutation format" section below.
  3. Drug - Name of the drug.
  4. Literature - Any literature which provides evidence for the mutation. PubMed or PubMed Central IDs (e.g. PMC3315572) or DOIs are recommended, but in theory anything can be put here. Multiple entries can be separated with ";".

The first three columns must contain a value; Literature may be left empty. Additional columns may be added; they will be built into the JSON library and can be output in the tb-profiler results.
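Before opening a pull request, the column rules above can be checked locally. The following is a minimal sketch, not part of this repository: the helper names check_rows and split_literature are hypothetical, and the line numbers it reports assume that no cell spans several lines.

```python
import csv
import io

REQUIRED_COLUMNS = ("Gene", "Mutation", "Drug", "Literature")
MUST_HAVE_VALUE = ("Gene", "Mutation", "Drug")


def check_rows(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in header]
    if problems:
        return problems
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        for name in MUST_HAVE_VALUE:
            # Short rows yield None for missing cells, so guard before stripping.
            if not (row[name] or "").strip():
                problems.append(f"line {line_number}: empty {name}")
    return problems


def split_literature(value):
    """Split a Literature cell on ';' and drop empty entries."""
    return [entry.strip() for entry in value.split(";") if entry.strip()]
```

To check a local copy, pass the file contents, for example check_rows(open("tbdb.csv").read()); an empty list means none of these particular rules were violated.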

Mutation format

Mutations must follow the HGVS nomenclature. Information on this format can be found here. The following types of mutations are currently allowed:

In addition, sequence ontology terms can be used in place of the mutation. Supported sequence ontology terms can be found at http://pcingola.github.io/SnpEff/se_inputoutput/#effect-prediction-details. For example, the following line is used to denote that any frameshift in katG confers resistance.

Gene  Mutation            Drug       Confers     Interaction  Literature  WHO Confidence
katG  frameshift_variant  isoniazid  resistance

Additionally, ranges can be applied to sequence ontology terms to limit the resistance association of a term to a certain region within the gene, protein or non-coding RNA. To use gene coordinates, add "_c.X_Y", where X and Y are the bounds of the range within which the variant should occur. Similarly, use "_p.X_Y" for protein coordinates and "_n.X_Y" for RNA coordinates. For example, to match any missense variant between codons 430 and 470, the term will be missense_variant_p.430_470.
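The range suffix described above can be read with a short regular expression. This is a sketch under stated assumptions rather than the parsing code used by TBProfiler: the names parse_term and in_range are hypothetical, the bounds are treated as inclusive, only non-negative positions are handled, and the numbers in the usage note are arbitrary.

```python
import re

# A term, then "_c.", "_p." or "_n.", then the two bounds separated by "_".
RANGE_SUFFIX = re.compile(
    r"^(?P<term>[A-Za-z0-9_]+?)_(?P<level>[cpn])\.(?P<start>\d+)_(?P<end>\d+)$"
)


def parse_term(value):
    """Split 'term_p.X_Y' into (term, level, start, end); plain terms get no range."""
    match = RANGE_SUFFIX.match(value)
    if match is None:
        return value, None, None, None
    return match["term"], match["level"], int(match["start"]), int(match["end"])


def in_range(value, level, position):
    """True if a variant at `position` on `level` ('c', 'p' or 'n') falls under `value`."""
    term, term_level, start, end = parse_term(value)
    if term_level is None:
        return True  # no range given: the term applies anywhere in the gene
    # Assumption: both bounds are inclusive.
    return term_level == level and start <= position <= end
```

With these helpers, in_range("missense_variant_p.100_200", "p", 150) is true, while the same term does not match position 201 or a position given in gene ("c") coordinates.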

Important! The mutations and resulting library files are in reference to the H37Rv (NC_000962.3/AL123456.3) reference genome.

Watchlist

For some genes it may be of interest to record mutations even if we do not have any specific mutations associated with them. To allow this functionality we have included a "watchlist" file. To include genes, add them and their associated drug(s) to the tbdb.watchlist.csv file.

TB-Profiler

This repo contains all the files required to generate a library for tb-profiler. To find out more about how to build and load the library, please visit the tb-profiler repo.