This is a template for any new hash-based MLST database.

Why?

We want a shared space for MLST alleles, with mechanisms to add, remove, and curate them. No solution to this is perfect, so here are the advantages and disadvantages of our approach.

Advantages

Contextualize your genomes against what others have already shared.
Alleles are hashed and so sequence data are not revealed
The hash is a fixed length, and so it is an easy check to see if an allele has been truncated.
Frees the database from dependence on funding sources.
Git repo!
- ... can be copied and/or made decentralized easily.
- ... can be versioned
- ... can be forked - individuals or institutions can decide to have their own database
- ... can be pushed - new alleles or loci can be updated
- ... can be pulled - databases can update with the latest alleles or loci
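
As a rough illustration of the hashing and fixed-length points above, here is a minimal Python sketch. It assumes SHA-256 as the hash and uppercases the sequence before hashing; both are assumptions made for this example, and the actual algorithm and normalization rules are whatever docs/specification.md defines. The helper names `hash_allele` and `is_complete_digest` are hypothetical and are not part of the scripts in this repo.

```python
import hashlib

# Assumption: SHA-256 hex digests, which are always 64 characters long.
DIGEST_LENGTH = 64


def hash_allele(sequence: str) -> str:
    """Return a hex digest for an allele sequence (hypothetical helper)."""
    # Assumption: drop whitespace and uppercase so formatting does not change the hash.
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def is_complete_digest(digest: str) -> bool:
    """Check that a stored digest has the expected length and only hex characters."""
    if len(digest) != DIGEST_LENGTH:
        return False
    return all(c in "0123456789abcdef" for c in digest.lower())


digest = hash_allele("ATGGCTAAAGGTTAA")
print(len(digest))                      # digest length does not depend on allele length
print(is_complete_digest(digest))       # a full digest passes the check
print(is_complete_digest(digest[:40]))  # a cut-off entry fails the check
```

The digest reveals nothing readable about the sequence, and any entry whose length differs from the expected digest length can be flagged as truncated without further context.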

Disadvantages

Allelic sequences are lost through hashing.
The database can only be queried in a limited way: a query either matches an exact hashsum or it doesn't.
The database does not state whether any one allele conforms to any one rule. For example, it is unknown if a particular allele is bound by start and stop sites.
There is a lot of work ahead of us.
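
The exact-match limitation can be sketched in a few lines of Python. This is only an illustration: SHA-256 is an assumed hash, and `digest`, `lookup`, and the `known_hashes` set are hypothetical stand-ins for however a real database stores and checks its alleles.

```python
import hashlib


def digest(sequence: str) -> str:
    """Hash an allele sequence (assumption: SHA-256 over the uppercased sequence)."""
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


def lookup(sequence: str, known_hashes: set) -> bool:
    """Return True only if the sequence's hash is already in the database."""
    return digest(sequence) in known_hashes


# Hypothetical database holding one known allele.
known_hashes = {digest("ATGGCTAAAGGTTAA")}

print(lookup("ATGGCTAAAGGTTAA", known_hashes))  # identical sequence: hit
print(lookup("ATGGCTAAAGGTTAG", known_hashes))  # one base differs: miss
```

Because a single-base difference produces an unrelated hash, a miss says nothing about how close the query is to any known allele. Near matches have to be found another way, for example by comparing against the reference alleles in refs.fasta.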

Database format

In the db folder, each scheme has these files.

refs.fasta - reference alleles for each locus
alleles.tsv - information on each allele
clusters.tsv - information on clusters. Clusters could be outbreak codes or something else, such as allele codes.
profiles.tsv - each sample and its alleles

The specification is at docs/specification.md

Example

python

mkdir -v db
python3 scripts/digestFasta.py t/senterica/*.tfa --out db/senterica.dbhpy --force

perl

mkdir -v db
perl scripts/digestFasta.pl t/senterica/*.tfa --out db/senterica.dbhpl --force

Installation

Clone the repo
Put scripts into your PATH

Usage

To add your own database, use this repo as a template, add your database using the scripts, and publish the result as a new repo on a Git hosting site such as GitHub.

lskatz / mlst-hash-template