This repository is a template for creating a new hash-based MLST database.
Why?
We want a shared space for MLST alleles, with mechanisms to add, remove, and curate those alleles.
No solution to this problem is perfect, so here are the advantages and disadvantages of our approach.
Advantages
- Contextualize genomes with what else is out there
- Alleles are hashed and so sequence data are not revealed
- The hash has a fixed length, so it is easy to check whether an allele's hash has been truncated.
- Frees the database from dependence on funding sources.
- Git repo!
- ... can be copied and/or made decentralized easily.
- ... can be versioned
- ... can be forked - individuals or institutions can decide to have their own database
- ... can be pushed - new alleles or loci can be updated
- ... can be pulled - databases can update with the latest alleles or loci
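The hashing and fixed-length points above can be sketched roughly as follows. This is only an illustration, assuming SHA-256 hex digests over the upper-cased sequence; the hash function and any normalization actually used are defined by the scripts and docs/specification.md, not by this sketch. `hash_allele` and `looks_truncated` are hypothetical helper names.

```python
import hashlib

# Assumption: SHA-256 is used purely for illustration; the real scheme may differ.
HASH_HEX_LENGTH = hashlib.sha256().digest_size * 2  # 64 hex characters


def hash_allele(sequence: str) -> str:
    """Return a hex digest for an allele sequence (hypothetical helper)."""
    normalized = sequence.strip().upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def looks_truncated(hashsum: str) -> bool:
    """A digest that is not exactly the expected length was cut off or corrupted."""
    return len(hashsum) != HASH_HEX_LENGTH


allele_hash = hash_allele("ATGACCGATTAG")
print(len(allele_hash), looks_truncated(allele_hash), looks_truncated(allele_hash[:20]))
```

Only the digest would be shared in this sketch, so the sequence itself is not revealed, and a length check is enough to flag a damaged entry.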
Disadvantages
- Allelic sequences are lost through hashing.
- The database can be queried in only a limited way: a query either matches an exact hashsum or it does not.
- The database does not state whether any one allele conforms to any one rule. For example, it is unknown if a particular allele is bound by start and stop sites.
- There is a lot of work ahead of us.
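The exact-match limitation can be illustrated with a small sketch. It assumes SHA-256 digests and an in-memory set of known hashes; `digest`, `known_hashes`, and `query` are hypothetical names and are not part of this repository's scripts.

```python
import hashlib


def digest(sequence: str) -> str:
    # Assumption: SHA-256 stands in for whatever hash the database uses.
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


# Hypothetical database contents: only hashes are stored, never sequences.
known_hashes = {digest("ATGACCGATTAG")}


def query(sequence: str) -> bool:
    """Return True only when the sequence hashes to a stored value."""
    return digest(sequence) in known_hashes


exact_hit = query("ATGACCGATTAG")
near_miss = query("ATGACCGATTAA")  # one base differs, so the digest is unrelated
print(exact_hit, near_miss)
```

A sequence that differs by a single base produces an unrelated digest, so there is no notion of a partial or closest match from the hashes alone.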
Database format
In the db folder, each scheme has these files.
refs.fasta
- reference alleles for each locus
alleles.tsv
- information on each allele
clusters.tsv
- information on clusters. Clusters could be outbreak codes. Or, they could be something else like allele codes.
profiles.tsv
- each sample and its alleles
The specification is at docs/specification.md
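As a rough sketch of how the tab-separated files might be read, the example below loads a table with a header row into dictionaries. The column names in the sample text (`sample`, `locus1`, `locus2`) and the helper `read_tsv` are hypothetical; the real columns, and whether a header row is present, are defined in docs/specification.md.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical stand-in for a profiles.tsv; real column names come from the specification.
example = io.StringIO("sample\tlocus1\tlocus2\nS1\tabc123\tdef456\n")
rows = read_tsv(example)
print(rows[0]["sample"], rows[0]["locus1"])
```

To read a real file instead, pass an open file handle, for example one opened with `open(path, newline="")`.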
Example
python
mkdir -pv db
python3 scripts/digestFasta.py t/senterica/*.tfa --out db/senterica.dbhpy --force
perl
mkdir -pv db
perl scripts/digestFasta.pl t/senterica/*.tfa --out db/senterica.dbhpl --force
Installation
- Clone the repo
- Put the scripts directory into your PATH
Usage
To add your own database, use this repo as a template and then add your database using the scripts.
Make a new repo from it, then upload it to a Git hosting site such as GitHub.