A series of scripts for quality control of alleles. Currently used for PomBase, but could be adapted.
It checks that:
# Install dependencies
poetry install
# Activate python environment
poetry shell
# Set up the necessary transvar variables (you must have installed transvar, see next section)
. transvar_env_vars.sh
bash set_up_transvar.sh
# Run this script (See the comments in the subscripts)
bash run_analysis.sh
To install the dependencies, we used poetry (see poetry installation instructions).
In the source directory run:
poetry install
This should create a folder .venv
with the python virtual environment. To activate the virtual environment, then run:
poetry shell
Now when you call python
, it will be the one from the .venv
.
This project uses transvar. This requires to install some binaries.
# If you have linux and you want to install them globally
sudo apt install -y samtools tabix
# Installing globally in mac
brew install htslib samtools
# If you want to install them locally (see the content of the script)
# > basically downloads the libs and uses make to build the necessary bin files, then deletes all unnecesary source code
bash install_transvar_dependencies_locally.sh
Then, regardless of whether you are using local or global installation of samtools
and tabix
:
# Env vars (see script)
. transvar_env_vars.sh
# Build the transvar database, and test that it works
bash set_up_transvar.sh
The best thing is to look at the script run_analysis.sh
, the subscripts are well documented.
These are used to interpret the allele descriptions, check that the sequence residues they refer to are correct, and to format the description correctly/
We define "syntax rules" representing the syntax of a type of mutation as dictionaries in a python list that we call a "grammar". The dictionaries are parsed into SyntaxRule
objects (see models.py).
A full grammar can be found in grammar.py, and the best is to go through that example and the tests to understand how it works. Below an example of a rule to represent several single aminoacid mutations, in the form of VP120AA
(Valine and Proline in position 120 and 121 replaced by Alanines).
aa = 'GPAVLIMCFYWHKRQNEDST'
aa = aa + aa.lower()
aa = f'[{aa}]'
{
'type': 'amino_acid_mutation',
'rule_name': 'single_aa',
'regex': f'(?<=\\b)({aa})(\d+)({aa})(?=\\b)',
'apply_syntax': lambda g: ''.join(g).upper(),
'check_sequence': lambda g, gg: check_sequence_single_pos(g, gg, 'peptide'),
'further_check': lambda g, gg: g[0] != g[2],
'format_for_transvar': lambda g, gg: [f'p.{g[0]}{g[1]}{g[2]}']
},
type
: the type of mutation (see below)rule_name
: type
+ rule_name
should be unique, it identifies the rule matching a particular mutation.regex
: a regular expression with a pattern that represents a permisive syntax pattern for this type of mutation.
{aa}
inside the python f string to represent any aminoacid.apply_syntax
: a function that takes a tuple of re.Match[str]
, representing a match in a string to the pattern described in regex
, and returns the correctly-formatted mutation.
VP120AA
, the groups are ('VP', '120', 'AA')
, and the function returns VP120AA
.lambda
, or outside of the dictionary.check_sequence
: a function that takes the two arguments below and checks whether the sequence position indicates exist (the index is within the boundaries of the sequence), and contain the indicated residue. If sequence is correct, should return empty string, otherwise, the residue or residues that are incorrect, see the examples in grammar.py
and the tests.
re.Match[str]
, representing a match in a string to the pattern described in regex
load_genome.py
further_check
: takes the same argument as above, and is used as an extra check on top of matching the regex
pattern. In this example, it checks that the parts before and after the number are different (e.g. VP120VP
will not be matched by this grammar rule, even if it matches the pattern).format_for_transvar
: takes the same arguments as above, and returns the transvar formatted variant.In PomBase we use categories for allele types, depending on the types of mutations they contain. These are described in a dictionary that uses frozenset
objects as keys in grammar.py
. For example:
allowed_types = {
frozenset({'amino_acid_mutation'}): 'amino_acid_mutation',
frozenset({'partial_amino_acid_deletion'}): 'partial_amino_acid_deletion',
frozenset({'amino_acid_mutation','partial_amino_acid_deletion'}): 'partial_amino_acid_deletion'
}
This is convenient because it allows to represent the fact that an allele that contains only the types of mutations indicated by a frozenset in the dictionary is of that type, e.g., if it has only amino_acid_mutation
, it is of type amino_acid_mutation
, if it contains amino_acid_mutation
and partial_amino_acid_deletion
, it is of type amino_acid_deletion_and_mutation
.
See config.json
, the variables included there are self-explaining. See also data/sgd/config.sgd.json
for SGD.
allele_parts
: the substrings of the allele_description
that match regex patterns in the grammar, separated by |
characters. For example E325A G338D
would result in E325A|G338D
.needs_fixing
: True
or False
depending on whether the allele needs fixing.change_description_to
: if the correct nomenclature differs from allele_description
, change_description_to
contains the right syntax. E.g. for E325A G338D
contains E325A,G338D
.rules_applied
: for each of the allele_parts, the syntax rule type
and name
as |
-delimited type:name
. E.g. for VP120AA,E325A
contains amino_acid_mutation:multiple_aa|amino_acid_mutation:single_aa
.pattern_error
: contains the parts of the allele that are not picked up by any regular expression in the grammar.invalid_error
: if the systematic_id or sequence of a gene product is missing.sequence_error
: contains an error if the indicated position does not exist or if the aminoacid indicated at a certain position is not correct. Values to be homogenised in the future.change_type_to
: if allele_type
is not right, contains the right type. This comes from using the frozenset
in grammar.py
.auto_fix_comment
: for alleles that can be auto-fixed, contains some info (e.g.):
syntax_error
type_error
syntax_and_type_error
multi_shift_fix
: For a given publication with 4 or more alleles with errors, if shifting all coordinates by the same amount fixes the error, the fix is accepted. This fix has the lowest priority.histone_fix
: histone indexes have been often counted ignoring the first methionine, so if a shift by +1 in index fixes the sequence error in a histone, it is accepted. This type of fix takes second priority.old_coords_fix, revision xxx: gene_coordinates
: if using old gene coordinates the error is fixed, the fix is accepted. This type of fix takes max priority, as it is in principle more reliable.solution_index
: Normally empty, but if more than one solution has been found to a sequence error, they have the index of the solution.Some of the alleles for which sequence_error
s are found might result from residue coordinates refering to previous gene structures. E.g. if the starting methionine has been changed, all residue coordinates are shifted. To fix this case, we use a genome change log produced with https://github.com/pombase/genome_changelog (for PomBase, see build_alignment_dict_from_genome.py
). For DNA sequence, since the probability of getting the right nucleotide by chance is ~25%, we cannot be sure it is safe to switch coordinates even if that gives the right nucleotide.
For sgd data, we use a method that uses only the protein sequences: https://github.com/pombase/all_previous_sgd_peptide_sequences, see build_alignment_dict_from_peptides.py
.
docker build -t allele_qc_api .
docker run -d --name apicontainer -p 8000:80 allele_qc_api
Then if you go to http://localhost:8000/ you should be redirected to the API documentation and you can run a test request directly there.