pombase / allele_qc

Quality control for PomBase alleles
MIT License

Allele QC for PomBase

A series of scripts for quality control of alleles. Currently used for PomBase, but could be adapted.

It checks that:

TL;DR

# Install dependencies
poetry install

# Activate python environment
poetry shell

# Set up the necessary transvar variables (you must have installed transvar, see next section)
. transvar_env_vars.sh
bash set_up_transvar.sh

# Run this script (See the comments in the subscripts)
bash run_analysis.sh

Installing

Python dependencies

To install the dependencies, we use poetry (see the poetry installation instructions).

In the source directory run:

poetry install

This should create a .venv folder containing the python virtual environment. To activate the virtual environment, run:

poetry shell

Now when you call python, it will be the one from the .venv.

Other dependencies and setting up transvar

This project uses transvar, which requires installing some binaries.


# If you have linux and you want to install them globally
sudo apt install -y samtools tabix

# Installing globally in mac
brew install htslib samtools

# If you want to install them locally (see the content of the script)
# > it basically downloads the libs, uses make to build the necessary binaries, then deletes all unnecessary source code
bash install_transvar_dependencies_locally.sh

Then, regardless of whether you are using local or global installation of samtools and tabix:

# Env vars (see script)
. transvar_env_vars.sh

# Build the transvar database, and test that it works
bash set_up_transvar.sh
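
Before building the transvar database, it can help to confirm that the binaries are visible from the active environment. The following is a minimal sketch, assuming that both the global and the local installation leave samtools and tabix on PATH once transvar_env_vars.sh has been sourced; the helper name find_binaries is hypothetical and not part of this repository.

```python
import shutil


def find_binaries(names=('samtools', 'tabix')):
    """Map each binary name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report which of the required binaries cannot be found
missing = [name for name, path in find_binaries().items() if path is None]
if missing:
    print('Not found on PATH:', ', '.join(missing))
else:
    print('samtools and tabix are available')
```

If a binary is reported as missing after a local installation, check that the environment variables script was sourced in the same shell.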

What the pipeline does

The best way to understand the pipeline is to read the script run_analysis.sh and the subscripts it calls, which are well documented.

Defining syntax rules in a grammar

These are used to interpret the allele descriptions, check that the sequence residues they refer to are correct, and format the description correctly.

We define "syntax rules" representing the syntax of a type of mutation as dictionaries in a python list that we call a "grammar". The dictionaries are parsed into SyntaxRule objects (see models.py).

A full grammar can be found in grammar.py; the best way to understand how it works is to go through that example and the tests. Below is an example of a rule that represents a single amino acid mutation, in the form V120A (valine at position 120 replaced by alanine).

aa = 'GPAVLIMCFYWHKRQNEDST'
aa = aa + aa.lower()
aa = f'[{aa}]'

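{
    'type': 'amino_acid_mutation',
    'rule_name': 'single_aa',
    # Original residue, position, new residue (backslashes are doubled because this is not a raw string)
    'regex': f'(?<=\\b)({aa})(\\d+)({aa})(?=\\b)',
    # g holds the regex groups, gg is the gene passed to the checks
    'apply_syntax': lambda g: ''.join(g).upper(),
    'check_sequence': lambda g, gg: check_sequence_single_pos(g, gg, 'peptide'),
    # Reject "mutations" where the residue does not change, e.g. V120V
    'further_check': lambda g, gg: g[0] != g[2],
    'format_for_transvar': lambda g, gg: [f'p.{g[0]}{g[1]}{g[2]}']
},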
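
To see how the fields of such a rule fit together, here is a minimal sketch that applies the rule to a description. It is not the code in models.py: the helper interpret is hypothetical, 'check_sequence' is left out because it needs a gene sequence, and None is passed in place of the gene argument.

```python
import re

aa = 'GPAVLIMCFYWHKRQNEDST'
aa = aa + aa.lower()
aa = f'[{aa}]'

# Same fields as the rule above, without 'check_sequence'
rule = {
    'type': 'amino_acid_mutation',
    'rule_name': 'single_aa',
    'regex': f'(?<=\\b)({aa})(\\d+)({aa})(?=\\b)',
    'apply_syntax': lambda g: ''.join(g).upper(),
    'further_check': lambda g, gg: g[0] != g[2],
    'format_for_transvar': lambda g, gg: [f'p.{g[0]}{g[1]}{g[2]}'],
}


def interpret(description, rule, gene=None):
    """Hypothetical helper: apply one rule to one allele description."""
    match = re.search(rule['regex'], description)
    if match is None:
        return None  # the description does not follow this syntax
    groups = match.groups()
    if not rule['further_check'](groups, gene):
        return None  # syntax matches, but the extra check rejects it
    return {
        'corrected': rule['apply_syntax'](groups),
        'transvar': rule['format_for_transvar'](groups, gene),
    }


print(interpret('V120A', rule))
print(interpret('V120V', rule))  # rejected by further_check
```

Keeping the regex, the formatting function and the checks in one dictionary means that supporting a new mutation syntax only requires adding an entry to the grammar.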

Defining allele categories

In PomBase we use categories for allele types, depending on the types of mutations they contain. These are described in a dictionary that uses frozenset objects as keys in grammar.py. For example:

allowed_types = {
    frozenset({'amino_acid_mutation'}): 'amino_acid_mutation',
    frozenset({'partial_amino_acid_deletion'}): 'partial_amino_acid_deletion',
    frozenset({'amino_acid_mutation', 'partial_amino_acid_deletion'}): 'amino_acid_deletion_and_mutation'
}

This is convenient because the set of mutation types found in an allele can be looked up directly, regardless of their order: an allele that contains exactly the types of mutations in one of the frozenset keys belongs to the corresponding category. For example, if it has only amino_acid_mutation, it is of type amino_acid_mutation; if it contains amino_acid_mutation and partial_amino_acid_deletion, it is of type amino_acid_deletion_and_mutation.
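
The lookup described above can be sketched as follows. This is an illustration only: the function allele_category is hypothetical, and the dictionary contains just the three entries discussed in this section, with the values the text describes.

```python
allowed_types = {
    frozenset({'amino_acid_mutation'}): 'amino_acid_mutation',
    frozenset({'partial_amino_acid_deletion'}): 'partial_amino_acid_deletion',
    frozenset({'amino_acid_mutation', 'partial_amino_acid_deletion'}): 'amino_acid_deletion_and_mutation',
}


def allele_category(mutation_types):
    """Hypothetical helper: return the category for the mutation types found
    in an allele, or None if the combination is not allowed."""
    # frozenset ignores order and duplicates, so any iterable of types works
    return allowed_types.get(frozenset(mutation_types))


print(allele_category(['amino_acid_mutation', 'amino_acid_mutation']))
print(allele_category(['partial_amino_acid_deletion', 'amino_acid_mutation']))
print(allele_category(['nucleotide_mutation']))  # not in this reduced dictionary
```

A frozenset is used instead of a set because dictionary keys must be hashable.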

Config file

See config.json; the variables included there are self-explanatory. See also data/sgd/config.sgd.json for SGD.

New columns in allele file after analysis

Optional - Using old coordinate changes for fixes

Some of the alleles for which sequence_errors are found might result from residue coordinates referring to previous gene structures. For example, if the starting methionine has changed, all residue coordinates are shifted. To fix such cases, we use a genome change log produced with https://github.com/pombase/genome_changelog (for PomBase, see build_alignment_dict_from_genome.py). For DNA sequences, since the probability of getting the right nucleotide by chance is ~25%, we cannot be sure it is safe to switch coordinates even if that gives the right nucleotide.

For SGD data, we use a method that relies only on the protein sequences: https://github.com/pombase/all_previous_sgd_peptide_sequences, see build_alignment_dict_from_peptides.py.
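
As an illustration of the shifted-coordinate case, here is a minimal sketch for the simplest scenario, in which the start methionine moved downstream so that the new peptide is a suffix of the old one. The scripts mentioned above build alignment dictionaries and are more general; the function names and the toy sequences below are hypothetical.

```python
def residue_matches(sequence, residue, position):
    """True if `sequence` has `residue` at the 1-based `position`."""
    return 0 < position <= len(sequence) and sequence[position - 1] == residue


def shift_coordinate(old_seq, new_seq, residue, position):
    """Return the position of the residue in new_seq, or None if no fix is found."""
    if residue_matches(new_seq, residue, position):
        return position  # the coordinate is already valid
    if not residue_matches(old_seq, residue, position):
        return None  # it does not match the old sequence either
    # Assumption of this sketch: the new sequence is a suffix of the old one
    if not old_seq.endswith(new_seq):
        return None
    offset = len(old_seq) - len(new_seq)
    new_position = position - offset
    return new_position if residue_matches(new_seq, residue, new_position) else None


old_peptide = 'MSTAMKVLPQ'  # toy sequence with the old start
new_peptide = 'MKVLPQ'      # same protein, start moved 4 residues downstream

print(shift_coordinate(old_peptide, new_peptide, 'V', 7))  # V7 in old coordinates
```

Requiring the residue to match the old sequence before shifting is what makes the fix reasonably safe for peptides; as noted above, the same check is much weaker for nucleotides.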

Running the API in Docker

docker build -t allele_qc_api .
docker run -d --name apicontainer -p 8000:80 allele_qc_api

Then if you go to http://localhost:8000/ you should be redirected to the API documentation and you can run a test request directly there.