Build AMR PRG - Githubissues

mbhall88 commented 3 years ago

The first step of this is how to go from a mykrobe panel to an input format for make_prg?

There are two files of importance within the mykrobe metadata and panel data.

variant_to_resistance_drug-jan-03-2019.json, which maps variants to the drugs they cause resistance to

Example

    "embB_Q497R": [
        "Ethambutol"
    ],
    "fabG1_A-16X": [
        "Isoniazid"
    ],
    "fabG1_C-15X": [
        "Isoniazid"
    ],
    "fabG1_CTG607CTA": [
        "Isoniazid"
    ],

tb-hunt-probe-set-jan-03-2019.fasta.gz, which has the probes mykrobe uses

Example

>ref-I194T?var_name=ATC1674781ACG&num_alts=1&ref=NC_000962.3&enum=0&gene=inhA&mut=I194T
ATCTCGTTGCCGCAGGCCCTATCCGGACGCTGGCGATGAGT
>alt-I194T?var_name=ATC1674781ACG&enum=0&gene=inhA&mut=I194T
TCTCGTTGCCGCAGGCCCTACGCGGACGCTGGCGATGAGT
>ref-T1273TA?var_name=T761079TA&num_alts=1&ref=NC_000962.3&enum=0&gene=rpoB&mut=T1273TA
TCGCCGCGATCAAGGAGTTCTTCGGCACCAGCCAGCTGAGC
>alt-T1273TA?var_name=T761079TA&enum=0&gene=rpoB&mut=T1273TA
CGCCGCGATCAAGGAGTTCTATCGGCACCAGCCAGCTGAGC

Ultimately, what we need to end up with is an MSA for each gene present in these files.

One "naive" approach to this is to go through each probe and pull out the var_name and gene fields. We can then infer the var_name position (which appears to be H37Rv position) within gene, apply the variant var_name describes,

One thing that will be a problem here is there are some big variants, and some where every possible codon change is applied. We have to figure out a rule to filter these as doing it by hand is not going to work. I guess we could parse the probe file into a VCF format and have some script that filters that VCF before we move to creating the MSA. This would give us the flexibility to very easily change the filtering.

iqbal-lab commented 3 years ago

Could use the input file (that defines the catalogue, from which Martin builds the probes). Also the list of mutations is a zupp file in the last mykrobe paper

iqbal-lab commented 3 years ago

Its final.panel.tsv in the wellcome open research paper i think. Can't quite see column headers in phone

mbhall88 commented 3 years ago

Here is my proposed method for building an AMR PanRG.

Add drugs to panel, so the final panel looks like

pncA    GTT547GTG       DNA    drugA;drugB
rpoB    M1L     PROT    drugC

Create VCF reference. To do this, we take the panel, a GFF annotation file, and the reference genome, and output a FASTA file with a chromosome for each gene in the panel. We also add padding to the start and end of each gene (default 100bp I'm thinking).
Convert panel to VCF. For each variant in the panel, we create a VCF record. For the protein variants, we create a list of all the possible ALT alleles. One thing I will need to check here is what to do with the position for the amino acid variants. For instance, amino acid variant K2R - is this with respect to the reading frame. i.e. if the protein is encoded on the reverse strand, and the gene is 100bp long, position 2 in amino acid space would be position 94 in nucleic acid space. We can encode all of the panel information in the VCF INFO field (including the strand the gene is transcribed on). Additionally, positions will need to be "normalised" to account for the padding added to each gene - this is also necessary as we can't encode negative positions in VCF (promoters).
Create MSA from VCF
Make PRG
Index with pandora

iqbal-lab commented 3 years ago

Sorry, realised I had not commented. Looks good to me

mbhall88 commented 3 years ago

Done!
I had to use a minimum match length of 5 when running make_prg otherwise pandora index was just stalled as rpoB and pncA has some regions with mutations every base, so this caused huge numbers of mosaics.

Performance for the whole drprg build:

max memory 2.5Gb
runtime 10mins with 8 threads

iqbal-lab commented 3 years ago

Awesome

mbhall88 commented 3 years ago

I am reopening this issue as it has become apparent that building a PRG directly from the mykrobe panel may not be ideal. Effectively, it is too dense/complex and does not contain common non-resistant-causing SNPs that can occur near resistance-causing ones. This has lead to a number of FNs and FPs.

@iqbal-lab has suggested the following

What do we want in the PRG. We want big hitter R/S alleles in there. We want common polymirphisms that are nearby in there. The latter is achieved by just building from subsampled cryptic VCF. The former does rather encourage us to choose samples which have catalog mutations

We are waiting to hear back from Martin/Jeff about

we have built big tables of samples and which alleles they have, so in principle, we ought to be able to just collect a minimal list of samples that contains the common resistance mutations

Failing this, another solution is to use the sparse PRG from #3

In some tests I have run locally, the sparse PRG allows us to recover a very common FN in gyrA that is caused by the fact that the is a natural polymorphism that occurs right next to a resistance-causing one. The long story short is I think the number of alleles at that site creates problems when trying to pick the ML path and we end up picking a very wrong ML path, which is then the basis for de novo. I know this is all hand-waivy. I created a small PRG of gyrA and rpsL (just so there was more than one gene...) from the cryptic VCF (#3). I broke the VCF into the variants overlapping those genes. Then, instead of creating a sequence for every VCF position, I created a sequence for every sample, applying all of the variants for that sample to the reference sequence for the gene. For gyrA this was generally 3-5 variants.
Using the PRG I built from that, I was able to recover the gyrA variant. I think the density of the panel PRG is working against us.

Plan

My proposed plan (regardless of which strategy we use for the population PRG) is as follows

A script that takes the panel (for a list of genes), a GFF, and the VCF to build the PRG from. The output is a VCF with only those records within the genes from the panel and the CHROM and POS adjusted to the gene they are within. There will be a padding parameter for sequence up and downstream, along with a way of limiting the size of indels we add.
Create reference sequences for each genes, with padding, and index with samtools faidx (this has already been done)
Script that takes a VCF (from 1) and reference sequences (from 2) and an optional list of samples (for if we decide to handpick samples based on common resistance variants). It runs bcftools consensus for each sample in the VCF (or only those provided). This applies the called variants for the sample to the provided reference sequence and outputs a consensus sequence for each gene. We then combine the consensus sequences for each gene, along with the reference sequence, into a "pre-MSA" file.
Run MAFFT on the pre-MSA from 3
Run make_prg on the MSAs from 4
Index the PRG with pandora

If the above PRG leads to better results, I will add an option to drprg build where you can provide a PRG, which skips building a PRG from the panel.

Question: Running bcftools consensus in step 3 has one important option

 -H, --haplotype 1|2|R|A|I|LR|LA|SR|SA|1pIu|2pIu

    choose which allele from the FORMAT/GT field to use (the codes are case-insensitive):

    1
        the first allele, regardless of phasing 
    2
        the second allele, regardless of phasing 
    R
        the REF allele (in heterozygous genotypes) 
    A
        the ALT allele (in heterozygous genotypes) 
    I
        IUPAC code for all genotypes 
    LR, LA
        the longer allele. If both have the same length, use the REF allele (LR), or the ALT allele (LA) 
    SR, SA
        the shorter allele. If both have the same length, use the REF allele (SR), or the ALT allele (SA) 
    1pIu, 2pIu

        first/second allele for phased genotypes and IUPAC code for unphased genotypes

        This option requires *-s*, unless exactly one sample is present in the VCF

I am tentatively going to say we choose the ALT if it is a HET call. But happy to take other suggestions @iqbal-lab

iqbal-lab commented 3 years ago

This plan looks solid! in terms of the final question, as i understand it, this is only referring to situations where there is a het. i mean, those options which talk about "the first allele" or "the second allele" - these are about genotypes GT like this "1/2", right? ALT if it is a het call sounds fine. If it's a homozygous/haploid call, we take the called genotype allele, right?

mbhall88 commented 3 years ago

as i understand it, this is only referring to situations where there is a het.

Yes.

If it's a homozygous/haploid call, we take the called genotype allele, right?

Yep.

mbhall88 commented 2 years ago

No longer in the scope of this project. Additionally, the AMR PanRG was suboptimal as we used the sparse PanRG from #3. See Chapter 4 in my thesis for more discussion.

mbhall88 / head_to_head_pipeline

Build AMR PRG #66

Plan