Closed mbhall88 closed 2 years ago
Could use the input file (that defines the catalogue, from which Martin builds the probes). Also the list of mutations is a zupp file in the last mykrobe paper
Its final.panel.tsv in the wellcome open research paper i think. Can't quite see column headers in phone
Here is my proposed method for building an AMR PanRG.
Add drugs to panel, so the final panel looks like
pncA GTT547GTG DNA drugA;drugB
rpoB M1L PROT drugC
Create VCF reference. To do this, we take the panel, a GFF annotation file, and the reference genome, and output a FASTA file with a chromosome for each gene in the panel. We also add padding to the start and end of each gene (default 100bp I'm thinking).
Convert panel to VCF. For each variant in the panel, we create a VCF record. For the protein variants, we create a list of all the possible ALT alleles. One thing I will need to check here is what to do with the position for the amino acid variants. For instance, amino acid variant K2R
- is this with respect to the reading frame. i.e. if the protein is encoded on the reverse strand, and the gene is 100bp long, position 2 in amino acid space would be position 94 in nucleic acid space. We can encode all of the panel information in the VCF INFO field (including the strand the gene is transcribed on). Additionally, positions will need to be "normalised" to account for the padding added to each gene - this is also necessary as we can't encode negative positions in VCF (promoters).
Create MSA from VCF
Make PRG
Index with pandora
Sorry, realised I had not commented. Looks good to me
Done!
I had to use a minimum match length of 5 when running make_prg
otherwise pandora index
was just stalled as rpoB and pncA has some regions with mutations every base, so this caused huge numbers of mosaics.
Performance for the whole drprg build
:
Awesome
I am reopening this issue as it has become apparent that building a PRG directly from the mykrobe panel may not be ideal. Effectively, it is too dense/complex and does not contain common non-resistant-causing SNPs that can occur near resistance-causing ones. This has lead to a number of FNs and FPs.
@iqbal-lab has suggested the following
What do we want in the PRG. We want big hitter R/S alleles in there. We want common polymirphisms that are nearby in there. The latter is achieved by just building from subsampled cryptic VCF. The former does rather encourage us to choose samples which have catalog mutations
We are waiting to hear back from Martin/Jeff about
we have built big tables of samples and which alleles they have, so in principle, we ought to be able to just collect a minimal list of samples that contains the common resistance mutations
Failing this, another solution is to use the sparse PRG from #3
In some tests I have run locally, the sparse PRG allows us to recover a very common FN in gyrA that is caused by the fact that the is a natural polymorphism that occurs right next to a resistance-causing one. The long story short is I think the number of alleles at that site creates problems when trying to pick the ML path and we end up picking a very wrong ML path, which is then the basis for de novo. I know this is all hand-waivy. I created a small PRG of gyrA and rpsL (just so there was more than one gene...) from the cryptic VCF (#3). I broke the VCF into the variants overlapping those genes. Then, instead of creating a sequence for every VCF position, I created a sequence for every sample, applying all of the variants for that sample to the reference sequence for the gene. For gyrA this was generally 3-5 variants.
Using the PRG I built from that, I was able to recover the gyrA variant.
I think the density of the panel PRG is working against us.
My proposed plan (regardless of which strategy we use for the population PRG) is as follows
samtools faidx
(this has already been done)bcftools consensus
for each sample in the VCF (or only those provided). This applies the called variants for the sample to the provided reference sequence and outputs a consensus sequence for each gene. We then combine the consensus sequences for each gene, along with the reference sequence, into a "pre-MSA" file.If the above PRG leads to better results, I will add an option to drprg build
where you can provide a PRG, which skips building a PRG from the panel.
Question: Running bcftools consensus in step 3 has one important option
-H, --haplotype 1|2|R|A|I|LR|LA|SR|SA|1pIu|2pIu
choose which allele from the FORMAT/GT field to use (the codes are case-insensitive):
1
the first allele, regardless of phasing
2
the second allele, regardless of phasing
R
the REF allele (in heterozygous genotypes)
A
the ALT allele (in heterozygous genotypes)
I
IUPAC code for all genotypes
LR, LA
the longer allele. If both have the same length, use the REF allele (LR), or the ALT allele (LA)
SR, SA
the shorter allele. If both have the same length, use the REF allele (SR), or the ALT allele (SA)
1pIu, 2pIu
first/second allele for phased genotypes and IUPAC code for unphased genotypes
This option requires *-s*, unless exactly one sample is present in the VCF
I am tentatively going to say we choose the ALT if it is a HET call. But happy to take other suggestions @iqbal-lab
This plan looks solid! in terms of the final question, as i understand it, this is only referring to situations where there is a het. i mean, those options which talk about "the first allele" or "the second allele" - these are about genotypes GT like this "1/2", right? ALT if it is a het call sounds fine. If it's a homozygous/haploid call, we take the called genotype allele, right?
as i understand it, this is only referring to situations where there is a het.
Yes.
If it's a homozygous/haploid call, we take the called genotype allele, right?
Yep.
No longer in the scope of this project. Additionally, the AMR PanRG was suboptimal as we used the sparse PanRG from #3. See Chapter 4 in my thesis for more discussion.
The first step of this is how to go from a
mykrobe
panel to an input format formake_prg
?There are two files of importance within the
mykrobe
metadata and panel data.variant_to_resistance_drug-jan-03-2019.json
, which maps variants to the drugs they cause resistance toExample
tb-hunt-probe-set-jan-03-2019.fasta.gz
, which has the probesmykrobe
usesExample
Ultimately, what we need to end up with is an MSA for each gene present in these files.
One "naive" approach to this is to go through each probe and pull out the
var_name
andgene
fields. We can then infer thevar_name
position (which appears to be H37Rv position) withingene
, apply the variantvar_name
describes,One thing that will be a problem here is there are some big variants, and some where every possible codon change is applied. We have to figure out a rule to filter these as doing it by hand is not going to work. I guess we could parse the probe file into a VCF format and have some script that filters that VCF before we move to creating the MSA. This would give us the flexibility to very easily change the filtering.