This PR pulls two things to master:
1) Code generation: All code-generation code has been updated to match master (there were previously some minor differences).
2) Formal contamination check: if we see multiple species, there is now a change in the models compared. When resistotyping at genes and mutations we now consider two coverages:
a) the expected coverage on the target species (as before), and
b) the maximum coverage across all non-target species (called contamination_covg).
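The second coverage can be sketched as below; the function name, data layout, and depth values are illustrative, not the actual implementation.

```python
# Hedged sketch: given per-species coverage for a sample, compute the
# maximum coverage on any non-target species (contamination_covg).
# Names and the dict layout here are hypothetical.

def contamination_covg(covg_by_species, target_species):
    """Maximum coverage observed on any species other than the target."""
    non_target = (c for s, c in covg_by_species.items() if s != target_species)
    return max(non_target, default=0)

# Example: one sample where S. aureus is the target species
covgs = {"S_aureus": 80, "S_epidermidis": 12, "S_haemolyticus": 3}
print(contamination_covg(covgs, "S_aureus"))              # 12
print(contamination_covg({"S_aureus": 80}, "S_aureus"))   # 0 (no contamination)
```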
Then, if contamination_covg > 0, we compare two models for S and take the maximum-likelihood (ML) one:
Is the resistant coverage due to contamination, with the S coverage coming from the target?
Is the resistant coverage due to errors, with the S coverage coming from the target?
We then take the more likely of these as our llk_S model.
And similarly for R:
Is the resistant coverage due to the target, with the S coverage coming from errors?
Is the resistant coverage due to the target, with the S coverage coming from contamination?
We take the more likely of these as our llk_R model.
The case with no contamination is unchanged.
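The model comparison above can be sketched with Poisson log-likelihoods over the resistant and susceptible allele coverages. This is a minimal sketch under that assumption: the real models, parameters, and names in the codebase may differ, and the depth values are illustrative.

```python
import math

def poisson_llk(observed, expected):
    """Poisson log-likelihood of an observed coverage given an expected depth."""
    expected = max(expected, 0.01)  # guard against log(0)
    return observed * math.log(expected) - expected - math.lgamma(observed + 1)

def llk_S(res_covg, sus_covg, target_covg, contam_covg, err_covg):
    # S hypothesis: the susceptible allele is covered by the target;
    # the resistant coverage is explained by contamination OR by error,
    # and we keep whichever sub-model is more likely.
    return poisson_llk(sus_covg, target_covg) + max(
        poisson_llk(res_covg, contam_covg),
        poisson_llk(res_covg, err_covg))

def llk_R(res_covg, sus_covg, target_covg, contam_covg, err_covg):
    # R hypothesis: the resistant allele is covered by the target;
    # the susceptible coverage is explained by error OR by contamination.
    return poisson_llk(res_covg, target_covg) + max(
        poisson_llk(sus_covg, err_covg),
        poisson_llk(sus_covg, contam_covg))

# A clearly resistant sample favours R; a contaminated susceptible sample
# (resistant-allele coverage matching the contaminant depth) favours S.
resistant = dict(res_covg=78, sus_covg=2, target_covg=80, contam_covg=10, err_covg=1)
contaminated_S = dict(res_covg=10, sus_covg=79, target_covg=80, contam_covg=10, err_covg=1)
print(llk_R(**resistant) > llk_S(**resistant))            # True
print(llk_S(**contaminated_S) > llk_R(**contaminated_S))  # True
```

The max() inside each hypothesis is what implements "take the most likely of these" from the description: each call holds one allele to the target and lets the other allele be explained by whichever of error or contamination fits better.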
Testing on Staph shows an improvement in specificity with a minor decrease in sensitivity. Will update with stats.
Testing on TB shows no change in specificity or sensitivity, as contamination is far rarer in those datasets.