lehner-lab / DiMSum

An error model and pipeline for analyzing deep mutational scanning (DMS) data and diagnosing common experimental pathologies
MIT License
28 stars 6 forks source link

Feature request: annotate genetic variants in standard nomenclature #3

Closed ijhoskins closed 2 years ago

ijhoskins commented 2 years ago

Hello @andrefaure,

It looks like DiMSum outputs variant sequences and amino acids as strings. Compact protein change annotations are available in the fitness_singles.txt and fitness_doubles.txt files, but as far as I can see changes at the genetic level are not reported in compact form (e.g. HGVS nomenclature).

Is there a way to extract this information from any of the serialized R objects? If not, is this something that could be supported?

Thanks!

andrefaure commented 2 years ago

Hi @ijhoskins,

DiMSum does not report changes at the genetic level for protein sequences, because the estimates are the result of aggregating reads, fitness and errors over synonyms.

If you would prefer not to merge synonyms simply specify '--sequenceType'='noncoding'. The resulting estimates and sequence changes will be reported at the DNA/genetic level e.g. singles as T25A, which can be easily reformatted to HGVS nomenclature. Note that in this case you will most likely have multiple fitness estimates for a given protein sequence (if in reality your target is a coding sequence).

Fitness estimates for higher order mutants (hamming distance>2) are available in the 'DiMSum_Project_fitness_replicates.RData' object (https://github.com/lehner-lab/DiMSum/blob/master/docs/FILEFORMATS.md#output-files). As you point out variant sequences are provided as strings - the 'compact' DNA/protein change annotations will need to be obtained manually by comparing them to the WT sequence. DiMSum does not currently output this representation for higher order mutants (hamming distance>2) because the variant sequence is often more compact.

Hope this helps!

ijhoskins commented 2 years ago

Hi @andrefaure thanks for your response. For my purposes I am interested in just the variant calls and not in the fitness estimates, so I will run with --sequenceType'='noncoding'. Just to clarify, in this mode higher-order HD=2 mutants (MNP/MNVs or haplotypes) will still be reported? Or do I need to access that subset of variants using "DiMSum_Project_fitness_replicates.RData"?

andrefaure commented 2 years ago

Hi @ijhoskins, if you're just interested in the variant calls then you only need to run stages 1-4. All variants (up to and including the maximum mutation order specified by the '--maxSubstitutions' argument) and their associated counts in all samples is output in the 'DiMSum_Project_variant_data_merge.tsv' file: https://github.com/lehner-lab/DiMSum/blob/master/docs/FILEFORMATS.md#output-files (unfortunately this file does not have compact nucleotide/protein change annotations i.e. all sequences are outputted as strings)