lehner-lab / DiMSum

An error model and pipeline for analyzing deep mutational scanning (DMS) data and diagnosing common experimental pathologies
MIT License

Cannot proceed with error modelling #8

Closed (shrutikhare-git closed 2 years ago)

shrutikhare-git commented 2 years ago

Hi, I am the same user who had issues regarding the absence of the WT sequence in the data. I have now reformatted the mutant counts file from my own pipeline and used it as VariantCount.txt. I am now getting the following error:

DiMSum STAGE 5 (STEAM): ANALYSE VARIANT COUNTS

Filtering out low count nucleotide variants... Done
Aggregating counts for biological output replicates... Done
Fit error model... Error: Cannot proceed with error modelling: insufficent number of variants satisfying full fitness range
Execution halted

What does this mean and how can I solve it? Thank you.

andrefaure commented 2 years ago

Hi @shrutivijayk,

There is a minimum number (and type) of variants that are required in order to proceed with fitting the fitness error model.

The type corresponds to variants that:

  1. are observed at least once (i.e. at least one read) in all input and output samples
  2. have a sufficient number of input reads to cover the full (empirical) fitness range. With a low number of input reads, the lower end of the fitness range (detrimental variants) is not properly estimated. DiMSum therefore estimates a minimum input read count threshold from how the fitness distribution changes with increasing input counts.

The minimum number is defined as 30 * #experiment_replicates, so if you have 3 experiment_replicates you need at least 90 variants that are observed at least once in all samples and have input read counts above the threshold defined in [2] above.
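The two criteria and the minimum count above can be written as a small check. This is only a sketch in Python, not DiMSum's own code (DiMSum is an R package). The function names and data layout are hypothetical, `min_input_count` stands in for the threshold that DiMSum estimates from the data, and applying that threshold to every input sample with `>=` is an assumption.

```python
from typing import List, Sequence, Tuple

# One variant = (input read counts per replicate, output read counts per replicate).
Variant = Tuple[Sequence[int], Sequence[int]]


def min_variants_required(n_replicates: int, per_replicate: int = 30) -> int:
    """Minimum number of qualifying variants: 30 * number of experiment replicates."""
    return per_replicate * n_replicates


def count_qualifying(variants: List[Variant], min_input_count: int) -> int:
    """Count variants that satisfy both criteria described above."""
    n = 0
    for inputs, outputs in variants:
        # Criterion 1: at least one read in all input and output samples.
        seen_everywhere = all(c >= 1 for c in list(inputs) + list(outputs))
        # Criterion 2 (assumed form): every input count reaches the threshold.
        deep_enough = all(c >= min_input_count for c in inputs)
        if seen_everywhere and deep_enough:
            n += 1
    return n


def can_fit_error_model(
    variants: List[Variant], n_replicates: int, min_input_count: int
) -> bool:
    """True if enough variants qualify to attempt the error model fit."""
    required = min_variants_required(n_replicates)
    return count_qualifying(variants, min_input_count) >= required
```

With 3 experiment replicates, `min_variants_required(3)` gives 90, matching the example above. A dataset with ~700 variants can still fail this check if most of them have input counts below the estimated threshold.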

The error you are getting means that the data currently does not meet these requirements. This normally happens when very few variants are retained from previous stages.

You can check the output of stage 4, specifically the read count table in the file "DiMSum_Project_variant_data_merge.tsv" (https://github.com/lehner-lab/DiMSum/blob/master/docs/FILEFORMATS.md#output-files) to see how many variants have been retained, what their counts are in different samples and whether this is what you expected. You can attach it here if small enough or send a link to share it from some other location if you like.
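A quick way to get an overview of such a read count table is sketched below in Python. It is an illustration under assumptions, not part of DiMSum: the function name is hypothetical, and the code assumes that the read count columns can be recognised by a `_count` suffix and that missing values appear as empty strings or `NA`. Check the FILEFORMATS documentation linked above for the actual column names.

```python
import csv
import io
from typing import Dict, TextIO


def summarise_counts(handle: TextIO, count_suffix: str = "_count") -> Dict[str, object]:
    """Summarise a tab-separated variant count table read from an open file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Assumption: read count columns share a common suffix.
    count_cols = [c for c in (reader.fieldnames or []) if c.endswith(count_suffix)]
    totals = {c: 0 for c in count_cols}
    n_variants = 0
    n_in_all = 0
    for row in reader:
        n_variants += 1
        counts = []
        for col in count_cols:
            raw = (row[col] or "").strip()
            # Treat empty cells and "NA" as zero reads.
            value = 0 if raw in ("", "NA") else int(float(raw))
            counts.append(value)
            totals[col] += value
        # Variants observed at least once in every sample.
        if count_cols and all(c >= 1 for c in counts):
            n_in_all += 1
    return {
        "n_variants": n_variants,
        "n_observed_in_all_samples": n_in_all,
        "total_reads_per_sample": totals,
    }


# Small in-memory demo table (made-up values, hypothetical column names).
demo_table = io.StringIO(
    "nt_seq\tinput1_count\toutput1_count\n"
    "A\t5\t2\n"
    "C\t0\t3\n"
    "G\t7\tNA\n"
)
demo_summary = summarise_counts(demo_table)
```

For a real file, open it with `open(path, newline="")` and pass the handle to `summarise_counts`. Comparing the number of variants observed in all samples with the total number of rows shows how many variants are lost to the first criterion alone.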

Hope this helps!

shrutikhare-git commented 2 years ago

Thanks for your reply. I have submitted my own VariantCount file to DiMSum for this run. It has ~700 variants (synonymous + nonsynonymous) which are present at least once in all 6 conditions (3 inputs and 3 outputs from 3 biological replicates). I used no explicit cutoffs for fitnessMinInputCountAll etc., so the defaults should be 0 and this should not remove any variants.

I have attached the file here. The first row contains the WT counts. (Sorry, I had to remove the sequence information as our data is not yet published.) Thanks for your help.

andrefaure commented 2 years ago

Hi @shrutivijayk,

I had a look and this first plot below is fitness vs input read count (log scale) for your data with the threshold chosen by DiMSum indicated with the vertical dashed line:

[Image: dimsum_stage_fitness_report_1_errormodel_fitness_inputcounts_1]

You can see that very few variants satisfy this threshold and in general fitness is quite strongly anti-correlated with input counts - this suggests you probably need to sequence deeper.

This is the same plot for another dataset with sufficient sequencing depth. Above a certain input read count threshold, fitness is "well-behaved", i.e. it shows no obvious correlation with input read counts (which is desirable of course if we want the estimates to report on selection and not simply abundance in the input):

[Image: dimsum_stage_fitness_report_1_errormodel_fitness_inputcounts_2]
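The relationship shown in these plots can also be checked numerically. The Python sketch below is a simplified illustration, not DiMSum's estimator: it assumes fitness is the log ratio of output to input counts relative to WT, uses a plain Pearson correlation against log10 input counts, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def log_ratio_fitness(n_in: int, n_out: int, wt_in: int, wt_out: int) -> float:
    """Simplified fitness: ln(output/input) of the variant minus the same for WT."""
    return math.log(n_out / n_in) - math.log(wt_out / wt_in)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either variable is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def fitness_vs_input_correlation(
    variants: List[Tuple[int, int]], wt: Tuple[int, int], min_input_count: int = 1
) -> float:
    """Correlate fitness with log10 input count for variants at or above a threshold.

    variants: (input_count, output_count) per variant; wt: (wt_input, wt_output).
    """
    # Variants with zero reads are skipped because the log ratio is undefined.
    kept = [(i, o) for i, o in variants if i >= max(min_input_count, 1) and o >= 1]
    log_inputs = [math.log10(i) for i, _ in kept]
    fitness = [log_ratio_fitness(i, o, wt[0], wt[1]) for i, o in kept]
    return pearson(log_inputs, fitness)
```

A strongly negative value across most of the input count range would correspond to the first plot, while a value near zero above some `min_input_count` would correspond to the "well-behaved" regime in the second.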

Hope this helps!

(I'm going to close this issue now because I think it is clear that this is not an issue with DiMSum but rather an issue with the data you are trying to analyse.)

shrutikhare-git commented 2 years ago

Thank you. The experiment is not a growth-based assay. Only a subset of mutants is expected to survive or become enriched upon selection. Can DiMSum be used to analyse such data? Is there any way to change the input variant count cutoff that DiMSum uses?