lehner-lab / DiMSum

An error model and pipeline for analyzing deep mutational scanning (DMS) data and diagnosing common experimental pathologies
MIT License

Error: Cannot proceed with variant processing: No substitution variants found #12

Closed: ijhoskins closed this issue 1 year ago

ijhoskins commented 1 year ago

Hi @andrefaure

I ran into two errors while attempting to run DiMSum stages 4-5 on an externally provided count table.

An example experiment design file and counts file are attached. I confirmed that the input sequences are all the same length.

DiMSum_WSN_HA_exp_file.txt

WSN_HA_DiMSum.raw.counts.subsamp.txt

When I run the following command, I get the error "No substitution variants found":

DiMSum --countPath=WSN_HA_DiMSum.raw.counts.txt --experimentDesignPath=DiMSum_WSN_HA_exp_file.txt --outputPath=dimsum_results --startStage=4 --sequenceType=coding --mixedSubstitutions=T --indels=none --fitnessMinInputCountAll=1 --wildtypeSequence=aagcaggggaaaataaaaacaaccaaaATGAAGGCAAAACTACTGGTCCTGTTATATGCATTTGTAGCTACAGATGCAGACACAATATGTATAGGCTACCATGCGAACAACTCAACCGACACTGTTGACACAATACTCGAGAAGAATGTGGCAGTGACACATTCTGTTAACCTGCTCGAAGACAGCCACAACGGGAAACTATGTAAATTAAAAGGAATAGCCCCACTACAATTGGGGAAATGTAACATCACCGGATGGCTCTTGGGAAATCCAGAATGCGACTCACTGCTTCCAGCGAGATCATGGTCCTACATTGTAGAAACACCAAACTCTGAGAATGGAGCATGTTATCCAGGAGATCTCATCGACTATGAGGAACTGAGGGAGCAATTGAGCTCAGTATCATCATTAGAAAGATTCGAAATATTTCCCAAGGAAAGTTCATGGCCCAACCACACATTCAACGGAGTAACAGTATCATGCTCCCATAGGGGAAAAAGCAGTTTTTACAGAAATTTGCTATGGCTGACGAAGAAGGGGGATTCATACCCAAAGCTGACCAATTCCTATGTGAACAATAAAGGGAAAGAAGTCCTTGTACTATGGGGTGTTCATCACCCGTCTAGCAGTGATGAGCAACAGAGTCTCTATAGTAATGGAAATGCTTATGTCTCTGTAGCGTCTTCAAATTATAACAGGAGATTCACCCCGGAAATAGCTGCAAGGCCCAAAGTAAGAGATCAACATGGGAGGATGAACTATTACTGGACCTTGCTAGAACCCGGAGACACAATAATATTTGAGGCAACTGGTAATCTAATAGCACCATGGTATGCTTTCGCACTGAGTAGAGGGTTTGAGTCCGGCATCATCACCTCAAACGCGTCAATGCATGAGTGTAACACGAAGTGTCAAACACCCCAGGGAGCTATAAACAGCAATCTCCCTTTCCAGAATATACACCCAGTCACAATAGGAGAGTGCCCAAAATATGTCAGGAGTACCAAATTGAGGATGGTTACAGGACTAAGAAACATCCCATCCATTCAATACAGAGGTCTATTTGGAGCCATTGCTGGTTTTATTGAGGGGGGATGGACTGGAATGATAGATGGATGGTATGGTTATCATCATCAGAATGAACAGGGATCAGGCTATGCAGCGGATCAAAAAAGCACACAAAATGCCATTAACGGGATTACAAACAAGGTGAACTCTGTTATCGAGAAAATGAACACTCAATTCACAGCTGTGGGTAAAGAATTCAACAACTTAGAAAAAAGGATGGAAAATTTAAATAAAAAAGTTGATGATGGGTTTCTGGACATTTGGACATATAATGCAGAATTGTTAGTTCTACTGGAAAATGAAAGGACTTTGGATTTCCATGACTTAAATGTGAAGAATCTGTACGAGAAAGTAAAAAGCCAATTAAAGAATAATGCCAAAGAAATCGGAAATGGGTGTTTTGAGTTCTACCACAAGTGTGACAATGAATGCATGGAAAGTGTAAGAAATGGGACTTATGATTATCCAAAATATTCAGAAGAATCAAAGTTGAACAGGGAAAAGATAGATGGAGTGAAATTGGAATCAATGGGGGTGTATCAGATTCTGGCGATCTACTCAACTGTCGCCAGTTCACTGGTGCTTTTGGTCTCCCTGGGGGCAATCAGTTTCTGGATGTGTTCTAATGGGTCTTTGCAGTGCAGAATATGCATCTGAgattaggatttcagaaatataaggaaaa
acaccc

******* DiMSum wrapper command-line arguments *******

runDemo                 FALSE
fastqFileExtension      .fastq
gzipped                 TRUE
stranded                TRUE
paired                  TRUE
barcodeErrorRate        0.25
experimentDesignPath    DiMSum_WSN_HA_exp_file.txt
experimentDesignPairDuplicates
                        FALSE
countPath               WSN_HA_DiMSum.raw.counts.txt
cutadaptMinLength       50
cutadaptErrorRate       0.2
cutadaptOverlap         3
vsearchMinQual          30
vsearchMaxee            0.5
vsearchMinovlen         10
outputPath              dimsum_results
projectName             DiMSum_Project
wildtypeSequence        aagcaggggaaaataaaaacaaccaaaATGAAGGCAAAACTACTGGTCCTGTTATATGCATTTGTAGCTACAGATGCAGACACAATATGTATAGGCTACCATGCGAACAACTCAACCGACACTGTTGACACAATACTCGAGAAGAATGTGGCAGTGACACATTCTGTTAACCTGCTCGAAGACAGCCACAACGGGAAACTATGTAAATTAAAAGGAATAGCCCCACTACAATTGGGGAAATGTAACATCACCGGATGGCTCTTGGGAAATCCAGAATGCGACTCACTGCTTCCAGCGAGATCATGGTCCTACATTGTAGAAACACCAAACTCTGAGAATGGAGCATGTTATCCAGGAGATCTCATCGACTATGAGGAACTGAGGGAGCAATTGAGCTCAGTATCATCATTAGAAAGATTCGAAATATTTCCCAAGGAAAGTTCATGGCCCAACCACACATTCAACGGAGTAACAGTATCATGCTCCCATAGGGGAAAAAGCAGTTTTTACAGAAATTTGCTATGGCTGACGAAGAAGGGGGATTCATACCCAAAGCTGACCAATTCCTATGTGAACAATAAAGGGAAAGAAGTCCTTGTACTATGGGGTGTTCATCACCCGTCTAGCAGTGATGAGCAACAGAGTCTCTATAGTAATGGAAATGCTTATGTCTCTGTAGCGTCTTCAAATTATAACAGGAGATTCACCCCGGAAATAGCTGCAAGGCCCAAAGTAAGAGATCAACATGGGAGGATGAACTATTACTGGACCTTGCTAGAACCCGGAGACACAATAATATTTGAGGCAACTGGTAATCTAATAGCACCATGGTATGCTTTCGCACTGAGTAGAGGGTTTGAGTCCGGCATCATCACCTCAAACGCGTCAATGCATGAGTGTAACACGAAGTGTCAAACACCCCAGGGAGCTATAAACAGCAATCTCCCTTTCCAGAATATACACCCAGTCACAATAGGAGAGTGCCCAAAATATGTCAGGAGTACCAAATTGAGGATGGTTACAGGACTAAGAAACATCCCATCCATTCAATACAGAGGTCTATTTGGAGCCATTGCTGGTTTTATTGAGGGGGGATGGACTGGAATGATAGATGGATGGTATGGTTATCATCATCAGAATGAACAGGGATCAGGCTATGCAGCGGATCAAAAAAGCACACAAAATGCCATTAACGGGATTACAAACAAGGTGAACTCTGTTATCGAGAAAATGAACACTCAATTCACAGCTGTGGGTAAAGAATTCAACAACTTAGAAAAAAGGATGGAAAATTTAAATAAAAAAGTTGATGATGGGTTTCTGGACATTTGGACATATAATGCAGAATTGTTAGTTCTACTGGAAAATGAAAGGACTTTGGATTTCCATGACTTAAATGTGAAGAATCTGTACGAGAAAGTAAAAAGCCAATTAAAGAATAATGCCAAAGAAATCGGAAATGGGTGTTTTGAGTTCTACCACAAGTGTGACAATGAATGCATGGAAAGTGTAAGAAATGGGACTTATGATTATCCAAAATATTCAGAAGAATCAAAGTTGAACAGGGAAAAGATAGATGGAGTGAAATTGGAATCAATGGGGGTGTATCAGATTCTGGCGATCTACTCAACTGTCGCCAGTTCACTGGTGCTTTTGGTCTCCCTGGGGGCAATCAGTTTCTGGATGTGTTCTAATGGGTCTTTGCAGTGCAGAATATGCATCTGAgattaggatttcagaaatataaggaaaaacaccc
reverseComplement       FALSE
sequenceType            coding
mutagenesisType         random
transLibrary            FALSE
transLibraryReverseComplement
                        FALSE
bayesianDoubleFitness   FALSE
bayesianDoubleFitnessLamD
                        0.025
fitnessMinInputCountAll
                        1
fitnessMinInputCountAny
                        0
fitnessMinOutputCountAll
                        0
fitnessMinOutputCountAny
                        0
fitnessHighConfidenceCount
                        10
fitnessDoubleHighConfidenceCount
                        50
fitnessNormalise        TRUE
fitnessErrorModel       TRUE
indels                  none
maxSubstitutions        2
mixedSubstitutions      TRUE
retainIntermediateFiles
                        FALSE
splitChunkSize          3758096384
retainedReplicates      all
startStage              4
stopStage               5
numCores                1
help                    FALSE

******* Running DiMSum pipeline *******

Package version         1.2.0
R version               R version 4.0.3 (2020-10-10)

******* Binary dependency versions *******

Pandoc                  pandoc 1.17.2

******* DiMSum STAGE 4: PROCESS VARIANT SEQUENCES *******

Loading variant count files:
WSN_HA_DiMSum.raw.counts.txt
Processing...
    WSN_HA_DiMSum.raw.counts.txt
Processing merged variants...
Error: Cannot proceed with variant processing: No substitution variants found
Execution halted

Looking into the code base, line 86 of https://github.com/lehner-lab/DiMSum/blob/0badbf6616ea32cc1b8b17ed34e3d6ae691583d7/R/dimsum__process_merged_variants.R is the line throwing the error.

When I run with --indels=all, I get past that error but then run into another one:

Processing...
    WSN_HA_DiMSum.raw.counts.txt
Processing merged variants...
Error in mapply(dimsum__hamming_distance, nt_seq, wt_ntseq) : 
  zero-length inputs cannot be mixed with those of non-zero length
Calls: dimsum ... dimsum__process_merged_variants -> [ -> [.data.table -> eval -> eval -> mapply
Execution halted

I tried writing the coding sequence in upper case and the flanking primer sequences in lower case (as in the command above). A fully uppercase wild-type sequence causes the same error.

Do you know what might be the problem?

Thanks! Ian

ijhoskins commented 1 year ago

Hi @andrefaure, I just realized that the WT sequence I provided included a stop codon, whereas the sequences in the counts file lack it. After providing the WT sequence without the stop codon, everything runs fine! I did not realize that overhangs would cause problems during alignment.
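
For anyone hitting the same error, a quick pre-flight check outside DiMSum can catch this kind of length mismatch before starting at stage 4. The snippet below is only a rough sketch and is not part of DiMSum: the helper name `length_mismatches` is made up for illustration, the toy sequences are invented (they are not taken from the attached files), and it assumes the counts table is tab-separated with a header row and a sequence column named `nt_seq`.

```python
import csv
import io
from collections import Counter


def length_mismatches(counts_file, wildtype, seq_column="nt_seq"):
    """Tally variant sequence lengths that differ from the wild-type length.

    Hypothetical helper (not a DiMSum function). Assumes a tab-separated
    table with a header row and a sequence column called `seq_column`.
    Returns a dict mapping each mismatching length to its number of rows.
    """
    reader = csv.DictReader(counts_file, delimiter="\t")
    wt_len = len(wildtype)
    tally = Counter()
    for row in reader:
        seq_len = len(row[seq_column])
        if seq_len != wt_len:
            tally[seq_len] += 1
    return dict(tally)


# Toy data for illustration only: the wild type carries a stop codon (TGA),
# while the counted variants do not.
wt_with_stop = "ATGAAGGCATGA"
wt_without_stop = "ATGAAGGCA"
toy_counts = "nt_seq\tinput1\toutput1\nATGAAGGCA\t100\t90\nATGAAGGCT\t12\t3\n"

# Every variant is 9 nt long, so both rows mismatch the 12-nt wild type.
print(length_mismatches(io.StringIO(toy_counts), wt_with_stop))
# With the stop codon trimmed from the wild type, nothing mismatches.
print(length_mismatches(io.StringIO(toy_counts), wt_without_stop))
```

To run it on a real table, open the counts file with `open(path, newline="")` and pass the file handle in place of the `io.StringIO` object. An empty result means every variant has the same length as the wild-type sequence passed to `--wildtypeSequence`.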

andrefaure commented 1 year ago

Great, I'm glad you managed to resolve the issue!