Error: Cannot proceed with variant processing: WT variant not found

shrutikhare-git commented 2 years ago

I am trying to use the pipeline to process raw data of a single mutant library of a protein containing synonymous mutations. I want to get the fitness scores and error estimates. The library design is such that sequences have the form - optional NNN+barcode+primer+gene sequence. I have tried several versions of the following command - "DiMSum --fastqFileDir home/data_files --experimentDesignPath experimentDesign.txt --barcodeDesignPath barcodeDesign.txt --cutadapt5First=GTATGTCAAAAAGATCTGTGC --cutadapt5Second=GGTGTAAACTTTAAACTGCAT --vsearchMinQual 20 --mixedSubstitutions T --sequenceType noncoding --wildtypeSequence ATGATGATG(this is an example sequence)",

I end up getting the same error every time - "Error: Cannot proceed with variant processing: WT variant not found" What does this error indicate? Is there some issue due to the presence of NNNs? Please let me know if there is anything wrong with my command. Thank you.

andrefaure commented 2 years ago

Hi @shrutivijayk , sorry to hear you're getting this error. Could you please share your barcodeDesign.txt file? I think the problem might be the optional NNN before the barcode. DiMSum assumes reads start with barcodes so you will need to enumerate all 1+4^3=65 possibilities in this file i.e. barcode, AAA+barcode, AAC+barcode, AAG+barcode etc. For example if the barcode is GAACGTTC:

pair1 pair2 barcode new_pair_prefix laneA_1_sequence.txt.gz laneA_2_sequence.txt.gz GAACGTTC Input_Rep1_t1_read laneA_1_sequence.txt.gz laneA_2_sequence.txt.gz AAAGAACGTTC Input_Rep1_t2_read laneA_1_sequence.txt.gz laneA_2_sequence.txt.gz AACGAACGTTC Input_Rep1_t3_read laneA_1_sequence.txt.gz laneA_2_sequence.txt.gz AAGGAACGTTC Input_Rep1_t4_read laneA_1_sequence.txt.gz laneA_2_sequence.txt.gz AATGAACGTTC Input_Rep1_t5_read laneA_1_sequence.txt.gz laneA_2_sequence.txt.gz ACAGAACGTTC Input_Rep1_t6_read laneA_1_sequence.txt.gz laneA_2_sequence.txt.gz AGAGAACGTTC Input_Rep1_t7_read laneA_1_sequence.txt.gz laneA_2_sequence.txt.gz ATAGAACGTTC Input_Rep1_t8_read etc.

The alternative would be to demultiplex FastQ files yourself (before running dimsum). Hope this helps!

shrutikhare-git commented 2 years ago

Thanks for your prompt response. I tried specifying all NNN combinations, however, I get the error - Duplicate 'new_pair_prefix' values not allowed in barcodeDesign file. Does this mean that if I specify all possibilities, I need to assign unique names to all the pairs? so they would all be treated as separate replicates? Thank you. (sharing my old barcodeDesign file here) barcodeDesign.txt !

andrefaure commented 2 years ago

Hi @shrutivijayk, yes you need to assign unique names e.g.:

pair1   pair2   barcode new_pair_prefix
puc57CcdALib_1.fastq.gz puc57CcdALib_2.fastq.gz CGGGAA  input_rep1_t1_read
puc57CcdALib_1.fastq.gz puc57CcdALib_2.fastq.gz AAACGGGAA   input_rep1_t2_read
puc57CcdALib_1.fastq.gz puc57CcdALib_2.fastq.gz AACCGGGAA   input_rep1_t3_read
...

and then in the experiment design file you need to include them as follows:

sample_name experiment_replicate    selection_id    selection_replicate technical_replicate pair1   pair2
input1  1   0       1   input_rep1_t1_read1.fastq   input_rep1_t1_read2.fastq
input1  1   0       2   input_rep1_t2_read1.fastq   input_rep1_t2_read2.fastq
input1  1   0       3   input_rep1_t3_read1.fastq   input_rep1_t3_read2.fastq
...

In other words, each distinct sample barcode is associated with a different technical replicate. Obviously you can reuse the technical replicate numbers for different experiment (biological) replicates - as long as the names themselves are unique. Internally, the read counts for different technical replicates of the same experiment (biological) replicate are simply summed.

I know this is clunky and I want to make this simpler in future but hope this helps in the meantime!

shrutikhare-git commented 2 years ago

Thanks again. I tried this and am now getting the error - The following files required for DiMSum STAGE 1 do not exist. ./DiMSum_Project/tmp/0_demultiplex/input_rep3_t3_read1.fastq ./DiMSum_Project/tmp/0_demultiplex/input_rep3_t4_read1.fastq . . Ensure that the 'startStage' argument has been correctly specified and that all previous stages have successfully completed. If you previously ran DiMSum with 'retainIntermediateFiles' set to FALSE, you will need set 'startStage' to 0 in order to regenerate the files required for this stage. Error: Cannot proceed with DiMSum STAGE 1. Required files do not exist. Execution halted

I think some NNN combinations do not exist in my file and hence the file did not get generated which is causing this error. Is that right? Thanks again..

andrefaure commented 2 years ago

Hi @shrutivijayk, yes that seems to be the problem. If you simply remove those missing NNN combinations from the barcode and experiment design files that should solve the issue.

shrutikhare-git commented 2 years ago

Hi, I tried demultiplexing myself and am still getting the error about WT variant not being found. The program is suggesting sequences with some minor mutations compared to my WT sequence. The last few lines of the error are as follows -

Processing merged variants... WT variant not found. Did you mean to specify one of the following? nt_seq 1: TATGAAGCAG...GGACTGGTGAA 2: TATGAAGCAGC...ACTGGTGAA 3: TATGAAGCA...AGAACAGGGACTGGTGAA 4: TATGAAG...ACTGGTGAA 5: TATGAAGC...CTGGTGAA all_reads mean_count 1: TRUE 20 2: TRUE 18 3: TRUE 18 4: TRUE 14 5: TRUE 13 Error: Cannot proceed with variant processing: WT variant not found Execution halted

Can you please suggest what I can try next? Does this mean that DimSum cannot find a single WT sequence in the data? Thank you.

andrefaure commented 2 years ago

Hi @shrutivijayk, sorry to hear you're still having trouble processing your data with DiMSum.

Indeed that error means that not a single WT sequence is observed in the data after trimming constant regions, merging paired-end reads and filtering (according to default/specified parameters).

Did you try a simple manual search (.e.g. grep) for your WT sequence in your raw FastQ files?

shrutikhare-git commented 2 years ago

Yes, I can see multiple WT sequences in the data. I am currently trying different --vsearchMinQual and --vsearchMaxee values hoping to get some output. Any other parameters that I can adjust here? Thanks again!

andrefaure commented 2 years ago

You might also need to look at tweaking the cutadapt parameters if you are not observing a high % of correctly trimmed reads in stage 2 (see corresponding section in the html report).

lehner-lab / DiMSum

Error: Cannot proceed with variant processing: WT variant not found #4