lehner-lab / DiMSum

An error model and pipeline for analyzing deep mutational scanning (DMS) data and diagnosing common experimental pathologies
MIT License
28 stars 6 forks source link

Error in stage 1 #2

Closed ijhoskins closed 3 years ago

ijhoskins commented 3 years ago

Hello,

I'm trying to run the WRAP step but ran into an issue on stage 1. Looks like two issues: 1) object dimsum_meta_new_report not found 2) Temp FASTQs not found

Here is my design file (I have no selection and am just trying to call variants on one sample):

sample_name experiment_replicate    selection_id    selection_replicate technical_replicate pair1   pair2
Input   1   0       1   CBS_read_gen_10k.R1.fq  CBS_read_gen_10k.R2.fq
Output  1   1   1   1   CBS_read_gen_10k.R1.fq  CBS_read_gen_10k.R2.fq

Here is my call:

dimsum --fastqFileDir ~/data_staging/CBS_read_gen_validation_set/ --experimentDesignPath dimsum_config_clean.txt --experimentDesignPairDuplicates T --stranded T --gzipped F -o . -p CBS_read_gen_read_inducer_dimsum --mutagenesisType codon --wildtypeSequence GTGCTCTGCCACGAGACATTGGCCGGCGCCGGCCTCCGCCCCCGCCTCTCCGCCAATCACCGCCCGCCTGCTCCCCTCGCCGTGGGTCCCCGCGAGGCCGCCACCCCGGGGTCGCCGTCTCCGCCTCGCCGCAGTCGGGGCAGCCGCTCGCCCCTCTTTTCCATGTATCCGTCCAGGATCCCATGACAGATTCTGTTGTCACGTCTCCTTACAGAGTTTGAGCGGTGCTGAACTGTCAGCACCATCTGTCCGGTCCCAGCATGCCTTCTGAGACCCCCCAGGCAGAAGTGGGGCCCACAGGCTGCCCCCACCGCTCAGGGCCACACTCGGCGAAGGGGAGCCTGGAGAAGGGGTCCCCAGAGGATAAGGAAGCCAAGGAGCCCCTGTGGATCCGGCCCGATGCTCCGAGCAGGTGCACCTGGCAGCTGGGCCGGCCTGCCTCCGAGTCCCCACATCACCACACTGCCCCGGCAAAATCTCCAAAAATCTTGCCAGATATTCTGAAGAAAATCGGGGACACCCCTATGGTCAGAATCAACAAGATTGGGAAGAAGTTCGGCCTGAAGTGTGAGCTCTTGGCCAAGTGTGAGTTCTTCAACGCGGGCGGGAGCGTGAAGGACCGCATCAGCCTGCGGATGATTGAGGATGCTGAGCGCGACGGGACGCTGAAGCCCGGGGACACGATTATCGAGCCGACATCCGGGAACACCGGGATCGGGCTGGCCCTGGCTGCGGCAGTGAGGGGCTATCGCTGCATCATCGTGATGCCAGAGAAGATGAGCTCCGAGAAGGTGGACGTGCTGCGGGCACTGGGGGCTGAGATTGTGAGGACGCCCACCAATGCCAGGTTCGACTCCCCGGAGTCACACGTGGGGGTGGCCTGGCGGCTGAAGAACGAAATCCCCAATTCTCACATCCTAGACCAGTACCGCAACGCCAGCAACCCCCTGGCTCACTACGACACCACCGCTGATGAGATCCTGCAGCAGTGTGATGGGAAGCTGGACATGCTGGTGGCTTCAGTGGGCACGGGCGGCACCATCACGGGCATTGCCAGGAAGCTGAAGGAGAAGTGTCCTGGATGCAGGATCATTGGGGTGGATCCCGAAGGGTCCATCCTCGCAGAGCCGGAGGAGCTGAACCAGACGGAGCAGACAACCTACGAGGTGGAAGGGATCGGCTACGACTTCATCCCCACGGTGCTGGACAGGACGGTGGTGGACAAGTGGTTCAAGAGCAACGATGAGGAGGCGTTCACCTTTGCCCGCATGCTGATCGCGCAAGAGGGGCTGCTGTGCGGTGGCAGTGCTGGCAGCACGGTGGCGGTGGCCGTGAAGGCCGCGCAGGAGCTGCAGGAGGGCCAGCGCTGCGTGGTCATTCTGCCCGACTCAGTGCGGAACTACATGACCAAGTTCCTGAGCGACAGGTGGATGCTGCAGAAGGGCTTTCTGAAGGAGGAGGACCTCACGGAGAAGAAGCCCTGGTGGTGGCACCTCCGTGTTCAGGAGCTGGGCCTGTCAGCCCCGCTGACCGTGCTCCCGACCATCACCTGTGGGCACACCATCGAGATCCTCCGGGAGAAGGGCTTCGACCAGGCGCCCGTGGTGGATGAGGCGGGGGTAATCCTGGGAATGGTGACGCTTGGGAACATGCTCTCGTCCCTGCTTGCCGGGAAGGTGCAGCCGTCAGACCAAGTTGGCAAAGTCATCTACAAGCAGTTCAAACAGATCCGCCTCACGGACACGCTGGGCAGGCTCTCGCACATCCTGGAGATGGACCACTTCGCCCTGGTGGTGCACGAGCAGATCCAGTACCACAGCACCGGGAAGTCCAGTCAGCGGCAGATGGTGTTCGGGGTGGTCACCGCCATTGACTTGCTGAACTTCGTGGCCGCCCAGGAGCGGGACCAGAAGTGAAGTCCGGAGCGCTGGGCGGTGCGGAGCGGGCCCGCCACCCTTGCCCACTTCTCCTTCGCTTTCCTGAGCCCTAAACACACGCGTGATTGGTAACTGCCTGGCCTGGCACCGTTATCCCTGCACACGGCACAGAGCATCCGTCTCCCCTCGTTAACACATGGCTTCCTAAATGGCCCTGTTTACGGCCTATGAGATGAAATATGTGATTTTCTCTAATGTAACTTCCTCTTAGGATGTTTCACCAAGGAAATATTGAGAGAGAAGTCGGCCAGGTAGGATGAACACAGGCAATGACTGCGCAGAGTGGATTAAAGGCAAAAGAGAGAAGAGTCCAGGAAGGGGCGGGGAGAAGCCTGGGTGGCTCAGCATCCTCCACGGGCTGCGCCGTCTGCTCGGGGCTGAGCTGGCGGGAGCAGTTTGCGTGTTTGGGTTTTTTAATTGAGATGAAATTCAAATAACCTAAAAATCAATCACTTGAAAGTGAACAATCAGCGGCATTTAGTACATCCAGAAAGTTGTGTAGGCACCACCTCTGTCACGTTCTGGAACATTCTGTCATCACCCCGTGAAGCAATCATTTCCCCTCCCGTCTTCCTCCTCCCCTGGCAACTGCTGATCGACTTTGTGTCTCTGTTGTCTAAAATAGGTTTTCCCTGTTCTGGACATTTCATATAAATGGAATCACACAA

DiMSum STAGE 1: ASSESS READ QUALITY Running FASTQC on all files: /Users/ianhoskins/data_staging/CBS_read_gen_validation_set/CBS_read_gen_10k.R1.fq /Users/ianhoskins/data_staging/CBS_read_gen_validation_set/CBS_read_gen_10k.R2.fq Processing... CBS_read_gen_10k.R1.fq CBS_read_gen_10k.R2.fq The following files do not exist. ./CBS_read_gen_read_inducer_dimsum/tmp/1_qualitycontrol/CBS_read_gen_10k.R1.fq ./CBS_read_gen_read_inducer_dimsum/tmp/1_qualitycontrol/CBS_read_gen_10k.R2.fq There were problems while running 'dimsum__fastqc_report' Error in dimsum_stage_fastqc(dimsum_meta = pipeline[["0_demultiplex"]], : object 'dimsum_meta_new_report' not found Calls: dimsum -> dimsum_stage_fastqc In addition: Warning message: In file(con, "r") : cannot open file './CBS_read_gen_read_inducer_dimsum/tmp/1_qualitycontrol/CBS_read_gen_10k.R1.fq': No such file or directory Execution halted

On another note, I was trying to run with --stranded F but get the following error which I wasn't clear on: Error: Unstranded library but sequence of 5' constant regions not found for some samples. Please check that the corresponding experiment design file columns are correct.

My reads were generated in silico and have integer qnames. An example:

@0000001369 RG:Z:CBS_read_gen_10k.dna.r
GGGTTGCTGGCGTTGCGGTACTGGTCTAGGATGTGAGAATTGGGGATTTCGTTCTTCAGCCGCCAGGCCACCCCCACGTGTGACTCCGGGGAGTCGAACCTGGCATTGGTGGGCGTCCTCACAATCTCAGCCCCCAGTGCCCGCAGCACG
+
SSSSSSSSSSOKJNRRORMSNKRRLLSQLSIKQSSOPKQOQSQMSPLKOIPLMMNNJOOJILNPPMRSLLLIIJNSSKSPQJOKKJMJKRINMRPJLJJIRNKQJKPKLIJJMIOJQSKOLPQLJRQOPKQQMROKMPPRMRISSSSSSS
andrefaure commented 3 years ago

Hi Ian,

Please try rerunning including the correct fastq file extension of the supplied input files i.e.

--fastqFileExtension ".fq"

Regarding the error you obtained with --stranded F, this option is only relevant if you supply constant regions, i.e. the typically part of the sequence not subjected to mutagenesis but required for primer binding and isolation/amplification of variables regions. With --stranded set to FALSE DiMSum will then determine the strandedness of paired reads from the sequence i.e. mate pairs are switched if necessary.

You can find a description of the relevant options here: https://github.com/lehner-lab/DiMSum/blob/master/docs/ARGUMENTS.md#trim-arguments

Seeing as your reads are generated in silico it should be trivial to add such sequences to the beginning of the simulated reads. Be sure to then specify these accordingly with e.g. --cutadapt5First and --cutadapt5Second.

ijhoskins commented 3 years ago

Hi @andrefaure thank you for your reply and recommendations. I included --fastqFileExtension ".fq" which resolved 2); but I am still seeing error for 1) above:

There were problems while running 'dimsum__fastqc_report' Error in dimsum_stage_fastqc(dimsum_meta = pipeline[["0_demultiplex"]], : object 'dimsum_meta_new_report' not found Calls: dimsum -> dimsum_stage_fastqc Execution halted

I am not fully clear on how the --stranded option has to do with constant regions. My in silico read pairs have R1s align to both + and - strands. Are you saying I should run with --stranded FALSE and include constant regions on 5' ends of R1 and R2 for this to properly enumerate variant counts?

Thank you!

andrefaure commented 3 years ago

Hi @ijhoskins

I have managed to reproduce this error and will fix the associated bug in the next update. It seems to be due to the behaviour of FastQC (a dependency of DiMSum) on low numbers of reads or those with low base quality diversity at certain positions. This may or may not also be a peculiarity of your simulated reads.

Yes, the --stranded option will use the supplied constant regions to determine the orientation of the sequenced fragment for each read pair.

However, perhaps you could explain a bit more what you are trying to achieve by using DiMSum? The pipeline may not be suitable for your purposes. It appears that you are expecting DiMSum to align your reads to your supplied WT (reference) sequence. Is this correct?

Rather, DiMSum simply aligns read pairs to each other using VSEARCH: https://github.com/torognes/vsearch

A typical DMS experiment involves characterising variants by sequencing the entire variable/mutated region (or corresponding barcodes thereof).

ijhoskins commented 3 years ago

Hi @andrefaure yes my simulated reads have the first and last ~10 bp with the same base quality, which may be what's causing the problem with FastQC.

I think you are right in that I misunderstood what DiMSum is doing. Do you mean you do pairwise alignments of each read to the wild-type sequence (as opposed to pairwise alignment of the paired reads)? Otherwise I am not sure what constitutes a variant if it is not relative to some reference. In any case, my inputs contain multiple interleaved amplicon tiles across a large mutagenized target (see Sun and Weile et al. 2017, 2020). If I understand you correctly DiMSum expects reads from a single amplicon? If that is the case I could probably still utilize DiMSum but I would have to partition my reads into amplicon-specific bins.

andrefaure commented 3 years ago

Hi @ijhoskins DiMSum does indeed expect reads from a single amplicon and in the case of PE sequencing does pairwise alignment of the read pairs. Variants are called by comparison of the resulting alignments to the supplied WT sequence.

So yes you could either: (1) partition your reads into amplicon-specific bins prior to running DiMSum or (2) simply run DiMSum on all reads but with settings that effectively restrict reads to a given amplicon of interest e.g. "--wildtypeSequence" corresponding to the amplicon of interest, "--indels = none" (the default) and "--maxSubstitutions" chosen to filter out all but the amplicon of interest. I expect you would also need to specify amplicon-specific constant regions.

Hope this helps.

andrefaure commented 3 years ago

This bug has now been fixed in version 1.2.7.

SamRamesh commented 2 years ago

Hi there! I've been trying to process paired end reads using the software and have been running into error 1) described in this thread where the fastqc has failed. I was wondering if you could check if the bug has reappeared in the current version of DiMSum (1.2.9)? Thanks!

andrefaure commented 2 years ago

Hi @SamRamesh, in order to help resolve this issue could you please share your call to dimsum (+arguments) as well as the resulting output and error?

ijhoskins commented 2 years ago

Hi @SamRamesh and @andrefaure I note that I was able to reproduce this error recently and successfully ran starting at the first stage.

andrefaure commented 2 years ago

Thanks @ijhoskins, so the fix was rerunning from stage 1?

@SamRamesh Does this fix the issue? Rerunning from the first stage is indeed a good idea after upgrading dimsum.

SamRamesh commented 2 years ago

Thanks @andrefaure and @ijhoskins for the tip, however it didn't fix the issue for me! After a bit of playing around, I believe the issue was with the program not being able to read my input fastq files.

The files were zipped with a .fastq.gz extension, so I first attempted to run it including the argument --fastqFileExtension '.fastq.gz', as the documentation advised the default was '.fastq', but got the error: Error: Invalid 'fastqFileExtension' argument (only alphanumeric characters allowed)

I then tried changing the names of my input files to .fastq instead (but without actually unzipping), but this gave me the missing fastqc files error again.

Finally running it with the files in the original .fastq.gz name and without including a --fastqFileExtension argument let the program run to completion 😊

andrefaure commented 2 years ago

Glad to hear you managed to resolve your problems @SamRamesh. Indeed in your case the defaults for --fastqFileExtension (".fastq") and --gzipped ("T") are correct so you needn't specify/change them.

I will look at removing these options entirely in future versions as they can be detected automatically and this should help to avoid confusion.

suzmcder commented 1 year ago

Hi @andrefaure, I'm running into the same issue as @SamRamesh. Their fix (running it with the files in the original .fastq.gz format, and named as such in the design file, but without including a --fastqFileExtension or --gzipped argument) hasn't worked for me-wondering if the bug has reappeared? Many thanks!

suzmcder commented 1 year ago

Ps. here is my output:

Running DiMSum pipeline Package version 1.3 R version R version 4.1.0 (2021-05-18) Binary dependency versions cutadapt 4.3 FastQC FastQC v0.12.1 Pandoc pandoc 1.12.3.1 VSEARCH vsearch v2.22.1_linux_x86_64 starcode starcode-v1.4 2020-11-02 DiMSum STAGE 0 (WRAP): DEMULTIPLEX READS Skipping this stage (assuming all fastq files already demultiplexed) DiMSum STAGE 1 (WRAP): ASSESS READ QUALITY Running FASTQC on all files: /home/smcde1/BF/data/R1D0_S1_L001_R1_001.fastq.gz /home/smcde1/BF/data/R1D4_S3_L001_R1_001.fastq.gz /home/smcde1/BF/data/R2D0_S2_L001_R1_001.fastq.gz /home/smcde1/BF/data/R2D4_S5_L001_R1_001.fastq.gz /home/smcde1/BF/data/R1D0_S1_L001_R2_001.fastq.gz /home/smcde1/BF/data/R1D4_S3_L001_R2_001.fastq.gz /home/smcde1/BF/data/R2D0_S2_L001_R2_001.fastq.gz /home/smcde1/BF/data/R2D4_S5_L001_R2_001.fastq.gz Processing... R1D0_S1_L001_R1_001.fastq.gz R1D4_S3_L001_R1_001.fastq.gz R2D0_S2_L001_R1_001.fastq.gz R2D4_S5_L001_R1_001.fastq.gz R1D0_S1_L001_R2_001.fastq.gz R1D4_S3_L001_R2_001.fastq.gz R2D0_S2_L001_R2_001.fastq.gz R2D4_S5_L001_R2_001.fastq.gz There were problems while running 'dimsumfastqc_report' Error in dimsum_stage_fastqc(dimsum_meta = pipeline[["0_demultiplex"]], : object 'dimsum_meta_new_report' not found In addition: Warning messages: 1: In dimsumfastqc_report(dimsum_meta = dimsum_meta_new, report_outpath = report_outpath) : NAs introduced by coercion 2: In lapply(lapply(temp_out_data, "[", -1), as.numeric) : NAs introduced by coercion 3: In dimsumfastqc_report(dimsum_meta = dimsum_meta_new, report_outpath = report_outpath) : NAs introduced by coercion 4: In lapply(lapply(temp_out_data, "[", -1), as.numeric) : NAs introduced by coercion 5: In dimsumfastqc_report(dimsum_meta = dimsum_meta_new, report_outpath = report_outpath) : NAs introduced by coercion 6: In lapply(lapply(temp_out_data, "[", -1), as.numeric) : NAs introduced by coercion 7: In dimsum__fastqc_report(dimsum_meta = dimsum_meta_new, report_outpath = report_outpath) : NAs introduced by coercion 8: In lapply(lapply(temp_out_data, "[", -1), as.numeric) : NAs introduced by coercion

andrefaure commented 1 year ago

Hi @suzmcder , there is likely a problem with your FASTQ files, for example they might be corrupted or truncated.

Please check the FASTQC standard output and error in the following files:

YOUR_PROJECT_NAME/tmp/1_qualitycontrol/fastqc.stdout YOUR_PROJECT_NAME/tmp/1_qualitycontrol/fastqc.stderr

and paste here any errors that you encounter.

suzmcder commented 1 year ago

Hi @andrefaure, thanks for your reply! I have an update-I had previously ran some FASTQ files with an older version (1.2.7-0), and so I went back to run my current files with that version, and everything ran just fine.

andrefaure commented 1 year ago

Hi @suzmcder,

I managed to trace this problem to the latest version of FastQC (version 0.12) which is a dependency of DiMSum.

To get around this you can either:

[1] simply downgrade to FastQC 0.11 in the dimsum conda env:

conda activate dimsum
conda install -c bioconda fastqc=0.11

or

[2] delete the dimsum conda env and do a fresh install with FastQC 0.11:

conda remove --name dimsum --all
conda create -n dimsum r-base=4.0 fastqc=0.11 r-dimsum

Let me know if this works for you.

suzmcder commented 1 year ago

Hi @andrefaure, I tried option 2 and this worked-thanks so much for your help!