liyu95 / DeepSimulator

The first deep learning based Nanopore simulator which can simulate the process of Nanopore sequencing.

[feature suggestion] adding an option to specify desired read coverage? #13

Closed yjx1217 closed 5 years ago

yjx1217 commented 5 years ago

Hello,

Would it be possible to add an option for specifying the desired read coverage for a simulation? Currently, there is only an option to specify the exact number of reads to simulate. I think a complementary option for specifying coverage would be very helpful in many simulation settings. Thanks for your consideration!

Best, Jia-Xing

evgisa commented 5 years ago

Hi! According to the paper (https://academic.oup.com/bioinformatics/article/34/17/2899/4962495):

To run the simulator, the user just need to input a reference genome or assembled contigs, specifying the coverage or the number of reads.

So this parameter probably already exists. However, I'm not sure which one it is, even after reading the supplementary documentation. I'd also really appreciate it if someone could clarify. Have a good day!

yjx1217 commented 5 years ago

Hi @evgisa,

Thanks for the info. I've checked the full option list (see below) and there is no option corresponding to coverage. For now, I wrote my own wrapper to calculate the number of reads needed based on the input genome size, my specified coverage, and a mean read length of 4400 bp (measured from two independent runs). It seems to work well. But it would be good if the developers could provide a direct option for specifying coverage.

> ./deep_simulator.sh 
DeepSimulator v0.21 [Mar-14-2019] 
    A Deep Learning based Nanopore simulator which can simulate the process of Nanopore sequencing. 

USAGE:  ./deep_simulator.sh <-i input_genome> [-n simu_read_num] [-o out_root] [-c CPU_num] [-m sample_mode] [-M simulator] 
                [-C cirular_genome] [-u tune_sampling] [-e event_std] [-f filter_freq] [-s noise_std] [-P perfect] [-H home] 
Options:

***** required arguments *****
-i input_genome   : input genome in FASTA format. 

***** optional arguments *****
-n simu_read_num  : the number of reads need to be simulated. [default = 100] 
                    Set -1 to simulate the whole input sequence without cut (not suitable for genome-level). 

-o out_root       : Default output would the current directory. [default = './${input_name}_DeepSimu'] 

-c CPU_num        : Number of processors. [default = 8]

-m sample_mode    : choose from the following distribution for the read length. [default = 3] 
                    1: beta_distribution, 2: alpha_distribution, 3: mixed_gamma_dis. 

-M simulator      : choose either context-dependent(0) or context-independent(1) simulator. [default = 1] 

-C cirular_genome : 0 for linear genome and 1 for circular genome. [default = 0] 

-u tune_sampling  : 1 for tuning sampling rate to around eight and 0 for not. [default = 1] 

-e event_std      : set the standard deviation (std) of the random noise of the event. [default = 1.0] 

-f filter_freq    : set the frequency for the low-pass filter. [default = 850] 

-s noise_std      : set the standard deviation (std) of the random noise of the signal. [default = 1.5] 
                    '1.0' would give the base-calling accuracy around 92\%, 
                    '1.5' would give the base-calling accuracy around 90\%, 
                    '2.0' would give the base-calling accuracy around 85\%, 

-P perfect        : 0 for normal mode (with length repeat and random noise). [default = 0]
                    1 for perfect context-dependent pore model (without length repeat and random noise). 
                    2 for generating almost perfect reads without any randomness in signals (equal to -e 0 -f 0 -s 0). 

-H home           : home directory of DeepSimulator. [default = 'current directory'] 
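
The wrapper calculation described above can be sketched roughly as follows. This is only an illustration, not the actual wrapper: the function names are hypothetical, the 4400 bp mean read length is the empirical value quoted above and may differ for other settings, and only the `-i` and `-n` options from the listing are used.

```python
import math


def genome_size(fasta_path):
    """Sum the sequence lengths in a FASTA file, skipping header lines."""
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def reads_for_coverage(genome_size_bp, coverage, mean_read_length=4400):
    """Number of reads needed so that reads * mean_length >= coverage * genome size.

    mean_read_length defaults to 4400 bp, the value measured in this thread
    (an assumption; re-measure it for other sampling modes or genomes).
    """
    if genome_size_bp <= 0 or coverage <= 0 or mean_read_length <= 0:
        raise ValueError("genome size, coverage and read length must be positive")
    return math.ceil(genome_size_bp * coverage / mean_read_length)


def build_command(fasta_path, n_reads):
    """Assemble the simulator call using only the documented -i and -n options."""
    return ["./deep_simulator.sh", "-i", fasta_path, "-n", str(n_reads)]
```

For example, a 12 Mb genome at 30x coverage gives `reads_for_coverage(12_000_000, 30)`, and the result can be passed to `build_command` to obtain the argument list for the simulator. Because read lengths are sampled from a distribution, the realized coverage will only approximate the target.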

Best, Jia-Xing

liyu95 commented 5 years ago

Hi, Jia-Xing, thank you very much for the suggestion!

The enhancement has been implemented. Feel free to check and criticize.