hasindu2008 / squigulator

a tool for simulating nanopore raw signal data
https://hasindu2008.github.io/squigulator
MIT License
58 stars 3 forks source link

seed is not working as expected? #16

Closed hiruna72 closed 3 months ago

hiruna72 commented 3 months ago

On dev branch when I create two slow5 files; one with preset -x rna-r9-prom and one without preset but providing all the arguments --digitisation 2048 --sample-rate 3000 --range 548.788269 --offset-mean -231.9440589 --offset-std 12.87185278 --median-before-mean 238.5286796 --median-before-std 21.1871794 --dwell-mean 43.0 --dwell-std 35.0 I get two different simulated regions even though I provide the same --seed 100. Is this expected?

hirsam@genometech-gpgpu:/data/hiruna/squigulator$ git checkout dev
Switched to branch 'dev'
Your branch is up to date with 'origin/dev'.
hirsam@genometech-gpgpu:/data/hiruna/squigulator$ make clean && make -j

hirsam@genometech-gpgpu:/data/hiruna/squigulator$ git log | head
commit 7318deb1ad65796fb72b15392e782e093a64e5e2
Author: Hasindu Gamaarachchi <hasindu2008@gmail.com>
Date:   Fri Apr 12 21:36:54 2024 +1000

    set the parameters in the rna004-min profile

commit 1a13e09bf3dab509b1f940657c2b1a119c65813e
Author: Hasindu Gamaarachchi <hasindu2008@gmail.com>
Date:   Fri Apr 12 21:14:21 2024 +1000

hirsam@genometech-gpgpu:/data/hiruna/squigulator$ ./squigulator --seed 100 -x rna-r9-prom /genome/gencode.v40.transcripts.fa -o b.slow5 -n 1 
[init_core::INFO] builtin RNA R9 nucleotide model loaded
[INFO] load_ref: Loaded 246624 reference sequences with total length 417.726536 Mbases
[INFO] print_model_stat: digitisation: 2048.0; sample_rate: 3000.0; range: 548.8; offset_mean: -231.9; offset_std: 12.9; dwell_mean: 43.0; dwell_std: 35.0
[INFO] sim_main: 1/1 reads done
[main] Version: 0.3.0-dirty
[main] CMD: ./squigulator --seed 100 -x rna-r9-prom -o b.slow5 -n 1 /genome/gencode.v40.transcripts.fa
[main] Real time: 0.952 sec; CPU time: 0.955 sec; Peak RAM: 0.432 GB

hirsam@genometech-gpgpu:/data/hiruna/squigulator$ ./squigulator --seed 100  --kmer-model r9.4_70bps.u_to_t_rna.5mer.template.model /genome/gencode.v40.transcripts.fa -o c.slow5 -n 1 --digitisation 2048 --sample-rate 3000 --range 548.788269 --offset-mean -231.9440589 --offset-std 12.87185278 --median-before-mean 238.5286796 --median-before-std 21.1871794 --dwell-mean 43.0 --dwell-std 35.0
[check_args::WARNING] Option --dwell-mean is ignored when --sample-rate is provided At src/sim.c:759
[INFO] sim_main: dwell_mean=43.0,sample_rate=3000.0,set bps=sample_rate/dwell_mean=70.0
[INFO] sim_main: bps=70.0,sample_rate=3000.0,set dwell_mean=sample_rate/bps=43.0
[read_model::WARNING] Reading the model from file r9.4_70bps.u_to_t_rna.5mer.template.model. This is an experimental feature. Use with caution At src/model.c:44
[read_model::INFO] k-mer size in file r9.4_70bps.u_to_t_rna.5mer.template.model is 5
[INFO] load_ref: Loaded 246624 reference sequences with total length 417.726536 Mbases
[INFO] print_model_stat: digitisation: 2048.0; sample_rate: 3000.0; range: 548.8; offset_mean: -231.9; offset_std: 12.9; dwell_mean: 43.0; dwell_std: 35.0
[INFO] sim_main: 1/1 reads done
[main] Version: 0.3.0-dirty
[main] CMD: ./squigulator --seed 100 --kmer-model r9.4_70bps.u_to_t_rna.5mer.template.model -o c.slow5 -n 1 --digitisation 2048 --sample-rate 3000 --range 548.788269 --offset-mean -231.9440589 --offset-std 12.87185278 --median-before-mean 238.5286796 --median-before-std 21.1871794 --dwell-mean 43.0 --dwell-std 35.0 /genome/gencode.v40.transcripts.fa
[main] Real time: 0.959 sec; CPU time: 0.961 sec; Peak RAM: 0.432 GB

hirsam@genometech-gpgpu:/data/hiruna/squigulator$ cut -f 1-7 c.slow5 | tail -n 1
S1_1!ENST00000651234.1|ENSG00000188157.15|OTTHUMG00000040778.7|OTTHUMT00000099680.4|AGRN-209|AGRN|7477|protein_coding|!833!4811!-       0       2048    -252.635553     548.788269      3000    182458

hirsam@genometech-gpgpu:/data/hiruna/squigulator$ cut -f 1-7 b.slow5 | tail -n 1
S1_1!ENST00000448179.1|ENSG00000230699.2|OTTHUMG00000040716.1|OTTHUMT00000097856.1|ENST00000448179|ENSG00000230699|3043|lncRNA|!0!3043!+        0       2048    -252.635553     548.788269      3000    139436
hasindu2008 commented 3 months ago

On commit 7318deb1ad65796fb72b15392e7:

hasindu@hasindu-3660:/mnt/d/hasindu2008.git/squigulator$ ./squigulator --seed 100 -x rna-r9-prom /mnt/d/genome//gencode.v40.transcripts.fa -o a.slow5 -n 1
[init_core::INFO] builtin RNA R9 nucleotide model loaded
[INFO] load_ref: Loaded 246624 reference sequences with total length 417.726536 Mbases
[INFO] print_model_stat: digitisation: 2048.0; sample_rate: 3000.0; range: 548.8; offset_mean: -231.9; offset_std: 12.9; dwell_mean: 43.0; dwell_std: 35.0
[INFO] sim_main: 1/1 reads done
[main] Version: 0.3.0-dirty
[main] CMD: ./squigulator --seed 100 -x rna-r9-prom -o a.slow5 -n 1 /mnt/d/genome//gencode.v40.transcripts.fa
[main] Real time: 0.586 sec; CPU time: 0.203 sec; Peak RAM: 0.430 GB

hasindu@hasindu-3660:/mnt/d/hasindu2008.git/squigulator$ ./squigulator --seed 100 -x rna-r9-prom /mnt/d/genome//gencode.v40.transcripts.fa -o b.slow5 -n 1 -t 1
[init_core::INFO] builtin RNA R9 nucleotide model loaded
[INFO] load_ref: Loaded 246624 reference sequences with total length 417.726536 Mbases
[INFO] print_model_stat: digitisation: 2048.0; sample_rate: 3000.0; range: 548.8; offset_mean: -231.9; offset_std: 12.9; dwell_mean: 43.0; dwell_std: 35.0
[INFO] sim_main: 1/1 reads done
[main] Version: 0.3.0-dirty
[main] CMD: ./squigulator --seed 100 -x rna-r9-prom -o b.slow5 -n 1 -t 1 /mnt/d/genome//gencode.v40.transcripts.fa
[main] Real time: 0.593 sec; CPU time: 0.156 sec; Peak RAM: 0.430 GB

hasindu@hasindu-3660:/mnt/d/hasindu2008.git/squigulator$ diff a.slow5 b.slow5
hasindu2008 commented 3 months ago

Does your diff pass?

hiruna72 commented 3 months ago

yes

hasindu2008 commented 3 months ago

Seems like it works for me, even with different options and k-mer models

hasindu@hasindu-3660:/mnt/d/hasindu2008.git/squigulator$ ./squigulator --seed 100 /mnt/d/genome/gencode.v40.transcripts.fa -o a.slow5 -n 1 --kmer-model ../f5c/test/r9-models/r9.4_70bps.u_to_t_rna.5mer.template.model
[read_model::WARNING] Reading the model from file ../f5c/test/r9-models/r9.4_70bps.u_to_t_rna.5mer.template.model. This is an experimental feature. Use with caution At src/model.c:44
[read_model::INFO] k-mer size in file ../f5c/test/r9-models/r9.4_70bps.u_to_t_rna.5mer.template.model is 5
[INFO] load_ref: Loaded 246624 reference sequences with total length 417.726536 Mbases
[INFO] print_model_stat: digitisation: 2048.0; sample_rate: 4000.0; range: 748.6; offset_mean: -237.4; offset_std: 14.2; dwell_mean: 9.0; dwell_std: 4.0
[INFO] sim_main: 1/1 reads done
[main] Version: 0.3.0-dirty
[main] CMD: ./squigulator --seed 100 -o a.slow5 -n 1 --kmer-model ../f5c/test/r9-models/r9.4_70bps.u_to_t_rna.5mer.template.model /mnt/d/genome/gencode.v40.transcripts.fa
[main] Real time: 1.372 sec; CPU time: 0.594 sec; Peak RAM: 0.429 GB

hasindu@hasindu-3660:/mnt/d/hasindu2008.git/squigulator$ ./squigulator --seed 100 /mnt/d/genome/gencode.v40.transcripts.fa -o b.slow5 -n 1 --digitisation 2048 --sample-rate 3000 --range 548.788269 --offset-mean -231.9440589 --offset-std 12.87185278 --median-before-mean 238.5286796 --median-before-std 21.1871794 --dwell-mean 43.0 --dwell-std 35.0 --kmer-model ../f5c/test/r9-models/r9.4_70bps.u_to_t_rna.5mer.template.model
[check_args::WARNING] Option --dwell-mean is ignored when --sample-rate is provided At src/sim.c:759
[INFO] sim_main: dwell_mean=43.0,sample_rate=3000.0,set bps=sample_rate/dwell_mean=70.0
[INFO] sim_main: bps=70.0,sample_rate=3000.0,set dwell_mean=sample_rate/bps=43.0
[read_model::WARNING] Reading the model from file ../f5c/test/r9-models/r9.4_70bps.u_to_t_rna.5mer.template.model. This is an experimental feature. Use with caution At src/model.c:44
[read_model::INFO] k-mer size in file ../f5c/test/r9-models/r9.4_70bps.u_to_t_rna.5mer.template.model is 5
[INFO] load_ref: Loaded 246624 reference sequences with total length 417.726536 Mbases
[INFO] print_model_stat: digitisation: 2048.0; sample_rate: 3000.0; range: 548.8; offset_mean: -231.9; offset_std: 12.9; dwell_mean: 43.0; dwell_std: 35.0
[INFO] sim_main: 1/1 reads done
[main] Version: 0.3.0-dirty
[main] CMD: ./squigulator --seed 100 -o b.slow5 -n 1 --digitisation 2048 --sample-rate 3000 --range 548.788269 --offset-mean -231.9440589 --offset-std 12.87185278 --median-before-mean 238.5286796 --median-before-std 21.1871794 --dwell-mean 43.0 --dwell-std 35.0 --kmer-model ../f5c/test/r9-models/r9.4_70bps.u_to_t_rna.5mer.template.model /mnt/d/genome/gencode.v40.transcripts.fa
[main] Real time: 0.865 sec; CPU time: 0.469 sec; Peak RAM: 0.430 GB

hasindu@hasindu-3660:/mnt/d/hasindu2008.git/squigulator$
hasindu@hasindu-3660:/mnt/d/hasindu2008.git/squigulator$
hasindu@hasindu-3660:/mnt/d/hasindu2008.git/squigulator$ slow5tools skim a.slow5
#read_id        read_group      digitisation    offset  range   sampling_rate   len_raw_signal  raw_signal      channel_number  median_before   read_number     start_mux       start_time
S1_1!ENST00000651234.1|ENSG00000188157.15|OTTHUMG00000040778.7|OTTHUMT00000099680.4|AGRN-209|AGRN|7477|protein_coding|!833!4811!-       0       2048    -260.168371     748.5801        4000    35526   .       0239.864613      0       0       0

[main] cmd: slow5tools skim a.slow5
[main] real time = 0.002 sec | CPU time = 0.000 sec | peak RAM = 0.002 GB
hasindu@hasindu-3660:/mnt/d/hasindu2008.git/squigulator$ slow5tools skim b.slow5
#read_id        read_group      digitisation    offset  range   sampling_rate   len_raw_signal  raw_signal      channel_number  median_before   read_number     start_mux       start_time
S1_1!ENST00000651234.1|ENSG00000188157.15|OTTHUMG00000040778.7|OTTHUMT00000099680.4|AGRN-209|AGRN|7477|protein_coding|!833!4811!-       0       2048    -252.635553     548.788269      3000    182458  .       0268.611434      0       0       0

[main] cmd: slow5tools skim b.slow5
[main] real time = 0.006 sec | CPU time = 0.000 sec | peak RAM = 0.003 GB
hiruna72 commented 3 months ago

As we realised if -x is not provided the default profile r9-dna is chosen. this can cause the reads to be different depending on the DNA/RNA specs. It is a good idea to print the profile to stderr.