hasindu2008 / squigulator

a tool for simulating nanopore raw signal data
https://hasindu2008.github.io/squigulator
MIT License

Program hangs on short reference #2

Closed wvdtoorn closed 1 year ago

wvdtoorn commented 1 year ago

Hi @hasindu2008.

This is a great tool, thanks for your efforts!

I am trying to simulate some very short sequences (50 bases). When running the tool with standard settings (and n=100), the program keeps hanging after outputting [INFO] sim_main: Using random seed: xxxxxx. An empty blow5 outfile is created, but no results are written into the file.

When running the program with the test/nCoV-2019.reference.fasta, everything works.

Do you have any ideas on how I could get your program to work for short sequences?

Thanks a lot!

hasindu2008 commented 1 year ago

Seems like a bug. Could you tell me the exact command line you used?

wvdtoorn commented 1 year ago

I'm using ./squigulator barcodes.fa -x dna-r9-min -o reads.blow5 -n 800, with barcodes.fa containing 8 50nt sequences

I think I tracked it down to the interaction between https://github.com/hasindu2008/squigulator/blob/465f544829bbd9731a2d6b523d9b1df4f0b5184a/src/sim.c#L758-L759 and https://github.com/hasindu2008/squigulator/blob/465f544829bbd9731a2d6b523d9b1df4f0b5184a/src/sim.c#L810-L811

hasindu2008 commented 1 year ago

Ahh yes. I used this 200 cutoff, which MinKNOW also uses unless the short-read mode is enabled. I could make this a parameter. Will that do? And also do you want to simulate whole barcodes or randomly sample the barcodes?
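To make the hang concrete, here is a minimal sketch of how a rejection-style read sampler with a 200-base cutoff can retry forever when every reference is only 50 nt long. This is not squigulator's actual code: `MIN_READ_LEN`, `any_ref_long_enough`, and `sample_read_len` are hypothetical names, and the `max_tries` cap is an assumption added here so that the sketch terminates instead of hanging.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Cutoff value taken from the thread; the macro name is hypothetical. */
#define MIN_READ_LEN 200

/* Returns 1 if at least one reference is long enough to yield a read of
 * MIN_READ_LEN bases, 0 otherwise. A check like this up front allows a
 * warning or error instead of a silent hang. */
int any_ref_long_enough(const int32_t *ref_lengths, int n_refs) {
    for (int i = 0; i < n_refs; i++) {
        if (ref_lengths[i] >= MIN_READ_LEN) return 1;
    }
    return 0;
}

/* Hypothetical sampler: picks a random reference and start position, clips
 * the requested read length at the end of the reference, and retries while
 * the clipped read is shorter than MIN_READ_LEN. Without the max_tries cap
 * (an assumption of this sketch) the loop would never exit when all
 * references are shorter than the cutoff. Returns the read length, or -1
 * if no acceptable read was found within max_tries attempts. */
int32_t sample_read_len(const int32_t *ref_lengths, int n_refs,
                        int32_t want_len, int max_tries) {
    assert(n_refs > 0);
    for (int t = 0; t < max_tries; t++) {
        int seq_i = rand() % n_refs;
        int32_t ref_len = ref_lengths[seq_i];
        if (ref_len <= 0) continue;          /* skip empty references */
        int32_t pos = rand() % ref_len;      /* random start position */
        int32_t len = want_len;
        if (pos + len > ref_len) len = ref_len - pos;  /* clip at the end */
        if (len >= MIN_READ_LEN) return len; /* accept the read */
        /* otherwise reject and try again */
    }
    return -1;
}
```

With eight references of length 50, `any_ref_long_enough` returns 0 and `sample_read_len` exhausts its attempts and returns -1. An uncapped loop of the same shape would spin indefinitely, which matches the symptom of an empty output file and no further log messages.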

wvdtoorn commented 1 year ago

Yes, I think that'll do, thanks! I would like to simulate the whole barcode. For now I have hard-coded a few things to get it working for my specific use case, but it would of course be great to have this as an option.

For anyone reading along in the meantime: the dirty fix was altering the gen_read_dna function to always do

*ref_pos=0;
len = ref->ref_lengths[seq_i];
*c = '+';

hasindu2008 commented 1 year ago

I think you should be using the --full-contigs option. Otherwise, it will treat barcodes.fa as a genome and randomly sample reads from it. For example: squigulator barcodes.fa -x dna-r9-min -o reads.blow5 --full-contigs

If I remember right, this will not go into the random read sampler (which has that 200 cut-off). Instead, each sequence will be simulated in full. The only catch is that it will simulate each barcode in barcodes.fa just once. To get more reads, you can create a new FASTA file with 100 copies of each barcode and provide that to squigulator.

wvdtoorn commented 1 year ago

Ah, smart. I tried the --full-contigs before, but moved on after ending up with only one sequence per bc. Thanks for your (fast) help!

hasindu2008 commented 1 year ago

Because a simple bash loop like the one below can do this, I was too lazy to implement such a feature for the --full-contigs option.

for i in $(seq 1 100)
do
    cat barcode.fa >> replicated_barcoded.fa
done

hasindu2008 commented 1 year ago

@wvandertoorn I have added a warning that is printed when references that are too short are detected, so it is clear what is happening. Thanks for reporting. If you have any issues, feel free to reopen this issue or open another one.