karel-brinda / NanoSim-H

NanoSim-H: a simulator of Oxford Nanopore reads; a fork of NanoSim.
Other
17 stars 6 forks source link

A short "genome" sequence causes a hang, not a fail #9

Open rmcolq opened 5 years ago

rmcolq commented 5 years ago

I realise this is not the typical use case, but I was trying to simulate reads with the error profile of nanopore and did not care about the length. I ran the example case just fine, but found that when I simulated reads from a short sequence (2000bps) this caused nanosim-h to start and just hang until my server timed out. Adding 15,000 "A"s to the beginning and end of my short sequence enabled it to work. A sanity check on user input would be a good idea.

nanosim-h -p ecoli_R9_2D -n 500 random_path.fa --unalign-rate 0 --max-len 10000
karel-brinda commented 5 years ago

Hi Rachel, thanks for the report! I completely agree that this needs to be fixed.

mbhall88 commented 5 years ago

For anyone coming across this, I found a workaround.
I was working with a small toy reference 300bp long. If run with default parameters it hangs forever due to the length of one of the simulated reads being longer than my actual reference. So if you set --max-len to be the length of your reference then it works fine.

@karel-brinda I wonder if max_readlength should be set by default to the length of the reference rather than infinity? https://github.com/karel-brinda/NanoSim-H/blob/1e2d374585903ceea50de0ea1be05da1ef869514/nanosimh/simulate.py#L785

It doesn't make sense to be able to generate reads longer than the reference?

mbhall88 commented 5 years ago

Actually, I have been digging into this a little more and found something interesting. If I indeed set --max-len to the length of the reference (300) it gets part of the way through and then hangs. If I take that down to about 290 it runs fine.
But if I also set --min-len to the same as --max-len then it just hangs.

However, if I make --min-len 1 less than --max-len it has an ETA of 2 hours for 1000 reads - this seems to be expected though as mentioned in the nanosim README

If you want the fasta file I am working with I'll attach it below (it says it's a .txt file but it is actually fasta).

test.txt

ExplodingCabbage commented 5 years ago

Pretty sure there's a bug in extract_read. In the use-case where your reference sequence is a single short non-circular chromosome (perhaps because the "chromosome" is really a PCR amplicon you want to simulate reads of), it is highly likely that get_length will return some values greater than your sequence length. When extract_read gets called with such a length value, length gets set to max(seq_len.values()) which in turn equals genome_len. Then we repeatedly try to generate a read start position with this line:

ref_pos = random.randint(0, genome_len)

Then we hit this if/elif/else block:

if ref_pos + length < seq_len[key]:
    new_read = seq_dict[key][ref_pos:ref_pos + length]
    read_info = (key, ref_pos)
    break
elif ref_pos < seq_len[key]:
    break
else:
    ref_pos -= seq_len[key]

But even in the best case where ref_pos is 0, ref_pos + length < seq_len[key] is guaranteed to be false in the single-short-chromosome scenario because length equals seq_len[key]. So new_read never gets set and we stay in the while True: loop generating new random start positions forever.

I'll have a fiddle around and see if there's an easy fix.

ExplodingCabbage commented 5 years ago

https://github.com/karel-brinda/NanoSim-H/pull/16 fixes the hang. However, when running against a short reference with that fix applied, I now see this crash instead:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/bin/nanosim-h", line 11, in <module>
    load_entry_point('NanoSim-H', 'console_scripts', 'nanosim-h')()
  File "/Users/markamery/NanoSim-H/nanosimh/__init__.py", line 19, in nanosimh_simulate
    simulate.main()
  File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 1010, in main
    min_l=min_readlength, merge=merge, rnf=rnf, rnf_cigar=rnf_cigar
  File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 455, in simulation
    read_mutated, cigar = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias)
  File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 702, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

I'll keep debugging. :)

ExplodingCabbage commented 5 years ago

I've pushed another commit to #16 to fix the error mentioned above.

ExplodingCabbage commented 4 years ago

I think this is fixed on devel and can be closed?

Not yet released to master or PyPI, though.

karel-brinda commented 4 years ago

It's included in the devel branch, but not yet in master (sorry for the delay).

You can install the devel version by:

pip install git+https://github.com/karel-brinda/NanoSim-H@devel