Open rmcolq opened 5 years ago
Hi Rachel, thanks for the report! I completely agree that this needs to be fixed.
For anyone coming across this, I found a workaround.
I was working with a small toy reference 300bp long. If run with default parameters it hangs forever due to the length of one of the simulated reads being longer than my actual reference. So if you set --max-len
to be the length of your reference then it works fine.
@karel-brinda I wonder if max_readlength should be set by default to the length of the reference rather than infinity? https://github.com/karel-brinda/NanoSim-H/blob/1e2d374585903ceea50de0ea1be05da1ef869514/nanosimh/simulate.py#L785
It doesn't make sense to be able to generate reads longer than the reference?
Actually, I have been digging into this a little more and found something interesting. If I indeed set --max-len
to the length of the reference (300) it gets part of the way through and then hangs. If I take that down to about 290 it runs fine.
But if I also set --min-len
to the same as --max-len
then it just hangs.
However, if I make --min-len
1 less than --max-len
it has an ETA of 2 hours for 1000 reads - this seems to be expected though as mentioned in the nanosim README
If you want the fasta file I am working with I'll attach it below (it says it's a .txt file but it is actually fasta).
Pretty sure there's a bug in extract_read
. In the use-case where your reference sequence is a single short non-circular chromosome (perhaps because the "chromosome" is really a PCR amplicon you want to simulate reads of), it is highly likely that get_length
will return some values greater than your sequence length. When extract_read
gets called with such a length
value, length
gets set to max(seq_len.values())
which in turn equals genome_len
. Then we repeatedly try to generate a read start position with this line:
ref_pos = random.randint(0, genome_len)
Then we hit this if
/elif
/else
block:
if ref_pos + length < seq_len[key]:
new_read = seq_dict[key][ref_pos:ref_pos + length]
read_info = (key, ref_pos)
break
elif ref_pos < seq_len[key]:
break
else:
ref_pos -= seq_len[key]
But even in the best case where ref_pos
is 0, ref_pos + length < seq_len[key]
is guaranteed to be false in the single-short-chromosome scenario because length equals seq_len[key]. So new_read
never gets set and we stay in the while True:
loop generating new random start positions forever.
I'll have a fiddle around and see if there's an easy fix.
https://github.com/karel-brinda/NanoSim-H/pull/16 fixes the hang. However, when running against a short reference with that fix applied, I now see this crash instead:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/bin/nanosim-h", line 11, in <module>
load_entry_point('NanoSim-H', 'console_scripts', 'nanosim-h')()
File "/Users/markamery/NanoSim-H/nanosimh/__init__.py", line 19, in nanosimh_simulate
simulate.main()
File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 1010, in main
min_l=min_readlength, merge=merge, rnf=rnf, rnf_cigar=rnf_cigar
File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 455, in simulation
read_mutated, cigar = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias)
File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 702, in mutate_read
tmp_bases.remove(read[key + i])
IndexError: string index out of range
I'll keep debugging. :)
I've pushed another commit to #16 to fix the error mentioned above.
I think this is fixed on devel and can be closed?
Not yet released to master or PyPI, though.
It's included in the devel
branch, but not yet in master (sorry for the delay).
You can install the devel version by:
pip install git+https://github.com/karel-brinda/NanoSim-H@devel
I realise this is not the typical use case, but I was trying to simulate reads with the error profile of nanopore and did not care about the length. I ran the example case just fine, but found that when I simulated reads from a short sequence (2000bps) this caused nanosim-h to start and just hang until my server timed out. Adding 15,000 "A"s to the beginning and end of my short sequence enabled it to work. A sanity check on user input would be a good idea.