makovalab-psu / NoiseCancellingRepeatFinder

Noise-Cancelling Repeat Finder
MIT License

Running on genome assemblies #8

Open mccartneya90 opened 4 years ago

mccartneya90 commented 4 years ago

Heya,

I'm going to give this a whirl on a plant Hi-C assembly I have, but I get this error message:

failed to re-allocate 131,158,140,720 bytes for 8,743,876,048 DP cells

Could you suggest what I should put for the following parameters? I'm working on a fairly decent cluster so memory is available.

--allocate:aligner= suggest space for aligner data structures
--allocate:sequence= suggest space for sequence nucleotides
--allocate:sequencename= suggest space for sequence name
--allocate:debridger= suggest space for debridger data structures (each stack entry requires 32 bytes)
--allocate:clump= suggest space for error clump data structures

Thank you, A

rsharris commented 4 years ago

The short answer is that this is a use case that I did not envision during design. But I think it's a use case people want, or expect. So I need to try to address it, but also indicate some of the caveats.

First, the code has steps that assume the input is noisy reads. These can have a negative effect when that assumption isn't true. Specifically, we assume errors in reads are independent. So if an alignment contains segments with error density much higher than average, we treat those segments as false positives (not real instances of the repeat) and excise them. This can turn a long repeat into a bunch of shorter ones, some of which may then be discarded because they are too short. While that assumption is reasonable for reads, it is not reasonable for assembled genomes, where the error model is different.
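To make that fragmentation effect concrete, here is a minimal sketch in Python. It is not NCRF's actual clump detection. The function name `excise_error_clumps`, the sliding-window rule, and the parameters `window`, `max_density`, and `min_length` are all hypothetical, chosen only to illustrate the idea. An alignment is modelled as a string of columns, where `=` is a match and anything else is an error.

```python
def excise_error_clumps(columns, window=10, max_density=0.5, min_length=8):
    """Return (start, end) intervals of the alignment kept after excision.

    Illustrative only: a simple sliding-window rule, NOT NCRF's algorithm.
    """
    n = len(columns)
    is_error = [c != "=" for c in columns]

    # Flag every column covered by a window whose error density is too high.
    flagged = [False] * n
    for start in range(max(n - window + 1, 0)):
        errors = sum(is_error[start:start + window])
        if errors / window > max_density:
            for i in range(start, start + window):
                flagged[i] = True

    # Keep maximal unflagged runs, dropping runs shorter than min_length.
    kept = []
    run_start = None
    for i in range(n + 1):
        inside = i < n and not flagged[i]
        if inside and run_start is None:
            run_start = i
        elif not inside and run_start is not None:
            if i - run_start >= min_length:
                kept.append((run_start, i))
            run_start = None
    return kept


# 20 matching columns, a clump of 10 errors, then 5 matching columns.
alignment = "=" * 20 + "x" * 10 + "=" * 5
segments = excise_error_clumps(alignment)
print(segments)
```

In this toy input, the error clump splits the alignment and the short piece after the clump falls below `min_length`, so it is dropped. In an assembly, a diverged stretch in the middle of a genuine repeat array would be handled the same way.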

Second, I think NcRF might be very slow on a search of this scale. I'm not sure of that, but the design target was reads, typically tens of kilobases in length, or a couple hundred kilobases at most. The allocation being requested is ≈175 times what was envisioned as the worst case.
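For a rough sense of scale, the numbers in the error message can be checked directly. The sketch below uses only the two figures from the message and the ≈175 factor above. The variable names are mine, and the "envisioned worst case" it derives is only as precise as that approximate factor.

```python
# Figures copied from the error message in this issue.
requested_bytes = 131_158_140_720
dp_cells = 8_743_876_048

# Bytes requested per DP cell (integer division, with remainder as a check).
bytes_per_cell, remainder = divmod(requested_bytes, dp_cells)

# Rough size of the envisioned worst case, using the approximate factor of 175.
envisioned_worst_case_bytes = requested_bytes / 175

print(f"bytes per DP cell: {bytes_per_cell} (remainder {remainder})")
print(f"requested: {requested_bytes / 1e9:.1f} GB")
print(f"envisioned worst case: about {envisioned_worst_case_bytes / 1e9:.2f} GB")
```

So the request works out to a whole number of bytes per DP cell and roughly 131 GB in total, against an envisioned worst case somewhere under 1 GB.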

Let me know what you are trying to do, and I'll see how this might be accomplished. If you aren't comfortable discussing that in public (i.e. in this issue), feel free to email me at rsharris at bx dot psu dot edu.