rvaser / spoa

SIMD partial order alignment tool/library
MIT License
158 stars 32 forks source link

Adding a Graph::clear() function #17

Closed paoloczi closed 3 years ago

paoloczi commented 5 years ago

I find that doing a large number of alignments in the same process leads to memory fragmentation and eventual memory allocation failure. This problem could be eliminated easily by simply providing a Graph::clear() function that clears all of the vectors stored in the Graph and allows a Graph object to be reused. Vector::clear() does not free the memory, it just sets the size to zero. That way, the vectors capacities would grow to the capacity needed by the largest alignment, but without causing memory fragmentation, because of the small number of vectors involved.

This is easy to implement and I could just create a fork and do it, but I would prefer if this is done in the main repository. I am using the spoa library in the Shasta assembler.

Without this functionality, a large run needs to be split among small separate processes, and multithreaded parallelism results in intolerable memory fragmentation (each thread using separate Graphs, of course).

paoloczi commented 5 years ago

I just noticed that ther eare also vectors in the Node and Edge objects. Because of that, just adding a Graph::clear() might not be sufficient to eliminate the fragmentation issue.

rvaser commented 5 years ago

Hi Paolo, I understand your concerns but I have to refactor the whole graph to enable proper reusage of allocated memory and at this moment I am not sure what is the best way to do so. If you have any ideas, please let me know. I'll mark this as an enhancement until I handle this.

Best regards, Robert

paoloczi commented 5 years ago

I was just thinking you could simply add a Graph::clear() function that calls std::vector::clear() for all of the vectors the graph contains. No additional refactoring would be needed.This is not a perfect solution for my issues but it would help, and it would certainly work because the clear() function would put the Graph back into a default-constructed state (except for the capacity of the vectors). It would also automatically destroy all of the Node objects via std::unique_ptr::~uniqueptr, which would be called automatically during nodes.clear().

This way a Graph could be reused for multiple computations, as long as the clear function is used in between. But since vector::clear() only changes the vector size and not its capacity, this would result in automatic reuse of the memory directly owned by Graph. The memory owned by Node's would be reallocated from scratch each time, but that consists of shorter chunks that contribute less to fragmentation, and this would certainly be a good step in the right direction. In any event, users would not mandated to use Graph::clear() and could instead continue to use a different object for each computation, if they so choose.

rvaser commented 5 years ago

I assumed that the fragmentation caused by nodes and edges will make the effect of the basic Graph::clear() function negligible. I'll add the function then.

paoloczi commented 5 years ago

You might well be right as I don't know for sure the contribution of nodes and edges to fragmentation, and it is possible that it is dominant. Although memory allocators tend to treat small chunks specially, with the result that they have less chance to create fragmentation.

How much this change will help will also depend on usage patterns (e. g. the lengths of alignments being computed). But eliminating the contribution of the Graph vectors will not hurt, and might help. Thank you for considering this change.

rvaser commented 5 years ago

Implemented in latest commit on master (v2.0.2).