Closed: flomlo closed this issue 2 years ago.
OK, on a 50% reconstruction of the huge connectome (it's still a 700 MB .h5 file, so I'll refrain from posting it here) the error is as follows:
Out of bounds, tried to add an edge involving vertex 50038, but there are only 49926 vertices.
I'm posting that additional bit of information because I was previously somewhat suspicious about the 34902/34316 numbers, as they are pretty close to 2^15 (resulting in .flag files with lines of roughly 2^16 characters because of the whitespace separators). But this does not seem to be the problem.
Hey @flomlo, thanks for the report!
Unless size really is a problem, I suspect foul play from
https://github.com/giotto-ai/pyflagser/blob/a4f933a91613b4892afaae5e63343ec531027e13/pyflagser/_utils.py#L8
somehow. Perhaps we could see what vertices and edges are, and whether there are any inconsistencies in them?
Is this a directed graph? And is it in sparse format?
Hi,
I've checked it all, and now I'm absolutely certain it is due to vertex_index_t being set to an unsigned short in https://github.com/luetge/flagser/blob/292bcf15202eaa27a81612acfb746235bed56480/include/definitions.h#L12.
I've tried another graph with 115462 vertices (the 50% reconstruction mentioned above), and flagser complains that it only has 49926 vertices. But 115462 mod 2^16 is exactly 49926, so this makes me absolutely sure that the bug is due to 2^16 being too small :D
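Just as a quick sanity check on the wrap-around arithmetic (plain Python, nothing pyflagser-specific):

```python
# The reconstructed graph has 115462 vertices, but a 16-bit vertex index
# wraps around modulo 2**16 = 65536:
print(2**16)           # 65536, the hard limit without MANY_VERTICES
print(115462 % 2**16)  # 49926, exactly the vertex count reported in the error
```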
I've now compiled flagser for myself with MANY_VERTICES enabled, but that seems to be a problem if HDF5 support is enabled at the same time, so I'm currently digging through this.
Excellent debugging! I'm glad this is clear. I'm also summoning @MonkeyBreaker so he is aware of this conversation as it just turned low-level.
Assuming that I manage to compile all that properly, do you have a test suite to check for performance regressions?
I suspect that just adding the MANY_VERTICES flag to https://github.com/giotto-ai/pyflagser/blob/master/CMakeLists.txt (and maybe preventing HDF5 support from being compiled alongside, since it is not used anyway(?)) will fix my bug here.
But it will result in performance regressions for smaller graphs. Maybe not in computation time (this is hard to predict), but certainly in memory usage (probably twice as much? But that is hard to tell as well).
I'll leave the decision on whether the regressions are acceptable to you; maybe I can just add a small section to the README describing what to do in the case of huge graphs. I think one can expect a user to compile their own version of pyflagser if they run into this kind of problem.
Hi!
Nice digging! About the issue: on the Python side we should at least check the number of vertices and print a warning about the current limitations of the C++ backend.
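Something along these lines is what I have in mind; this is only a sketch, not pyflagser's actual API (the function name and the exact handling of the limit are assumptions):

```python
import warnings

# Hypothetical guard on the Python side; the actual limit comes from the C++
# backend, which uses a 16-bit vertex_index_t unless MANY_VERTICES is defined.
_VERTEX_LIMIT = 2**16

def _warn_if_too_many_vertices(n_vertices):
    if n_vertices >= _VERTEX_LIMIT:
        warnings.warn(
            f"The input graph has {n_vertices} vertices, but the flagser "
            f"backend was compiled without MANY_VERTICES and supports fewer "
            f"than {_VERTEX_LIMIT}; results may be wrong or an error may be "
            f"raised."
        )
```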
@flomlo Looking at MANY_VERTICES, it only changes the size of vertex_index_t and index_t. In my opinion, and from personal experience, I wouldn't expect a big drop in performance; if anything I would expect a gain, due to how CPU architectures work. But it will consume more memory, that's for sure.
In my opinion, we could enable MANY_VERTICES by default, and of course check that it doesn't break anything thanks to our test pipeline :sunglasses:.
@flomlo, if you are ready to create a PR, I would go in this direction (@ulupo, what do you think?): enable MANY_VERTICES by default.
Of course, I'm more than glad to review any work. I'm just currently too busy with other tasks to create the PR myself.
Best, Julián
Hi,
Yeah, I'll try modifying pyflagser's CMakeLists.txt locally and see how it goes. This will take some time; I'm a total noob with CMake.
In the meantime I've confirmed that flagser-count with MANY_VERTICES (and HDF5 support disabled) performs correctly.
About the speed regression test: I'd use pyflagser on the BBP connectome (32k vertices, 8M edges, topologically complex) because I have it lying around anyway. This needs about 1 h of computation time and, IIRC, ~8 GB of RAM (without MANY_VERTICES), which is a proper test, I think.
Then I'll test it again with MANY_VERTICES enabled and let you know about the results.
Yes, that is not an autogenerated dataset, but getting a huge random network to have a computation time of 1m < t < 1h is surprisingly nasty.
It will be done sometime this afternoon or over the weekend.
Best,
Florian
Thanks both! Just curious: if the memory increase is significant, couldn't we have both versions available and pick the appropriate one at runtime?
About "Yes, that is not an autogenerated dataset, but getting a huge random network to have a computation time of 1m < t < 1h is surprisingly nasty.":
Since we would just be checking for index failures, why not simply create a sparse graph consisting of disconnected points, save for a single edge between two vertices with very large indices?
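For instance, something like this (a rough sketch only; the choice of size and indices is arbitrary, and the pyflagser call is just illustrative):

```python
import numpy as np
from scipy import sparse
import pyflagser

# A graph that is trivial except for a single edge between two vertices whose
# indices exceed 2**16; any 16-bit index truncation in the backend shows up here.
n = 70000
row = np.array([65600])
col = np.array([69999])
adj = sparse.coo_matrix((np.ones(1), (row, col)), shape=(n, n))

print(pyflagser.flagser_count_unweighted(adj))
```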
Re: compiling both and choosing at runtime: yes, that's actually quite feasible. Not beautiful, but it should combine the best of both worlds.
Re: testing: I thought the testing was to check for memory/time regressions? If it is just about checking that graphs with >65k vertices work, then yes, that is a trivial test.
At the moment, my prejudice is that true memory/performance tests for very large datasets cannot realistically become part of the CI test suite, because we are limited in both resources and total time (60 min in total, including installing environments, compiling, installing Python dependencies and running all tests). But I'll wait to see what comes of your local tests, @flomlo!
Note that if we compile both versions there is no need for such tests because we would have no choice as to which version to use beyond a certain size.
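Conceptually, the dispatch could be as simple as the following sketch (the two backend objects are hypothetical; pyflagser does not currently ship two compiled versions):

```python
def select_backend(n_vertices, slim_backend, many_vertices_backend):
    """Pick the backend compiled without MANY_VERTICES for small graphs, and
    the MANY_VERTICES build once a 16-bit vertex index would overflow."""
    return slim_backend if n_vertices < 2**16 else many_vertices_backend
```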
yeah, I'll try modifying the CMakeLists.txt of pyflagser locally and see how it goes. This will take some time, I'm a total noob with CMake.
I would add something like target_compile_definitions(flagser_count_pybind PRIVATE MANY_VERTICES=1) after line 42 in CMakeLists.txt; this should do the trick.
compiling both and choosing at runtime: Yes, that's actually quite feasible. Not beautiful, but should combine the best of both worlds.
I'm not a big fan of compiling both; we already do this with USE_COEFFICIENTS. For MANY_VERTICES, I think we should be able to add a template argument to the C++ backend so that the size of vertex_index_t and index_t is chosen at runtime instead of generating multiple binaries.
At the moment, my prejudice is that true memory/performance tests for very large datasets cannot realistically become part of the CI test suite because we are limited in both resources and total time (60 min total including installing environments, compiling, installing python dependencies and running all tests). But I'll wait to see what comes of your local tests, @flomlo !
@ulupo I think that for large datasets, benchmarks should be done locally and not on the CI. Maybe we should document how users who would like to contribute to the performance side of the library can measure performance on their machine with public datasets, in order to benchmark their version against the current main branch (main is now the name of what used to be the master branch on GitHub) of pyflagser.
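A local benchmarking snippet could look roughly like this (just a sketch: the random matrix stands in for whatever public dataset is used, and peak-RSS reporting via the resource module is Unix/Linux-specific):

```python
import resource
import time

import pyflagser
from scipy import sparse

# Placeholder dataset so the snippet is self-contained; in practice this would
# be a public connectome's adjacency matrix loaded from disk.
adjacency = sparse.random(20000, 20000, density=5e-4, format="coo", random_state=0)

start = time.perf_counter()
counts = pyflagser.flagser_count_unweighted(adjacency)
elapsed = time.perf_counter() - start

# ru_maxrss is reported in KiB on Linux (bytes on macOS).
peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"wall time: {elapsed:.1f} s, peak RSS: ~{peak_mib:.0f} MiB")
print("cell counts per dimension:", counts)
```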
I'm not a big fan of compiling both; we already do this with USE_COEFFICIENTS. For MANY_VERTICES, I think we should be able to add a template argument to the C++ backend so that the size of vertex_index_t and index_t is chosen at runtime instead of generating multiple binaries.
I'm also not a fan, for reasons of elegance, but if I understand it correctly, MANY_VERTICES is a flag used at compile time in order to build more efficient code?
Either I'm not aware of some advanced programming tricks, or getting the performance benefits of both a slim version and a MANY_VERTICES version absolutely requires that the two be compiled separately.
Either I'm not aware of some advanced programming tricks, or having the performance-benefits of a slim version and a MANY_VERTICES-version absolutely requires that both of them have been compiled separately.
Right now, as you say, it's a compile-time definition, but with meta-programming we could easily make it a runtime condition. If you look at my PR in flagser, https://github.com/luetge/flagser/pull/36, I make KEEP_FLAG_IN_MEMORY a runtime condition instead of a compile-time definition.
Oh, then this is definitely beyond the scope of my programming skills :D
For now, at least...
OK, here are the test results for flagser-memory --min-dim 4 with and without MANY_VERTICES on the BBP connectome data, computed on my personal Ryzen 3600X with 64 GB of RAM. (To make the results diagnostically less conclusive, a few other tasks (Firefox, Thunderbird, ...) were running in the background as well.)
Without MANY_VERTICES: 4m03s, maximum RAM 17.49 GB
With MANY_VERTICES: 3m55s, maximum RAM 18.72 GB
See the full output: time_on_flagser_with_MANY_VERTICES.txt, time_on_flagser_without_MANY_VERTICES.txt
Result: I would ignore the additional 7% of memory usage (as well as the 3% faster execution, which is probably just measurement error) and simply add MANY_VERTICES to the compile flags in pyflagser's CMakeLists.txt.
I do not see the need for a version without MANY_VERTICES enabled; if someone is that low on RAM, they should not be using Python anyway :D
Do you agree?
For me, yes, we could enable MANY_VERTICES by default. I should be able to come up with a templated solution in flagser in the following days (but in the meantime we could integrate this quick fix). What do you think, @ulupo?
Thank you for the measurements :)!
If you think the templated solution is that quick to implement, then I'm happy to just use the MANY_VERTICES version locally for the next few weeks. Works for me™
[EDIT: use it locally]
I think it's doable, but right now I cannot spend the time required, and the merge into flagser could also take some time.
I think that merging a temporary fix is a good approach if we include a quick test and update the documentation.
I've patched flagser (see https://github.com/luetge/flagser/pull/40); now there is an error in the case of too many vertices, asking the user to compile with MANY_VERTICES enabled.
I think this error should be raised from pyflagser as well, so that if there is an accidental regression again, the user gets a meaningful warning instead of the hard-to-interpret error I got.
It seems like this issue is closed, but I am actually still having the same problem. That is, I do not get a meaningful warning about the graph being too large. What am I missing? Here is an example with a simple but large graph.
Steps/code to reproduce
```python
import scipy.sparse
import numpy as np
import pyflagser

n = 100000
data = np.random.random(100)
row = np.random.choice(n, 100, replace=False)
col = np.random.choice(n, 100, replace=False)
adj = scipy.sparse.coo_matrix((data, (row, col)), [n, n])

print(np.count_nonzero(adj.diagonal(0) != 0))
print(adj.get_shape())
pyflagser.flagser_weighted(adj)
```
Expected results
pyflagser runs without errors or gives a meaningful message that the graph is too large
Actual results
RuntimeError Traceback (most recent call last)
Hi @danielaegassan! Looking at this whole thread and the related PR #68, I think this issue was just temporarily abandoned due to me being distracted by other projects at the time, but if @flomlo and @MonkeyBreaker are happy with this I can just merge #68 and fix the issue.
Hi @danielaegassan,
As @ulupo said, the issue was temporarily abandoned; thank you for taking the time to look into why we were encountering it.
We will merge #68, but I don't know the timeline for publishing a release of pyflagser with the fix.
I'll keep you updated when the new version is published.
Just merged #68, which closes this issue. @danielaegassan, you can indeed wait for the next pyflagser release, but I suspect you'd rather not have to. Perhaps you would be happy to pull the merged changes into a clone of the repo and compile pyflagser from source?
Description
Whilst analysing the (admittedly huge) connectome from this statistical reconstruction of a mouse V1 (https://www.dropbox.com/sh/w5u31m3hq6u2x5m/AAAGAOP5nQlTuyA8Uhg3hMTZa/GLIF_network/network/v1_v1_edges.h5?dl=0), pyflagser.flagser_count_unweighted throws a RuntimeError about 20 s after starting: RuntimeError: Out of bounds, tried to add an edge involving vertex 34902, but there are only 34316 vertices.
Admittedly, this .h5 file contains a massive graph (250k vertices, 70 million edges).
Steps/Code to Reproduce
flagser_count_unweighted(graph)
Expected Results
pyflagser performs its duties and does not crash.
Actual Results
Versions
Further notes
I'll look into it soon. I've got a few suspects (e.g. compiling flagser with 32-bit uints for vertices, which is a compile flag); I would also like to see what happens with the flagio functions. I'll also have a look at the minimum graph size required to trigger the bug.
If you have other hot leads, let me know.
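To probe that minimum size, a simple sweep like this could work (a sketch only; the sizes are arbitrary, and the expectation is that the threshold sits right at 2**16 = 65536 vertices):

```python
import numpy as np
from scipy import sparse
import pyflagser

# Probe increasing vertex counts with a single edge between the last two
# vertices; the first size at which the count fails (or is wrong) is the limit.
for n in (60000, 65000, 65535, 65536, 65537, 70000):
    adj = sparse.coo_matrix(
        (np.ones(1), (np.array([n - 2]), np.array([n - 1]))), shape=(n, n)
    )
    try:
        counts = pyflagser.flagser_count_unweighted(adj)
        print(n, "ok", counts[:2])
    except RuntimeError as err:
        print(n, "failed:", err)
```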