Closed: flomlo closed this issue 2 years ago.
OK, on a 50% reconstruction of the huge connectome (it's still a 700 MB .h5 file, so I'll refrain from posting it here) the error is as follows:
Out of bounds, tried to add an edge involving vertex 50038, but there are only 49926 vertices.
I'm posting that additional bit of information because I was previously somewhat suspicious about the 34902/34316 numbers, as they are pretty close to 2^15 (resulting in .flag files with lines of roughly 2^16 characters because of the whitespace separators). But this does not seem to be the problem.
Hey @flomlo, thanks for the report!
Unless size really is a problem, I suspect foul play from
https://github.com/giotto-ai/pyflagser/blob/a4f933a91613b4892afaae5e63343ec531027e13/pyflagser/_utils.py#L8
somehow. Perhaps we could see what vertices and edges are, and whether there are any inconsistencies in them?
Is this a directed graph? And is it in sparse format?
Hi,
I've checked it all, and now I'm absolutely certain it is due to vertex_index_t being set to an unsigned short in https://github.com/luetge/flagser/blob/292bcf15202eaa27a81612acfb746235bed56480/include/definitions.h#L12.
I've tried another graph with 115462 vertices (the 50% reconstruction mentioned above), and flagser complains that it only has 49926 vertices. But 115462 mod 2^16 is exactly 49926, so this makes me absolutely sure that the bug is due to 2^16 being too small :D
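Just as a quick sanity check on the wrap-around arithmetic (plain Python, nothing pyflagser-specific):

```python
# The reconstructed graph has 115462 vertices, but a 16-bit vertex index
# wraps around modulo 2**16 = 65536:
print(2**16)           # 65536, the hard limit without MANY_VERTICES
print(115462 % 2**16)  # 49926, exactly the vertex count reported in the error
```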
I've now compiled flagser for myself with MANY_VERTICES enabled, but that seems to be a problem if HDF5 support is enabled at the same time, so I'm currently digging through this.
Excellent debugging! I'm glad this is clear. I'm also summoning @MonkeyBreaker so he is aware of this conversation as it just turned low-level.
Assuming that I manage to compile all that properly, do you have a test suite to check for performance regressions?
I suspect that just adding the MANY_VERTICES flag to https://github.com/giotto-ai/pyflagser/blob/master/CMakeLists.txt (and maybe preventing HDF5 support from being compiled alongside, since it is not used anyway(?)) will fix my bug here.
But it will result in performance regressions for smaller graphs. Maybe not in computation time (this is hard to predict), but certainly in memory usage (probably twice as much? But that is hard to tell as well).
I'll leave the decision on whether the regressions are acceptable to you; maybe I can just add a small section to the README describing what to do in the case of huge graphs. I think one can expect a user to compile their own version of pyflagser if they run into this kind of problem.
Hi!
Nice digging! About the issue: on the Python side we should at least check the number of vertices and print a warning about the current limitations of the C++ backend.
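Something along these lines is what I have in mind; this is only a sketch, not pyflagser's actual API (the function name and the exact handling of the limit are assumptions):

```python
import warnings

# Hypothetical guard on the Python side; the actual limit comes from the C++
# backend, which uses a 16-bit vertex_index_t unless MANY_VERTICES is defined.
_VERTEX_LIMIT = 2**16

def _warn_if_too_many_vertices(n_vertices):
    if n_vertices >= _VERTEX_LIMIT:
        warnings.warn(
            f"The input graph has {n_vertices} vertices, but the flagser "
            f"backend was compiled without MANY_VERTICES and supports fewer "
            f"than {_VERTEX_LIMIT}; results may be wrong or an error may be "
            f"raised."
        )
```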
@flomlo Looking at MANY_VERTICES, it only changes the size of vertex_index_t and index_t. In my opinion, and from personal experience, I wouldn't expect a big drop in performance; if anything I would expect a gain, due to how CPU architectures work. But it will consume more memory, that's for sure.
In my opinion, we could enable MANY_VERTICES by default, and of course check that it doesn't break anything thanks to our test pipeline :sunglasses:.
@flomlo, if you are ready to create a PR, I would go in this direction (@ulupo, what do you think?): enable MANY_VERTICES by default.
Of course, I'm more than glad to review any work. I'm just currently too busy with other tasks to create the PR myself.
Best, Julián
Hi,
Yeah, I'll try modifying pyflagser's CMakeLists.txt locally and see how it goes. This will take some time; I'm a total noob with CMake.
In the meantime I've confirmed that flagser-count with MANY_VERTICES (and HDF5 support disabled) performs correctly.
About the speed regression test: I'd use pyflagser on the BBP connectome (32k vertices, 8M edges, topologically complex) because I have it lying around anyway. This needs about 1 h of computation time and, IIRC, ~8 GB of RAM (without MANY_VERTICES), which is a proper test, I think.
Then I'll test it again with MANY_VERTICES enabled and let you know about the results.
Yes, that is not an autogenerated dataset, but getting a huge random network to have a computation time of 1m < t < 1h is surprisingly nasty.
It will be done sometime this afternoon or over the weekend.
Best,
Florian
Thanks both! Just curious: if the memory increase is significant, couldn't we have both versions available and pick the appropriate one at runtime?
About "Yes, that is not an autogenerated dataset, but getting a huge random network to have a computation time of 1m < t < 1h is surprisingly nasty.":
Since we would just be checking for index failures, why not simply create a sparse graph consisting of disconnected points, save for a single edge between two vertices with very large indices?
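For instance, something like this (a rough sketch only; the choice of size and indices is arbitrary, and the pyflagser call is just illustrative):

```python
import numpy as np
from scipy import sparse
import pyflagser

# A graph that is trivial except for a single edge between two vertices whose
# indices exceed 2**16; any 16-bit index truncation in the backend shows up here.
n = 70000
row = np.array([65600])
col = np.array([69999])
adj = sparse.coo_matrix((np.ones(1), (row, col)), shape=(n, n))

print(pyflagser.flagser_count_unweighted(adj))
```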
Re: compiling both and choosing at runtime: yes, that's actually quite feasible. Not beautiful, but it should combine the best of both worlds.
Re: testing: I thought the testing was to check for memory/time regressions? If it is just about checking that graphs with >65k vertices work, then yes, that is a trivial test.
At the moment, my prejudice is that true memory/performance tests for very large datasets cannot realistically become part of the CI test suite, because we are limited in both resources and total time (60 min in total, including installing environments, compiling, installing Python dependencies and running all tests). But I'll wait to see what comes of your local tests, @flomlo!
Note that if we compile both versions there is no need for such tests because we would have no choice as to which version to use beyond a certain size.
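Conceptually, the dispatch could be as simple as the following sketch (the two backend objects are hypothetical; pyflagser does not currently ship two compiled versions):

```python
def select_backend(n_vertices, slim_backend, many_vertices_backend):
    """Pick the backend compiled without MANY_VERTICES for small graphs, and
    the MANY_VERTICES build once a 16-bit vertex index would overflow."""
    return slim_backend if n_vertices < 2**16 else many_vertices_backend
```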
yeah, I'll try modifying the CMakeLists.txt of pyflagser locally and see how it goes. This will take some time, I'm a total noob with CMake.
I would add something like target_compile_definitions(flagser_count_pybind PRIVATE MANY_VERTICES=1) after line 42 in CMakeLists.txt; this should do the trick.
compiling both and choosing at runtime: Yes, that's actually quite feasible. Not beautiful, but should combine the best of both worlds.
I'm not a big fan of compiling both; we already do this with USE_COEFFICIENTS. For MANY_VERTICES, I think we should be able to add a template argument to the C++ backend so that the size of vertex_index_t and index_t is chosen at runtime instead of generating multiple binaries.
At the moment, my prejudice is that true memory/performance tests for very large datasets cannot realistically become part of the CI test suite because we are limited in both resources and total time (60 min total including installing environments, compiling, installing python dependencies and running all tests). But I'll wait to see what comes of your local tests, @flomlo !
@ulupo I think that for large datasets, benchmarks should be done locally and not on the CI. Maybe we should document how users who would like to contribute to the performance side of the library can measure performance on their machine with public datasets, in order to benchmark their version against the current main branch (main is now the name of what used to be the master branch on GitHub) of pyflagser.
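A local benchmarking snippet could look roughly like this (just a sketch: the random matrix stands in for whatever public dataset is used, and peak-RSS reporting via the resource module is Unix/Linux-specific):

```python
import resource
import time

import pyflagser
from scipy import sparse

# Placeholder dataset so the snippet is self-contained; in practice this would
# be a public connectome's adjacency matrix loaded from disk.
adjacency = sparse.random(20000, 20000, density=5e-4, format="coo", random_state=0)

start = time.perf_counter()
counts = pyflagser.flagser_count_unweighted(adjacency)
elapsed = time.perf_counter() - start

# ru_maxrss is reported in KiB on Linux (bytes on macOS).
peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"wall time: {elapsed:.1f} s, peak RSS: ~{peak_mib:.0f} MiB")
print("cell counts per dimension:", counts)
```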
I'm not a big fan of compiling both; we already do this with USE_COEFFICIENTS. For MANY_VERTICES, I think we should be able to add a template argument to the C++ backend so that the size of vertex_index_t and index_t is chosen at runtime instead of generating multiple binaries.
I'm also not a fan, for reasons of elegance, but if I understand it correctly, MANY_VERTICES is a flag used at compile time in order to build more efficient code?
Either I'm not aware of some advanced programming tricks, or getting the performance benefits of both a slim version and a MANY_VERTICES version absolutely requires that the two be compiled separately.
Either I'm not aware of some advanced programming tricks, or having the performance-benefits of a slim version and a MANY_VERTICES-version absolutely requires that both of them have been compiled separately.
Right now, as you say, it's a compile-time definition, but with meta-programming we could easily make it a runtime condition. If you look at my PR in flagser, https://github.com/luetge/flagser/pull/36, I make KEEP_FLAG_IN_MEMORY a runtime condition instead of a compile-time definition.
Oh, then this is definitely beyond the scope of my programming skills :D
For now, at least...
OK, here are the test results for flagser-memory --min-dim 4 with and without MANY_VERTICES on the BBP connectome data, computed on my personal Ryzen 3600X with 64 GB of RAM. (To make the results diagnostically less conclusive, a few other tasks (Firefox, Thunderbird, ...) were running in the background as well.)
Without MANY_VERTICES: 4m03s, maximum RAM 17.49 GB
With MANY_VERTICES: 3m55s, maximum RAM 18.72 GB
See the full output: time_on_flagser_with_MANY_VERTICES.txt, time_on_flagser_without_MANY_VERTICES.txt
Result: I would ignore the additional 7% of memory usage (as well as the 3% faster execution, which is probably just measurement error) and simply add MANY_VERTICES to the compile flags in pyflagser's CMakeLists.txt.
I do not see the need for a version without MANY_VERTICES enabled; if someone is that low on RAM, they should not be using Python anyway :D
Do you agree?
For me, yes, we could enable MANY_VERTICES by default. I should be able to come up with a templated solution in flagser in the following days (but in the meantime we could integrate this quick fix). What do you think, @ulupo?
Thank you for the measurements :)!
If you think the templated solution is that quick to implement, then I'm happy to just use the MANY_VERTICES version locally for the next few weeks. Works for me™
[EDIT: use it locally]
I think it's doable, but right now I cannot spend the time required, and the merge into flagser could also take some time.
I think that merging a temporary fix is a good approach if we include a quick test and update the documentation.
I've patched flagser (see https://github.com/luetge/flagser/pull/40); now there is an error in the case of too many vertices, asking the user to compile with MANY_VERTICES enabled.
I think this error should be raised from pyflagser as well, so that if there is an accidental regression again, the user gets a meaningful warning instead of the hard-to-interpret error I got.
It seems like this issue is closed, but I am actually still having the same problem. That is, I do not get a meaningful warning about the graph being too large. What am I missing? Here is an example with a simple but large graph.
Steps/code to reproduce
```python
import scipy.sparse
import numpy as np
import pyflagser

n = 100000
data = np.random.random(100)
row = np.random.choice(n, 100, replace=False)
col = np.random.choice(n, 100, replace=False)
adj = scipy.sparse.coo_matrix((data, (row, col)), [n, n])

print(np.count_nonzero(adj.diagonal(0) != 0))
print(adj.get_shape())
pyflagser.flagser_weighted(adj)
```
Expected results
pyflagser runs without errors or gives a meaningful message that the graph is too large
Actual results
RuntimeError Traceback (most recent call last)
Hi @danielaegassan! Looking at this whole thread and the related PR #68, I think this issue was just temporarily abandoned due to me being distracted by other projects at the time, but if @flomlo and @MonkeyBreaker are happy with this I can just merge #68 and fix the issue.
Hi @danielaegassan,
As @ulupo said, the issue was temporarily abandoned; thank you for taking the time to look into why we were encountering it.
We will merge #68, but I don't know the timeline for publishing a release of pyflagser with the fix.
I'll keep you updated when the new version is published.
Just merged #68, which closes this issue. @danielaegassan, you can indeed wait for the next pyflagser release, but I suspect you'd rather not have to. Perhaps you would be happy to pull the merged changes into a clone of the repo and compile pyflagser from source?
Description
Whilst analysing the (admittedly huge) connectome from this statistical reconstruction of a mouse V1 (https://www.dropbox.com/sh/w5u31m3hq6u2x5m/AAAGAOP5nQlTuyA8Uhg3hMTZa/GLIF_network/network/v1_v1_edges.h5?dl=0), pyflagser.flagser_count_unweighted throws a RuntimeError about 20 s after starting: RuntimeError: Out of bounds, tried to add an edge involving vertex 34902, but there are only 34316 vertices.
Admittedly, this .h5 file contains a massive graph (250k vertices, 70 million edges).
Steps/Code to Reproduce
flagser_count_unweighted(graph)
Expected Results
pyflagser performs its duties and does not crash.
Actual Results
Versions
Further notes
I'll look into it soon. I've got a few suspects (e.g. compiling flagser with 32-bit uints for vertices, which is a compile flag); I would also like to see what happens with the flagio functions. I'll also have a look at the minimum graph size required to trigger the bug.
If you have other hot leads, let me know.
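To probe that minimum size, a simple sweep like this could work (a sketch only; the sizes are arbitrary, and the expectation is that the threshold sits right at 2**16 = 65536 vertices):

```python
import numpy as np
from scipy import sparse
import pyflagser

# Probe increasing vertex counts with a single edge between the last two
# vertices; the first size at which the count fails (or is wrong) is the limit.
for n in (60000, 65000, 65535, 65536, 65537, 70000):
    adj = sparse.coo_matrix(
        (np.ones(1), (np.array([n - 2]), np.array([n - 1]))), shape=(n, n)
    )
    try:
        counts = pyflagser.flagser_count_unweighted(adj)
        print(n, "ok", counts[:2])
    except RuntimeError as err:
        print(n, "failed:", err)
```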