Closed victorconan closed 3 years ago
Hi, @victorconan, thanks for your interest and bug report!

The memory required by the graph builder is a function not only of the input data, but also the resulting graph. A couple of questions for you:

- How big is the TFRecord file from which you're reading your examples?
- When you call `build_graph`, what values (if any) are you supplying for the `lsh_splits` and `lsh_rounds` flags?
- Do you see any output written to your terminal? The program writes an INFO line every 1 million edges it creates.
- Are you able to run the `top` unix program in another shell window while running the graph builder to determine the program's virtual and real memory usage?

Note that `build_graph` exists for backward compatibility and has been deprecated. Please switch to using `build_graph_from_config` in the same package instead.

Thanks!
> Hi, @victorconan, thanks for your interest and bug report!
>
> The memory required by the graph builder is a function not only of the input data, but also the resulting graph. A couple of questions for you:

> - How big is the TFRecord file from which you're reading your examples?

**The TFRecord files are only 714MB.**

> - When you call `build_graph`, what values (if any) are you supplying for the `lsh_splits` and `lsh_rounds` flags?

**I used `lsh_splits = 32` and `lsh_rounds = 20`. I am a little confused about the statement in the documentation that "We have found that a good rule of thumb is to set lsh_splits >= ceiling(log_2(num_instances / 1000)), so the expected LSH bucket size will be at most 1000." That seems to suggest the max `lsh_splits` should be 10?**

> - Do you see any output written to your terminal? The program writes an INFO line every 1 million edges it creates.

**I am using Databricks, so I only saw this in the Log4j output:**

```
Uptime(secs): 31200.0 total, 600.0 interval
Cumulative writes: 75K writes, 75K keys, 75K commit groups, 1.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 291 writes, 291 keys, 291 commit groups, 1.0 writes per commit group, ingest: 0.01 MB, 0.00 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
Sum 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0
```
> * Are you able to run the `top` unix program in another shell window while running the graph builder to determine the program's virtual and real memory usage?
**I can see from the Databricks Ganglia Plot that:**
| | Min | Avg | Max |
|---|---|---|---|
| Use | 36.4G | 87.3G | 140.5G |
| Total | 341G | 805.8G | 1.2T |
> Note that `build_graph` exists for backward compatibility and has been deprecated. Please switch to using `build_graph_from_config` in the same package instead.
**Thanks, will switch to `build_graph_from_config`**
> Thanks!
Thanks!
Hi, @victorconan.
Unfortunately, I'm unfamiliar with the runtime environment you're using, so I can't really offer much help. Our graph builder is currently limited to running on a single machine and must store all node features and the resulting graph edges in memory (at least when using the `lsh_splits` and `lsh_rounds` configuration parameters). We are considering providing a more scalable graph builder in the future, but we have not yet undertaken that effort. On my workstation at work, I've successfully run it on a set of 50K nodes, as described in the `build_graph_from_config` API docs. 800K is quite a bit larger than that.
> I am a little confused about the statement in the documentation that "We have found that a good rule of thumb is to set lsh_splits >= ceiling(log_2(num_instances / 1000)), so the expected LSH bucket size will be at most 1000." That seems to suggest the max lsh_splits should be 10?
That formula places a lower bound on `lsh_splits`, not an upper bound. If your nodes tend to be grouped in clusters in the embedding space, you may need a much larger value than that lower bound (as you're currently doing).
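To make that lower bound concrete, here is a quick sketch of the rule of thumb in plain Python (just arithmetic, nothing NSL-specific; the function name is my own):

```python
import math

def lsh_splits_lower_bound(num_instances, target_bucket_size=1000):
    """Rule of thumb from the docs: lsh_splits >= ceil(log2(n / bucket_size))."""
    return math.ceil(math.log2(num_instances / target_bucket_size))

# For 800K instances, the rule yields a *lower* bound of 10 splits...
print(lsh_splits_lower_bound(800_000))  # -> 10
# ...but if the embeddings are clustered, some buckets will be far larger
# than the expected size, and values like 32 or 128 may still be needed.
```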
There are a couple of things I can think of that you might experiment with:

- The resulting graph is stored as a `set()` of 2-tuples, where each 2-tuple contains the source ID and target ID of an edge. So if you're using really long strings for the node IDs, that could consume a lot of memory, I suppose; shorter node IDs would reduce that overhead.
- The `embedding_files` argument is a list of files, so you can split your embeddings across multiple input files.

Please reply back on this bug, and I'll try to help further if I can. Thank you.
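As a back-of-the-envelope illustration of the node-ID point, here is a rough sketch in plain Python. The per-slot set overhead constant is an assumption, and CPython's actual overheads vary by version, so treat the numbers as order-of-magnitude only:

```python
import sys

def approx_edge_set_bytes(num_edges, id_len, set_slot_bytes=32):
    """Very rough memory for a set() of (source_id, target_id) string pairs."""
    node_id = "x" * id_len                       # representative node ID
    per_edge = (2 * sys.getsizeof(node_id)       # the two ID strings
                + sys.getsizeof((node_id, node_id))  # the 2-tuple itself
                + set_slot_bytes)                # assumed amortized hash-slot cost
    return num_edges * per_edge

# 10M edges: 64-character IDs cost noticeably more than 8-character IDs.
mib = lambda b: b / 2**20
print(round(mib(approx_edge_set_bytes(10_000_000, 8))), "MiB (8-char IDs)")
print(round(mib(approx_edge_set_bytes(10_000_000, 64))), "MiB (64-char IDs)")
```

Note this is a shallow estimate: `sys.getsizeof` does not follow references, and interning or shared ID strings between edges would lower the real cost.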
Hi @aheydon-google,

Thanks for the reply! I will increase `lsh_splits` to 128 and reduce `lsh_rounds` to 1 and see how long it takes. I do notice that so far the tsv file is around 11G (previously, with threshold 0.95, it was about 157G).

Thanks!
Thanks for the update! If you're using a threshold of 0.99 and the graph builder is running for 3 days, that's a problem. What that tells me is that at least 1 of your LSH buckets is quite large. That needs to be better understood.
One thing that might help is getting access to the log messages that the graph builder writes. I'm not sure why those aren't currently being written for you. Are you invoking `build_graph` as a program, as described in the `nsl.tools` Overview? If not, I think it would be good if you could do that, since I believe it should enable INFO-level logging.
Please let us know how it goes. Thanks!
Closing this issue for now. Please feel free to re-open if you have further questions.
I have 800k instances with 200-dimensional embeddings. I am trying to build the graph using `nsl.tools.build_graph` with a similarity threshold of 0.95. My driver type is r4.4xlarge. I keep getting OOM errors. Does anyone know how to estimate how much memory I need?
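For a first-order estimate of the feature memory alone, simple arithmetic helps (a sketch only; it assumes float32 values and ignores Python object overhead and, crucially, the edge set the builder also holds in memory):

```python
num_instances = 800_000
embedding_dim = 200
bytes_per_value = 4  # assuming float32 embeddings

feature_bytes = num_instances * embedding_dim * bytes_per_value
print(f"{feature_bytes / 2**30:.2f} GiB")  # -> 0.60 GiB
```

The raw features come to well under 1 GiB, which suggests the OOM pressure comes mostly from the resulting edges and intermediate LSH buckets rather than the embeddings themselves.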