tensorflow / neural-structured-learning

Training neural models with structured signals.
https://www.tensorflow.org/neural_structured_learning
Apache License 2.0

Out of Memory issue when building large graph #72

Closed · victorconan closed this issue 3 years ago

victorconan commented 3 years ago

I have 800k instances with 200-dimensional embeddings. I am trying to build the graph using `nsl.tools.build_graph` with a similarity threshold of 0.95. My driver type is r4.4xlarge. I keep getting OOM errors. Does anyone know how to estimate how much memory I need?
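For reference, the call I'm making looks roughly like the sketch below (argument names other than `similarity_threshold`, `lsh_splits`, and `lsh_rounds`, which are the ones discussed here, are my assumptions about the `nsl.tools.build_graph` signature, so please double-check them against the docs):

```python
# Sketch of the graph-building call described above (argument names other
# than similarity_threshold, lsh_splits, and lsh_rounds are assumptions).
import neural_structured_learning as nsl

nsl.tools.build_graph(
    ['/dbfs/path/to/embeddings.tfr'],   # TFRecord of tf.train.Examples with 200-dim embeddings
    '/dbfs/path/to/graph.tsv',          # output edge list: source, target, weight
    similarity_threshold=0.95,          # keep only edges with similarity >= 0.95
    lsh_splits=32,
    lsh_rounds=20,
)
```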

aheydon-google commented 3 years ago

Hi, @victorconan, thanks for your interest and bug report!

The memory required by the graph builder is a function not only of the input data, but also of the resulting graph. A couple of questions for you:

* How big is the TFRecord file from which you're reading your examples?
* When you call `build_graph`, what values (if any) are you supplying for the `lsh_splits` and `lsh_rounds` flags?
* Do you see any output written to your terminal? The program writes an INFO line every 1 million edges it creates.
* Are you able to run the `top` unix program in another shell window while running the graph builder to determine the program's virtual and real memory usage?

Note that `build_graph` exists for backward compatibility and has been deprecated. Please switch to using `build_graph_from_config` in the same package instead.
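For what it's worth, a rough sketch of the equivalent `build_graph_from_config` call is below (hedged: the `GraphBuilderConfig` field names are my recollection of the nsl API rather than something verified in this thread, so please check the API docs):

```python
# Sketch of the non-deprecated entry point (field/argument names are
# assumptions based on the nsl API docs; verify before relying on them).
import neural_structured_learning as nsl

config = nsl.configs.GraphBuilderConfig(
    similarity_threshold=0.95,
    lsh_splits=32,
    lsh_rounds=20,
)
nsl.tools.build_graph_from_config(
    ['/dbfs/path/to/embeddings.tfr'],   # same TFRecord input as build_graph
    '/dbfs/path/to/graph.tsv',          # same TSV edge-list output
    config,
)
```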

Thanks!

victorconan commented 3 years ago

> Hi, @victorconan, thanks for your interest and bug report!
>
> The memory required by the graph builder is a function not only of the input data, but also of the resulting graph. A couple of questions for you:
>
> * How big is the TFRecord file from which you're reading your examples?

**The TFRecord files are only 714MB.**

> * When you call `build_graph`, what values (if any) are you supplying for the `lsh_splits` and `lsh_rounds` flags?

**I used `lsh_splits = 32` and `lsh_rounds = 20`. I am a little confused about the statement in the documentation that "We have found that a good rule of thumb is to set `lsh_splits >= ceiling(log_2(num_instances / 1000))`, so the expected LSH bucket size will be at most 1000." That seems to suggest the max `lsh_splits` should be 10?**

> * Do you see any output written to your terminal? The program writes an INFO line every 1 million edges it creates.

**I am using Databricks, so I only saw this in the Log4j output:**
    
    Uptime(secs): 31200.0 total, 600.0 interval
    Cumulative writes: 75K writes, 75K keys, 75K commit groups, 1.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s
    Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
    Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
    Interval writes: 291 writes, 291 keys, 291 commit groups, 1.0 writes per commit group, ingest: 0.01 MB, 0.00 MB/s
    Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
    Interval stall: 00:00:0.000 H:M:S, 0.0 percent

    Compaction Stats [default]
    Level  Files  Size     Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
    Sum    0/0    0.00 KB  0.0    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.00       0.00               0          0.000     0      0
    Int    0/0    0.00 KB  0.0    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.00       0.00               0          0.000     0      0

> * Are you able to run the `top` unix program in another shell window while running the graph builder to determine the program's virtual and real memory usage?

**I can see from the Databricks Ganglia plot that:**

|       | Min   | Avg    | Max    |
|-------|-------|--------|--------|
| Use   | 36.4G | 87.3G  | 140.5G |
| Total | 341G  | 805.8G | 1.2T   |

> Note that `build_graph` exists for backward compatibility and has been deprecated. Please switch to using `build_graph_from_config` in the same package instead.

**Thanks, will switch to `build_graph_from_config`.**

> Thanks!

Thanks!

aheydon-google commented 3 years ago

Hi, @victorconan.

Unfortunately, I'm unfamiliar with the runtime environment you're using, so I can't really offer much help. Our graph builder is currently limited to running on a single machine and must store all node features and the resulting graph edges in memory (at least when using the lsh_splits and lsh_rounds configuration parameters). We are considering providing a more scalable graph builder in the future, but we have not yet undertaken that effort. On my workstation at work, I've successfully run it on a set of 50K nodes, as described in the build_graph_from_config API docs. 800K is quite a bit larger than that.
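To give a sense of scale, here is a back-of-the-envelope lower bound (assumptions: embeddings held as 4-byte floats; Python object overhead, string node IDs, LSH bookkeeping, and the edge store all come on top of it):

```python
# Rough lower bound on the builder's in-memory footprint (assumptions:
# 4-byte floats for features; the per-edge cost is a guess covering two
# string IDs and a weight -- real overhead depends on the implementation).
num_nodes = 800_000
dim = 200
feature_bytes = num_nodes * dim * 4            # ~640 MB of raw embeddings

def edge_store_bytes(num_edges, bytes_per_edge=100):
    """Illustrative edge-store size; bytes_per_edge is an assumption."""
    return num_edges * bytes_per_edge

print(f"features:  {feature_bytes / 1e9:.2f} GB")
print(f"10M edges: {edge_store_bytes(10_000_000) / 1e9:.2f} GB (illustrative)")
```

The raw features alone are modest; it is the number of edges in the resulting graph (and any oversized LSH buckets) that drives memory use up.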

> I am a little confused about the statement in the documentation that "We have found that a good rule of thumb is to set `lsh_splits >= ceiling(log_2(num_instances / 1000))`, so the expected LSH bucket size will be at most 1000." That seems to suggest the max `lsh_splits` should be 10?

That formula places a lower bound on `lsh_splits`, not an upper bound. If your nodes tend to be grouped in clusters in the embedding space, you may need a much larger value than that lower bound (as you're currently doing).
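Concretely, for the numbers in this thread (this is just the quoted rule of thumb evaluated in Python; reading it as "roughly 2**lsh_splits buckets" is my interpretation of that rule):

```python
import math

num_instances = 800_000

# Rule of thumb quoted above: lsh_splits >= ceiling(log_2(num_instances / 1000)).
lower_bound = math.ceil(math.log2(num_instances / 1000))    # = 10

# Reading the rule as "about 2**lsh_splits buckets", the *expected* bucket
# size at that minimum is already under 1000:
expected_bucket_size = num_instances / 2 ** lower_bound      # = 781.25

# lsh_splits = 32 is therefore well above the minimum, which is allowed:
# the formula is a floor, not a ceiling. Clustered embeddings can still
# leave a few buckets far larger than this expected value.
print(lower_bound, expected_bucket_size)
```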

There are a couple of things I can think of that you might experiment with:

Please reply back on this bug, and I'll try to help further if I can. Thank you.

victorconan commented 3 years ago

Hi @aheydon-google,

Thanks for the reply!

Thanks!

aheydon-google commented 3 years ago

Thanks for the update! If you're using a threshold of 0.99 and the graph builder is running for 3 days, that's a problem. What that tells me is that at least one of your LSH buckets is quite large. That needs to be better understood.
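To illustrate why a single oversized bucket matters so much (assuming candidate pairs are compared within each LSH bucket, which is the usual LSH approach; this is an illustration, not a description of NSL internals):

```python
# Candidate-pair counts for balanced vs. skewed bucket sizes. One huge
# bucket dominates the total work because pair counts grow quadratically
# with bucket size.
def candidate_pairs(bucket_sizes):
    return sum(b * (b - 1) // 2 for b in bucket_sizes)

balanced = [1_000] * 800            # 800K nodes spread evenly over 800 buckets
skewed = [200_000] + [750] * 800    # one huge cluster plus many small buckets

print(f"balanced: {candidate_pairs(balanced):,} pairs")   # ~0.4 billion
print(f"skewed:   {candidate_pairs(skewed):,} pairs")     # ~20 billion
```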

One thing that might help is getting access to the log messages that the graph builder writes. I'm not sure why those aren't currently being written for you. Are you invoking build_graph as a program as described in the nsl.tools Overview? If not, I think it would be good if you could do that, since I believe it should enable INFO-level logging.
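If running it as a stand-alone program is awkward on Databricks, one thing you could try (assuming the graph builder logs through the standard absl/TensorFlow logging machinery, which the INFO lines suggest) is raising the log verbosity in the notebook before calling it:

```python
# Sketch: surface INFO-level log lines in a notebook (assumes the builder
# logs via absl / TF logging; adjust if it uses a different logger).
from absl import logging as absl_logging
import tensorflow as tf

absl_logging.set_verbosity(absl_logging.INFO)
tf.get_logger().setLevel('INFO')
```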

Please let us know how it goes. Thanks!

aheydon-google commented 3 years ago

Closing this issue for now. Please feel free to re-open if you have further questions.