mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

Definition of feature caching for node classification #548

Open · mfbalin opened this issue 5 days ago

mfbalin commented 5 days ago

https://github.com/mlcommons/training_policies/blob/master/training_rules.adoc#14-appendix-benchmark-specific-rules

Here, it is stated that feature caching is not allowed. What is the definition of feature caching?

We are preparing a submission using the GraphBolt GNN dataloader. Our framework supports feature and graph caching on GPUs with no redundancy across GPUs, as well as caching in system memory. I am wondering whether I can use any of these components in a valid closed-division MLPerf submission for GNN node classification.

GraphBolt's caching facilities:

- https://www.dgl.ai/dgl_docs/generated/dgl.graphbolt.CPUCachedFeature.html
- https://www.dgl.ai/dgl_docs/generated/dgl.graphbolt.GPUCachedFeature.html
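
To make this concrete, here is a minimal sketch of how these components layer. The positional cache-size arguments, the in-memory `TorchBasedFeature` base, and the sizes are illustrative assumptions; see the docs above for the exact signatures in your DGL version, and note that details like pinned-memory requirements are glossed over:

```python
import torch
import dgl.graphbolt as gb

# Source of truth: node features in host memory (could be disk-backed instead).
num_nodes, feat_dim = 1_000_000, 128
base = gb.TorchBasedFeature(torch.randn(num_nodes, feat_dim))

# Layer the caches: CPU memory caches storage, GPU memory caches CPU memory.
# Assumed constructor shape: (fallback_feature, cache_size_in_bytes).
cpu_cached = gb.CPUCachedFeature(base, 2 * 1024**3)          # ~2 GiB host cache
gpu_cached = gb.GPUCachedFeature(cpu_cached, 512 * 1024**2)  # ~512 MiB GPU cache

# Minibatch reads go through the hierarchy: hot features are served from the
# GPU cache, and misses fall through to the lower levels.
ids = torch.randint(0, num_nodes, (1024,), device="cuda")
feats = gpu_cached.read(ids)
```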

ShriyaPalsamudram commented 4 days ago

Based on the rules, feature caching of any form is not allowed.

@drcanchi can you please review GraphBolt's caching and comment on whether this is any different and whether it can be used?

mfbalin commented 4 days ago

@ShriyaPalsamudram why is such caching not allowed? Both CPU and GPU memory hierarchies consist of multiple levels, and caching is used pervasively to make anything run fast on hardware.

In our case, we treat GPU memory as a cache for CPU memory, which is in turn a cache for SSD storage.
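
To illustrate the pattern I mean (a toy sketch, not GraphBolt code): each level is just an LRU cache fronting a slower backing store, and the levels compose.

```python
from collections import OrderedDict

class LRUFeatureCache:
    """Toy LRU cache fronting a slower backing store, mirroring how a
    GPU-resident cache fronts CPU memory, which in turn fronts SSD."""

    def __init__(self, backing_read, capacity):
        self.backing_read = backing_read  # function: node_id -> feature
        self.capacity = capacity
        self.cache = OrderedDict()

    def read(self, node_id):
        if node_id in self.cache:
            self.cache.move_to_end(node_id)  # hit: mark as recently used
            return self.cache[node_id]
        value = self.backing_read(node_id)   # miss: fetch from slower level
        self.cache[node_id] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used
        return value

# Compose levels: a small "GPU" cache over a larger "CPU" cache over "SSD" reads.
ssd_read = lambda i: f"feature[{i}]"         # stand-in for storage access
cpu_level = LRUFeatureCache(ssd_read, capacity=10_000)
gpu_level = LRUFeatureCache(cpu_level.read, capacity=1_000)
print(gpu_level.read(42))
```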

ShriyaPalsamudram commented 4 days ago

The reason to disallow feature caching is to make the benchmark representative of real-world GNN workloads, which typically operate on much larger datasets (and features). Because we could not access an open-source dataset of matching size, we settled for a smaller one but made the benchmark as representative as possible.

mfbalin commented 4 days ago

> The reason to disallow feature caching is to make the benchmark representative of real-world GNN workloads, which typically operate on much larger datasets (and features). Because we could not access an open-source dataset of matching size, we settled for a smaller one but made the benchmark as representative as possible.

Even when the dataset is large, caching would still be employed to extract maximum performance from the underlying hardware. It sounds like we will have to submit in the open division to showcase what our software is capable of. Would a future submission that utilizes caching qualify for the open division?