pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

[Roadmap] Support multiple node/edge type sampling using `NeighborLoader` and friends #4765

Open Padarn opened 2 years ago

Padarn commented 2 years ago

🚀 The feature, motivation and pitch

The goal of this roadmap is to understand how we might support sampling across multiple node or edge types in NeighborLoader and LinkNeighborLoader. Currently, the constructor of NeighborLoader takes a single node type with an optional index, for example:

NeighborLoader(
    data=data, 
    input_nodes=('paper', index)
)

However, a user (see e.g. https://github.com/pyg-team/pytorch_geometric/discussions/4707) may reasonably want to sample from multiple node or edge types instead, so we would like to propose a way to support this directly in both NeighborLoader and LinkNeighborLoader.

Interface

A proposed interface for this could be to extend the supported input types to include a tuple of tuples for the sampling index. This would probably also mean supporting different negative sampling ratios per type. An example of the new interface would be:

NeighborLoader(
    data=data, 
    input_nodes=(('paper', index), ('author', index))
)

The output of this for NeighborLoader could remain the same, but the sampling would be extended to cover the different input types.

For the case of LinkNeighborLoader, the interface change could be similar:

LinkNeighborLoader(
    data=data,
    edge_label_index=(
        (('paper', 'author'), index),
        (('author', 'paper'), index),
    )
)

For LinkNeighborLoader, we store the dictionaries data[edge_type].edge_label and data[edge_type].edge_label_index to allow users to keep track of which edges the sampled edges correspond to:

data['edge_type'].edge_label
data['edge_type'].edge_label_index

This is easy to extend to multiple types, but we may want to implement a simple way to find out which edge types were sampled in a given mini-batch:

data.get_sampled_edge_types
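
A minimal sketch of what such a helper could look like, assuming edge_label_index is stored per edge type as above (the function and its standalone form are hypothetical, not existing PyG API):

def get_sampled_edge_types(batch):
    # List the edge types of a mini-batch HeteroData object that
    # actually carry sampled labels.
    return [edge_type for edge_type in batch.edge_types
            if 'edge_label_index' in batch[edge_type]]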

Implementation

To implement this, we need to decide whether sampling happens randomly across all types in a batch, or whether sampling is done for each type independently. To keep it simple for now, we would keep a single batch_size and assume the user wants to sample randomly from all of the types/indices provided. So, for example,

NeighborLoader(
    data=data,
    batch_size=N,
    input_nodes=(('paper', index), ('author', index2))
)

would return a random sample of N nodes from union([index, index2]).
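
As a rough sketch of these semantics (with placeholder indices standing in for index and index2 above), each mini-batch would consist of N (type, index) pairs drawn uniformly from the union:

import torch

index = torch.arange(10)   # placeholder 'paper' seed nodes
index2 = torch.arange(5)   # placeholder 'author' seed nodes
N = 4

pairs = ([('paper', int(i)) for i in index] +
         [('author', int(i)) for i in index2])
perm = torch.randperm(len(pairs))
batch = [pairs[i] for i in perm[:N]]  # mixed node types in one mini-batch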

Risks

This adds complexity in usage: to effectively predict on a mini-batch, you would need to compute the prediction heads for each node/edge type, as sketched below. This needs to be explored through some examples and with help from the community.
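
To illustrate the point about per-type heads, a small sketch with made-up dimensions (hidden size 64 and placeholder class counts; h_dict stands in for the per-type node embeddings of one mini-batch):

import torch

heads = torch.nn.ModuleDict({
    'paper': torch.nn.Linear(64, 4),   # 4 paper classes (placeholder)
    'author': torch.nn.Linear(64, 2),  # 2 author classes (placeholder)
})

h_dict = {'paper': torch.randn(8, 64), 'author': torch.randn(8, 64)}
out = {node_type: heads[node_type](h) for node_type, h in h_dict.items()}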

Roadmap

Any feedback and help from the community is highly appreciated!

cc: @rusty1s

rusty1s commented 2 years ago

Thanks @Padarn.

To avoid confusion with the existing implementation, when using the new functionality the properties data.edge_label and data.edge_label_index would instead be stored on the edge types themselves:

Can you clarify? I think just lifting the current interface to support lists of types as input is totally fine.

In addition, I am not sure what "Support multiple node types in LinkNeighborLoader" means. Do you mean edge types instead?

One challenge (which may reduce runtime) is to define what we pass to the parent constructor in super().__init__(). I think we would need to pass in a list of tuples now, e.g.,

super().__init__([("paper, 0), ("paper, 1), ...], ...)

and convert this back to dictionaries of node indices. This may be expensive to compute since we are operating on Python primitives here, so I would appreciate it if we could do some benchmarking first. Let me know your thoughts!
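
A rough sketch of that round-trip (to_index_dict is a hypothetical helper): a mini-batch arrives as a list of (node_type, index) tuples and is regrouped into per-type tensors, with the pure-Python loop being the suspected cost:

from collections import defaultdict

import torch

def to_index_dict(batch):
    grouped = defaultdict(list)
    for node_type, index in batch:  # pure-Python loop over primitives
        grouped[node_type].append(index)
    return {t: torch.tensor(idx) for t, idx in grouped.items()}

# to_index_dict([('paper', 0), ('paper', 1), ('author', 0)])
# -> {'paper': tensor([0, 1]), 'author': tensor([0])}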

Padarn commented 2 years ago

In addition, I am not sure what "Support multiple node types in LinkNeighborLoader" means. Do you mean edge types instead?

Thanks, fixed.

Can you clarify? I think just lifting the current interface to support lists of types as input is totally fine.

Sorry, I mistakenly remembered that we stored the index and labels without the edge type. I've updated this to make it clear.

Padarn commented 2 years ago

so I would appreciate if we could do some benchmarking first. Let me know your thoughts!

Very good point. Will add an item for this.

Padarn commented 2 years ago

I ran some benchmarks using the new benchmarks/loader/neighbor_loader.py and the code changes in https://github.com/Padarn/pytorch_geometric/tree/padarn/neigbour-benchmark. I didn't try to optimize at all yet, just implemented exactly what was suggested: converting to and from the dict representation via

super().__init__([("paper, 0), ("paper, 1), ...], ...

Here are some results.

Old Version

Evaluation sampling with all neighbors
batch size=16384, iterations=135, runtimes=[3.165, 3.033, 3.074], average runtime=3.091
batch size=8192, iterations=270, runtimes=[3.783, 3.555, 3.67], average runtime=3.669
batch size=4096, iterations=540, runtimes=[4.04, 3.893, 3.879], average runtime=3.937

New Version

Evaluation sampling with all neighbors
batch size=16384, iterations=135, runtimes=[3.126, 3.094, 3.067], average runtime=3.096
batch size=4096, iterations=540, runtimes=[4.18, 4.061, 4.003], average runtime=4.081
batch size=2048, iterations=1080, runtimes=[4.275, 4.391, 4.329], average runtime=4.332

So it's definitely not free: at batch size 16384 the runtimes are nearly identical, but at batch size 4096 the new version is roughly 4% slower.

Padarn commented 2 years ago

I guess the obvious way to do this would be to create a flat graph, like

input_nodes = {'paper': [0, 1, 2], 'author': [0, 1, 2]}

becomes

input_nodes = [0, 1, 2, 0 + num_paper_nodes, 1 + num_paper_nodes, 2 + num_paper_nodes]

The offsets could be added as a property to the NeighborSampler.
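
For concreteness, a minimal sketch of that flattening, using a placeholder node count:

import torch

num_paper_nodes = 3  # placeholder count for the 'paper' type

input_nodes = {'paper': torch.tensor([0, 1, 2]),
               'author': torch.tensor([0, 1, 2])}

# Offsets give each node type a disjoint global index range:
offsets = {'paper': 0, 'author': num_paper_nodes}
flat = torch.cat([idx + offsets[t] for t, idx in input_nodes.items()])
# flat == tensor([0, 1, 2, 3, 4, 5])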

rusty1s commented 2 years ago

Thanks for the benchmarks. They don't look that worrying :) Ideally, we could use the current scheme when only one node type is used, and use the slower variant whenever this is not the case.
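
One way such a dispatch could look, purely as a sketch (normalize_input_nodes is a hypothetical helper):

def normalize_input_nodes(input_nodes):
    if isinstance(input_nodes[0], str):
        # Single ('node_type', index) pair: keep the existing fast path.
        return {input_nodes[0]: input_nodes[1]}
    # Sequence of ('node_type', index) pairs: take the new, slower path.
    return {node_type: index for node_type, index in input_nodes}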

I think your proposed solution of a flat graph won't necessarily work, as you will still need to re-construct the node type for each index in a later step.

Padarn commented 2 years ago

I think your proposed solution of a flat graph won't necessarily work, as you will still need to re-construct the node type for each index in a later step.

True. I was thinking it might be quicker as it could probably be done in tensor operations, but I see that might not actually be any better.
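
A sketch of what such a tensor-op reconstruction could look like, using torch.bucketize (num_paper_nodes is a placeholder; this is illustrative, not the approach actually implemented):

import torch

num_paper_nodes = 3  # placeholder count for the 'paper' type

flat = torch.tensor([0, 1, 2, 0 + num_paper_nodes, 1 + num_paper_nodes])

boundaries = torch.tensor([num_paper_nodes])  # split points between types
offsets = torch.tensor([0, num_paper_nodes])  # start of each type's range
type_ids = torch.bucketize(flat, boundaries, right=True)  # 0='paper', 1='author'
local = flat - offsets[type_ids]  # per-type local indices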

Ideally, we could use the current scheme when only one node type is used, and use the slower variant whenever this is not the case.

Agreed!