Padarn opened 2 years ago
Thanks @Padarn.
> To avoid confusion with the existing implementation, when using the new functionality the properties `data.edge_label` and `data.edge_label_index` would instead be stored on the edge types themselves:
Can you clarify? I think just lifting the current interface to support lists of types as input is totally fine.
In addition, I am not sure what "Support multiple node types in LinkNeighborLoader" means. Do you mean edge types instead?
One challenge (which may increase runtime) is to define what we pass to the parent constructor in `super().__init__()`. I think we would need to pass in a list of tuples now, e.g., `super().__init__([("paper", 0), ("paper", 1), ...], ...)`
and convert this back to dictionaries of node indices. This may be expensive to compute since we are operating on Python primitives here, so I would appreciate if we could do some benchmarking first. Let me know your thoughts!
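To make the cost concern concrete, the round-trip being discussed is roughly the following pure-Python conversion (a sketch with hypothetical helper names, not the actual PyG internals):

```python
def dict_to_tuples(input_nodes):
    """Flatten {'paper': [0, 1], 'author': [0]} into
    [('paper', 0), ('paper', 1), ('author', 0)] for the parent constructor."""
    return [(ntype, i) for ntype, idx in input_nodes.items() for i in idx]

def tuples_to_dict(pairs):
    """Invert the flattening back into per-type index lists."""
    out = {}
    for ntype, i in pairs:
        out.setdefault(ntype, []).append(i)
    return out
```

Both directions walk every index as a Python object, which is what makes the conversion potentially expensive for large index sets.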
> In addition, I am not sure what "Support multiple node types in LinkNeighborLoader" means. Do you mean edge types instead?
Thanks, fixed.
> Can you clarify? I think just lifting the current interface to support lists of types as input is totally fine.
Sorry I mistakenly remembered that we stored the index and labels without edge type. I've updated this to make it clear.
> so I would appreciate if we could do some benchmarking first. Let me know your thoughts!
Very good point. Will add an item for this.
I ran some benchmarks using the new `benchmarks/loader/neighbor_loader.py` and the code changes in https://github.com/Padarn/pytorch_geometric/tree/padarn/neigbour-benchmark. I didn't try to optimize at all yet, just implemented exactly what was suggested: converting to and from the dict representation via `super().__init__([("paper", 0), ("paper", 1), ...], ...)`.
Here are some results.
Old Version
Evaluation sampling with all neighbors
batch size=16384, iterations=135, runtimes=[3.165, 3.033, 3.074], average runtime=3.091
batch size=8192, iterations=270, runtimes=[3.783, 3.555, 3.67], average runtime=3.669
batch size=4096, iterations=540, runtimes=[4.04, 3.893, 3.879], average runtime=3.937
New Version
Evaluation sampling with all neighbors
batch size=16384, iterations=135, runtimes=[3.126, 3.094, 3.067], average runtime=3.096
batch size=4096, iterations=540, runtimes=[4.18, 4.061, 4.003], average runtime=4.081
batch size=2048, iterations=1080, runtimes=[4.275, 4.391, 4.329], average runtime=4.332
So it's definitely not free.
I guess the obvious way to do this would be to create a flat graph, like

`input_nodes = {'paper': [0, 1, 2], 'author': [0, 1, 2]}`

becomes

`input_nodes = [0, 1, 2, 0 + num_paper_nodes, 1 + num_paper_nodes, 2 + num_paper_nodes]`

The offsets could be added as a property to the `NeighborSampler`.
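The offset trick described above could be sketched as follows (a hypothetical helper in plain Python; a real implementation would presumably use tensor operations instead of lists):

```python
def flatten_input_nodes(input_nodes, num_nodes):
    """Map per-type local indices into one global index space by adding a
    cumulative offset per node type, e.g. {'paper': [0, 1, 2],
    'author': [0, 1, 2]} -> [0, 1, 2, N, N + 1, N + 2] with N paper nodes."""
    flat, offsets, offset = [], {}, 0
    for ntype, idx in input_nodes.items():
        offsets[ntype] = offset  # offsets could live on the NeighborSampler
        flat.extend(offset + i for i in idx)
        offset += num_nodes[ntype]
    return flat, offsets
```

Recovering the node type for a flat index then requires searching the offset table, which is the re-construction step discussed below.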
Thanks for the benchmarks. They don't look that worrying :) Ideally, we could use the current scheme when only one node type is used, and use the slower variant whenever this is not the case.
I think your proposed solution of a flat graph won't necessarily work, as you will still need to re-construct the node type for each index in a later step.
> I think your proposed solution of a flat graph won't necessarily work, as you will still need to re-construct the node type for each index in a later step.
True. I was thinking it might be quicker as it could probably be done in tensor operations, but I see that might not actually be any better.
> Ideally, we could use the current scheme when only one node type is used, and use the slower variant whenever this is not the case.
Agreed!
🚀 The feature, motivation and pitch
The goal of this roadmap is to understand how we might support sampling across multiple node or edge types in `NeighborLoader` and `LinkNeighborLoader`. Currently, for example, the constructor of `NeighborLoader` takes a single `NodeType` with an optional index. However, a user (for example https://github.com/pyg-team/pytorch_geometric/discussions/4707) may reasonably want to instead sample from multiple edge types, so we would like to propose a way to support this directly, for both `NeighborLoader` and `LinkNeighborLoader`.

Interface
A proposed interface for this could be to extend the supported input types to include a tuple of tuples for the sampling index. This would probably mean also supporting different negative sampling ratios. An example of the new interface would be:
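The concrete example did not survive here, so as a rough stand-in, the proposed input shape and its normalization might look like the following (pure-Python sketch; the function name and exact forms are assumptions, not the actual PyG API):

```python
def normalize_input_nodes(input_nodes):
    """Normalize either the current single-type form ('paper', [0, 1, 2]) or
    a proposed multi-type form like (('paper', [0, 1]), ('author', [0]))
    into a dict mapping node type -> list of node indices."""
    if isinstance(input_nodes[0], str):  # current single-type form
        node_type, index = input_nodes
        return {node_type: list(index)}
    # proposed tuple-of-tuples form
    return {node_type: list(index) for node_type, index in input_nodes}
```

The same shape could carry a per-type `negative_sampling_ratio` for `LinkNeighborLoader`, e.g. as an extra tuple element per type.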
The output of this for `NeighborLoader` could remain the same, but the sampling would be extended to cover the different input types. For the case of `LinkNeighborLoader`, the interface change could be similar. For the `LinkNeighborLoader` we store dictionaries `data[type].edge_label` and `data[type].edge_label_index` to allow users to keep track of which edges the sampled edges correspond to. This is easy to extend to multiple types, but we may want to implement a simple way to find out which edge types were sampled in a given minibatch.
Implementation
To implement this we need to decide whether sampling happens randomly across all types in a batch, or whether the sampling is done for each type independently. To keep it simple for now, we would keep a single `batch_size` and assume the user wants to sample randomly from all of the types/indexes provided; for example, a loader given two indices would return a random sample of `N` nodes from `union([index, index2])`
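A minimal sketch of that sampling behaviour (pure Python with a hypothetical helper name; the real implementation would operate on tensors):

```python
import random

def sample_across_types(indices_by_type, batch_size, seed=0):
    """Draw one mini-batch of (node_type, index) pairs uniformly at random
    from the union of all provided per-type index lists."""
    pool = [(ntype, i) for ntype, idx in indices_by_type.items() for i in idx]
    rng = random.Random(seed)
    return rng.sample(pool, min(batch_size, len(pool)))
```

Note that with this scheme a single mini-batch can mix types in arbitrary proportions, which is what makes per-type prediction heads necessary (see Risks below).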
.

Risks
This adds complication in usage, as to effectively predict on a mini-batch you would need to compute the heads for each node/edge type. This needs to be explored through some examples and help from the community.
Roadmap

- … `NeighborLoader`.
- … `NeighborLoader`. #5013
- … `NeighborLoader`.
- … `LinkNeighborLoader`.
- … `LinkNeighborLoader`.
- … `LinkNeighborLoader`.
- … `negative_sampling_ratio` in `LinkNeighborLoader` for different edge types.

Any feedback and help from the community is highly appreciated!
cc: @rusty1s