pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Graph data augmentation `augmentation`/`aug` (sub)package #5452

Open EdisonLeeeee opened 1 year ago

EdisonLeeeee commented 1 year ago

🚀 The feature, motivation and pitch

There has been a surge in graph data augmentation research, and it is often important for self-supervised learning on graphs. I see there is an implementation of DropEdge (dropout_adj) in the torch_geometric.utils module. Given the prevalence of graph augmentation methods, is there any plan to integrate them into PyG?

Alternatives

No response

Additional context

As a heavy PyG user, what I want is:

import torch_geometric.augmentation as A
aug1 = A.DropEdge()
aug2 = A.DropNode()
aug3 = ...

x, edge_index = aug1(x, edge_index)
# or x, edge_index = aug2(x, edge_index)

rusty1s commented 1 year ago

This sounds really cool. Currently, this would interfere a bit with our transforms package, which is designed for augmentation. The transforms package operates on data objects, while here it looks like we want to perform augmentations as part of the model. Do you have an idea of how to align these two?

EdisonLeeeee commented 1 year ago

Sorry for the late reply. I intended to implement these augmentation methods as part of transforms and accept data as input. However, the problem is that the methods in transforms are all in-place. We shouldn't make the augmentations in-place during training, right?

I assume the transforms package is used for data preprocessing before training, while augmentations (let's call it that for now) provides data augmentation during training. They are a bit different, as I see it. We could follow the implementation of dropout_adj and implement these methods either as functions or as classes (see the rough sketch below). WDYT?
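
To make that concrete, here is a rough sketch of what such a module could look like, simply wrapping the existing dropout_adj utility (the DropEdge name and its interface are placeholders for illustration, not an existing PyG class):

import torch
from torch_geometric.utils import dropout_adj


class DropEdge(torch.nn.Module):
    # Hypothetical module wrapping the existing dropout_adj utility.
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x, edge_index, edge_attr=None):
        # Only drop edges in training mode, analogous to torch.nn.Dropout.
        edge_index, edge_attr = dropout_adj(edge_index, edge_attr, p=self.p,
                                            training=self.training)
        return x, edge_index, edge_attr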

EdisonLeeeee commented 1 year ago

Since augmentation is used as part of the model during training, how about implementing it as a subpackage of nn?

Padarn commented 1 year ago

I like this idea a lot.

I assume the transforms package is used for data preprocessing before training, while augmentations (let's call it that for now) provides data augmentation during training

Could we implement both using the same base functionality? One example is the AddSelfLoops transform, which applies the add_self_loops utility. We could migrate add_self_loops into a new augmentations package and then have the transforms just wrap it, so you can still chain them as preprocessing steps.
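
Roughly, that existing pattern already looks like this (a simplified sketch that ignores edge attributes and heterogeneous data, not the actual implementation), so the proposal would mostly mean making the functional piece the public, documented entry point:

from torch_geometric.transforms import BaseTransform
from torch_geometric.utils import add_self_loops


class AddSelfLoops(BaseTransform):
    # Simplified sketch: the graph-level logic lives in the add_self_loops
    # utility; the transform only adapts it to the Data interface.
    def __call__(self, data):
        data.edge_index, _ = add_self_loops(data.edge_index,
                                            num_nodes=data.num_nodes)
        return data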

I can't think of any clean abstraction to combine these right now but I think moving some of these out of utils would make it easier to discover the functionality.

WDYT?

EdisonLeeeee commented 1 year ago

Sounds great! I personally like the idea of migrating add_self_loops and some others into an augmentations package. My major concern is that add_self_loops seems heavily used from utils, and moving it would break the current contract. How do we ensure backward compatibility for these functions?
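
One option that comes to mind is a thin deprecation shim that keeps the old import path working (a rough sketch; the augmentations package and the exact move are hypothetical at this point):

# torch_geometric/utils/loop.py (hypothetical shim after the move)
import warnings

from torch_geometric.augmentations import add_self_loops as _add_self_loops


def add_self_loops(*args, **kwargs):
    # Keep `torch_geometric.utils.add_self_loops` importable, but point
    # users to the (hypothetical) new location.
    warnings.warn(
        "'torch_geometric.utils.add_self_loops' has moved to "
        "'torch_geometric.augmentations.add_self_loops'", DeprecationWarning)
    return _add_self_loops(*args, **kwargs)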

To sum up, a simple roadmap is:

I am happy to take on these tasks and contribute, but I think I need some help to get started. WDYT?

rusty1s commented 1 year ago

Sorry for the late reply. I intended to implement these augmentation methods as part of transforms and accept data as input. However, the problem is that the methods in transforms are all in-place. We shouldn't make the augmentations in-place during training, right?

This is not necessarily correct. The data object will be modified in-place, but underlying attributes are modified out-of-place.
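
A tiny example of what this means in practice: the Data object itself is updated, but the tensor the user passed in is left untouched, because the transform reassigns a new tensor to the attribute:

import torch
from torch_geometric.data import Data
from torch_geometric.utils import add_self_loops

edge_index = torch.tensor([[0, 1], [1, 0]])
data = Data(edge_index=edge_index, num_nodes=2)

# "In-place" on the Data object, but the attribute is reassigned to a new tensor:
data.edge_index, _ = add_self_loops(data.edge_index, num_nodes=data.num_nodes)

print(data.edge_index.size())  # torch.Size([2, 4]): the Data object was updated
print(edge_index.size())       # torch.Size([2, 2]): the original tensor is unchanged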

Add torch_geometric.augmentations package

Rather than adding a new package, we could still group them within the transforms package, e.g., it would make sense for add_self_loops and AddSelfLoops to live in the same file IMO. Alternatively, we could add a transforms.functional package. WDYT?

Padarn commented 1 year ago

it would make sense for add_self_loops and AddSelfLoops to live in the same file IMO.

I agree with this.

Alternatively, we could add a transforms.functional package. WDYT?

That's an interesting idea.

Maybe it's premature to think about adding a new package without having a really clear scope. One way we could approach it is to write some tutorials on graph augmentations, and if we see some clear common functionality/interface, we could create a new package at that point?

EdisonLeeeee commented 1 year ago

Maybe it's premature to think about adding a new package without having a really clear scope.

Apologies. I should have thought more carefully and thoroughly before that.

Alternatively, we could add a transforms.functional package. WDYT?

I totally agree with this idea. Would you mind if I made a PR for this package? It might take a couple of weeks or longer.

rusty1s commented 1 year ago

No need for any apologies. Instead of one big PR, we could start by factoring out the functionality of individual transforms into standalone functions; the transform classes would then simply call these functions (see the rough sketch below). Afterwards, we can quickly expose it to the public, either as part of transforms, transforms.functional, or augmentations. WDYT?
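
A rough sketch of that split, using a simplified NormalizeFeatures as an example (the function name and where it would live are still open; this is not the actual implementation):

import torch
from torch_geometric.transforms import BaseTransform


def normalize_features(x: torch.Tensor) -> torch.Tensor:
    # Standalone functional piece: row-normalize a node feature matrix.
    return x / x.sum(dim=-1, keepdim=True).clamp(min=1.0)


class NormalizeFeatures(BaseTransform):
    # The transform class becomes a thin wrapper around the function, which
    # could later be exposed via transforms.functional or augmentations.
    def __call__(self, data):
        data.x = normalize_features(data.x)
        return data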

Padarn commented 1 year ago

Oops, sorry, I didn't mean there was a problem @EdisonLeeeee.

Agree with @rusty1s's suggestion to start with a few small PRs.

EdisonLeeeee commented 1 year ago

Got it. Thanks @Padarn and @rusty1s for your suggestions. Really appreciate it!

rusty1s commented 1 year ago

Re-opening to track progress and keep the discussion going :)

downeykking commented 1 year ago

I am also doing research on graph augmentation. In my opinion, a more feasible way is to have each augmentation method exist as a separate function and encapsulate it in a class (in this way, the augmentation methods can be packed into a package). For me personally, this 'functional call style' is very flexible. An example to refer to is https://github.com/PyGCL/PyGCL/tree/main/GCL/augmentors. The calling style is:

import torch.nn as nn

import GCL.augmentors as A
aug1 = A.Identity()
aug2 = A.RandomChoice([A.RWSampling(num_seeds=1000, walk_length=10),
                           A.NodeDropping(pn=0.1),
                           A.FeatureMasking(pf=0.1),
                           A.EdgeRemoving(pe=0.1)], 1)
...
# we init augmentor=(aug1, aug2) in the class
# Use the augmentations with the style of function.
class Encoder(nn.Module):
    def __init__(self, encoder, augmentor):
        super(Encoder, self).__init__()
        self.encoder = encoder
        self.augmentor = augmentor

    def forward(self, x, edge_index, batch):
        aug1, aug2 = self.augmentor
        x1, edge_index1, _ = aug1(x, edge_index)
        x2, edge_index2, _ = aug2(x, edge_index)
...

Hope this will be helpful :)

Padarn commented 1 year ago

Looking at @downeykking's example, it seems like maybe we could just do this all by providing an augmentor to a DataLoaderIterator? (already possible, it's called transform_fn)

The rest of the interface described looks almost identical to the Transform interface already in PyG, as @rusty1s mentioned.

@downeykking @EdisonLeeeee - is it ever necessary to augment the data within the model, or is it always done prior to model input?

EdisonLeeeee commented 1 year ago

@downeykking's example is what I wanted before.

is it ever necessary to augment the data within the model, or is it always done prior to model input?

I think graph augmentation should be used within the model to generate different augmented views of a graph for contrastive learning. Like dropout, graph augmentation also acts as a regularization trick applied in each training round - that's why I thought Transform cannot achieve this, as it is applied prior to the model inputs.

Padarn commented 1 year ago

Like dropout, graph augmentation also acts as a regularization trick applied in each training round - that's why I thought Transform cannot achieve this, as it is applied prior to the model inputs.

Sorry, I guess what I meant to ask was whether or not the augmentation needs to happen "prior to any model layers". Using the above example with a dataloader, we could have (very roughly) something like:

import GCL.augmentors as A
aug = A.RandomChoice([A.RWSampling(num_seeds=1000, walk_length=10),
                           A.NodeDropping(pn=0.1),
                           A.FeatureMasking(pf=0.1),
                           A.EdgeRemoving(pe=0.1)], 1)

for data in DataLoaderIterator(..., transform_fn=aug):
    # data is transformed as it is returned by the iterator, so random transforms apply each time
    out = model(data.x, data.edge_index)
    ....

EdisonLeeeee commented 1 year ago

Oh sorry I misunderstood it.

whether or not the augmentation needs to happen "prior to any model layers".

As far as I know, it typically happens prior to any model layers, so it is feasible to call it outside the model. Your code example makes sense, but there might be a problem if someone wants to use multiple augmenters and return different graphs each time, as in the case provided by @downeykking. Besides, it seems that one has to use DataLoaderIterator even to operate on the full graph without mini-batch training in your case. WDYT?

Padarn commented 1 year ago

but there might be a problem if someone wants to use multiple augmenters and return different graphs each time,

The transform is applied when the iterator returns an element, so if it has random behavior, it could produce different graphs each time.

Besides, it seems that one has to use DataLoaderIterator even to operate on the full graph without mini-batch training in your case. WDYT?

Sorry I didn't understand this point. Do you mind giving an example?

EdisonLeeeee commented 1 year ago

but there might be a problem if someone wants to use multiple augmenters and return different graphs each time

I meant we cannot use a DataLoaderIterator to produce multiple augmented graphs at one time. What I wanted is

for epoch in range(train_epochs):
    data1 = aug1(data)
    data2 = aug2(data)
    out1 = model(data1....)
    out2 = model(data2...)
    ...

it seems that one has to use DataLoaderIterator even to operate on the full graph without mini-batch training in your case.

Using DataLoaderIterator means we need to split the data into batches. So if one wants to just put the whole graph (full-batched) into the model, using DataLoaderIterator would somehow make it a bit complicated.

data = # load full graph as data
for epoch in range(train_epochs):
    data_aug = aug(data) # Using `DataLoaderIterator` would somehow make it a bit complicated
    out = model(data_aug.x, data_aug.edge_index)

But I think using DataLoaderIterator for augmentation would still be a practical solution in most cases.

Sorry for the confusion, please let me know if there is something unclear.

Padarn commented 1 year ago

In your final example, it could be

for epoch in range(train_epochs):
    data_aug = transform(data) 
    out = model(data_aug.x, data_aug.edge_index)

In your first example, the transform function could return a tuple of data objects?

for (a1, a2) in DataLoaderIterator(..., transform_fn=lambda x: (aug1(x), aug2(x))):
   ...

Not against the proposal, just trying to explore possible ideas.

EdisonLeeeee commented 1 year ago

Sounds interesting. You have addressed all my concerns now. Thank you @Padarn, the roadmap is much clearer to me now.

Padarn commented 1 year ago

Hey @EdisonLeeeee do you have a list somewhere of augmentations you'd like to add? Perhaps I could give you a hand :-)

EdisonLeeeee commented 1 year ago

That's awesome! Thank you @Padarn

Currently, I have the following list of methods to add. All of them are simple to use during training, but I'm not quite sure they are really necessary to add to PyG:

To be added...

Padarn commented 1 year ago

Interesting! I need to read through some of your references

elilaird commented 2 months ago

Any update on this, @EdisonLeeeee? I'd love to help contribute!

EdisonLeeeee commented 2 months ago

Hi @elilaird

Here are some initial efforts:

Do you have any thoughts?