pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Add properties to Dataset object #1742

Open Hafplo opened 4 years ago

Hafplo commented 4 years ago

šŸš€ Feature

I would like to have the ability to add custom attributes to a custom dataset. My use case: paths and tables pointing to the original files and ground truth (GT). Other use cases: flags, metadata, sources.

Motivation

After processing the Dataset, we are left with Data / Batch objects that carry minimal context (numeric tensors) and no knowledge of how they were created, when, by which process, or from what source. It would be beneficial to attach this information to the Dataset object in order to 'save' all parameters and inputs used in the "process" method (similar to saving the hyper-parameters of each model training experiment). Since downloading and processing the Dataset takes time, and since the methods and sources may change (especially when working on non-benchmark datasets or experimenting with new types of graphs), it is essential to keep track of our input dataset during research and experimentation.

Additional context

My current workaround:

import pandas as pd
import torch
from torch_geometric.data import InMemoryDataset


class MyDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None, debug_flag=False):
        # Custom attributes are set *before* super().__init__() so that they
        # are already available when process() is triggered by it.
        self.gt_csv_path = GT_PATH  # module-level path to the ground-truth CSV
        self.gt_df = pd.read_csv(self.gt_csv_path)
        self.project = None
        self.debug = debug_flag
        super().__init__(root, transform, pre_transform)

        self.data, self.slices = torch.load(self.processed_paths[0])

Note: I want to use those properties inside the "process" method, so they have to be defined before super().__init__() is called.
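
For completeness, a hedged sketch of how such attributes might then be consumed inside "process". The column names (feat_0, feat_1, label) and the edge-less placeholder graphs are made up for illustration, and Data is assumed to be imported from torch_geometric.data:

    # Continuing the MyDataset class above; assumes `from torch_geometric.data import Data`.
    def process(self):
        data_list = []
        for _, row in self.gt_df.iterrows():
            # Hypothetical columns; replace with the real feature/label layout.
            x = torch.tensor([[row['feat_0'], row['feat_1']]], dtype=torch.float)
            y = torch.tensor([int(row['label'])])
            edge_index = torch.empty((2, 0), dtype=torch.long)  # placeholder: no edges
            data_list.append(Data(x=x, edge_index=edge_index, y=y))

        if self.debug:
            print(f'Processed {len(data_list)} graphs from {self.gt_csv_path}')

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])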

rusty1s commented 4 years ago

Hi and thanks for this issue. However, I'm not sure if I understand the problem exactly. What's wrong with your current "workaround"? Note that you are free to save anything you like inside your custom dataset. For example, we already do this for the GEDDataset, see here.

Hafplo commented 4 years ago

It works fine, but I believe calling super().__init__() after defining some properties is considered bad practice.

It also comes down to the "process" method being called implicitly by "__init__". The '__init__' method is the same for all custom datasets, but the 'process' method is different for each one (and can be very complicated and compute-intensive). Maybe separating them would make this clearer?
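
For reference, a simplified sketch of the implicit flow under discussion (not the actual torch_geometric source, just the gist): "__init__" checks for the processed files and only calls the subclass's "process" when they are missing, which is why any attribute read inside "process" must already exist when super().__init__() runs.

    # Simplified sketch of the flow, not the actual torch_geometric source:
    import os
    import os.path as osp


    class SketchDataset:
        def __init__(self, root):
            self.root = root
            self.processed_dir = osp.join(root, 'processed')
            # In torch_geometric, processing is triggered from __init__:
            self._process()  # runs the (possibly expensive) subclass process() if needed

        @property
        def processed_paths(self):
            return [osp.join(self.processed_dir, f) for f in self.processed_file_names]

        def _process(self):
            if all(osp.exists(p) for p in self.processed_paths):
                return  # cached result found, skip processing
            os.makedirs(self.processed_dir, exist_ok=True)
            self.process()  # subclass hook: any attribute it reads must already be set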

rusty1s commented 4 years ago

Mh, I personally think that the implicit downloading and processing is very convenient. If you want to avoid "bad practice", you can also set your attributes in the process method and save them to disk.
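
A possible way to follow that suggestion, sketched under the assumption that the metadata is picklable; the meta.pt file name, the stored fields, and the build_graphs helper are illustrative, not part of the library API:

    import datetime

    import pandas as pd
    import torch
    from torch_geometric.data import InMemoryDataset

    GT_PATH = 'gt.csv'  # placeholder for the module-level ground-truth path


    class MyDatasetWithMeta(InMemoryDataset):
        # download()/raw_file_names are omitted for brevity, as in the workaround above.

        @property
        def processed_file_names(self):
            return ['data.pt', 'meta.pt']  # 'meta.pt' is an illustrative extra file

        def __init__(self, root, transform=None, pre_transform=None):
            super().__init__(root, transform, pre_transform)
            self.data, self.slices = torch.load(self.processed_paths[0])
            self.meta = torch.load(self.processed_paths[1])  # provenance restored on every load

        def process(self):
            gt_df = pd.read_csv(GT_PATH)
            data_list = self.build_graphs(gt_df)  # hypothetical helper producing Data objects
            data, slices = self.collate(data_list)
            torch.save((data, slices), self.processed_paths[0])

            # Record how the processed tensors were created, next to the tensors themselves.
            meta = {
                'gt_csv_path': GT_PATH,
                'created_at': datetime.datetime.now().isoformat(),
                'num_graphs': len(data_list),
            }
            torch.save(meta, self.processed_paths[1])

This keeps "__init__" free of pre-super() attribute assignments while still exposing the provenance as self.meta every time the dataset is loaded.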