victoresque / pytorch-template

PyTorch deep learning projects made easy.

New handling of 'data_utils', adding: PyTorch data loader support, restructuring config file ++ #24

Closed · borgesa closed this 6 years ago

borgesa commented 6 years ago

Hi,

This pull request contains restructuring of the data handling. This includes splitting dataset and data loader handling into separate files. It also includes support for using the native data loader of PyTorch.

I look forward to your comments.

Changes:

Regarding BaseDataLoader

With reference to issue #6, 'BaseDataLoader' could be deprecated. This pull request brings us one step closer to achieving this.

The class can still be used after this pull request (although I had to update references to the configuration file, due to its new parameter grouping structure).

SunQpark commented 6 years ago

Hi @borgesa, I appreciate your efforts on this huge change to the project structure. But I noticed some disagreement between your thoughts and mine regarding the purpose of this project. So let me first explain my thinking, and then please tell me what you think.

This project is supposed to be a template, which exists to save users time by letting them recycle the repeated parts of a project. Therefore, I believe it should be not only easy to use, but also simple and minimal. It should be simple so that people can understand the overall structure of what is going on, and minimal, since standard deep learning projects built with pytorch do not have that many 'repeated parts'.

Your recent tendency to make everything configurable from the config file seems opposed to this direction. Adding get_something functions that import modules makes our project less minimal. People who open data_loader.py may want to see how to load data, but they first have to understand what this get_data_loaders function is before they discover the small piece of code that defines the data_loader class.

As for making the pytorch native datasets and transforms available in the config file, it does not even make things easier, since people then have to learn how to set up the config file. Doesn't it feel more natural to define things related to the data in files like data_loader.py and datasets.py? To me, what you are trying to do feels like creating a new user interface rather than improving the template.

Providing more features is fine, but I think it should always be easy to swap our methods out for the standard pytorch way of doing things. Since you referred to issue #6, which deals with the deprecation of base_data_loader, I recommend you check my PR #10, which closed that issue. I think @victoresque decided to deprecate base_data_loader because it was heavy before that PR, and decided against it afterwards because it is no longer heavier than pytorch's DataLoader, yet provides better features (validation split, implementation of __len__, etc.).

It may take us a lot of discussion and time to reach an agreement, but I think we can.

SunQpark commented 6 years ago

On the other hand, I like the overall folder structure you've changed to. How about moving logger and visualization into the utils folder, and renaming the datasets/ folder to inputs/, which is consistent with kaggle projects, or to data/, as you mentioned in a previous PR?

borgesa commented 6 years ago

Hi @SunQpark,

Thanks for the response. I welcome the discussion and agree that it is important that we align.

I have tried my best to answer your questions and address your concerns in a clear way below.

Apologies for the lengthy text.

Overall goal:

My goal is definitely the same as yours: the template should be easy to use and save its users time.

That goal is why my proposal tends towards configuration in the 'config.json' file.

Config file:

My main arguments for the proposal are flexibility and the automated saving of experiment parameters together with model checkpoints.

Once a user has created the necessary framework for an experiment*, trying out different training parameters (optimizer, learning rate and scheduler, ...) only requires small changes in 'config.json'. If desired, even the model architecture can be tuned and compared via the config file**.

The great thing about this is that since 'config.json' is saved inside the "save_dir", you automatically get an overview of the parameters used in your different experiments***.

When this is combined with the TensorboardX-functionality that you implemented, the result is - the way I see it - pretty close to a complete experimentation setup for deep learning.

The different training runs of an experiment can then be compared in Tensorboard, and it is very easy to see which combinations of parameters resulted in better models.

Notes:

*: Mainly implementing a dataset class and a model class, plus (ideally minor) tweaks to the training loop, etc.

**: If the model class takes arguments handling this (e.g., the number of layers in a ResNet architecture)

***: This resembles the logic/interface used in this cs230 pytorch template, which I really like (although I have not studied the underlying code in detail)
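
To make this concrete, here is a minimal sketch (the config keys and the factory name are illustrative, not necessarily those in this PR) of how 'train.py' could build an optimizer purely from 'config.json':

import json
import torch.optim as optim

# Hypothetical excerpt of config.json -- switching from Adam to SGD, or
# changing the learning rate, only requires editing these few lines:
#   "optimizer": {"type": "Adam", "args": {"lr": 0.001, "weight_decay": 0}}

def get_optimizer(model, config):
    # Build the optimizer named in the config, forwarding its kwargs.
    opt_cfg = config['optimizer']
    opt_class = getattr(optim, opt_cfg['type'])   # e.g. optim.Adam
    return opt_class(model.parameters(), **opt_cfg['args'])

with open('config.json') as f:
    config = json.load(f)
# optimizer = get_optimizer(model, config)  # 'model' is built elsewhere in train.py

Since the same 'config.json' is copied into "save_dir", it fully documents which optimizer and settings each run used.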

'transforms' in config-file:

My argument here is, again, flexibility and the logging of experiments. Storing all the main factors used in training a model in the same folder as the checkpoints makes it easier to reproduce results.

In the PyTorch data loading and processing tutorial, 'transforms' are passed as an argument when the example dataset class is initialized. The proposed format aligns with this.
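
For reference, the pattern from that tutorial looks roughly like this (a sketch; 'CustomDataset' is a hypothetical example, not code from this PR):

from torch.utils.data import Dataset

class CustomDataset(Dataset):
    # Skeleton dataset that takes its transform as an init argument,
    # as in the official data loading tutorial.
    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transform is not None:
            sample = self.transform(sample)  # apply the configured transform
        return sample

# The transform pipeline is composed outside the class and passed in, e.g.:
# CustomDataset(data, transform=transforms.Compose([...]))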

Additional benefit: hyper-parameter search:

An additional benefit of this configuration file convention is that we (later) can write functionality that enables hyper-parameter tuning with a very easy interface for the user.

This is one of the proposals I have been planning to pitch in for the "way ahead" of this repository.

Example workflow:

1) The user selects, say, two parameters and sets two values for each of them in a 'hyperparameter file'
2) 'train.py' is called with the additional argument '--hp-search h-params.json'
3) 'train.py' sequentially executes the training for the resulting four combinations of hyperparameters
4) TensorboardX logs are created in the same folder, so that the user can easily compare the results of the training runs
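
A rough sketch of how step 3 could be implemented (this functionality does not exist yet; the hp-file format and the injected train function are assumptions):

import itertools
import json

def hp_search(base_config, hp_file, train_fn):
    # Run one training per combination of the values listed in the hp file.
    with open(hp_file) as f:
        grid = json.load(f)  # e.g. {"lr": [0.01, 0.001], "batch_size": [32, 64]}

    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(base_config)     # shallow copy; nested keys need more care
        config.update(zip(keys, values))
        train_fn(config)               # single-run entry point, e.g. train.py's main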

Dataloader:

Thanks for pointing me to PR #10. I have seen it previously, but had not read all of the code.

The main thing my proposal does is take the train/validation split out of the abstract base class. I believe this makes the code more transparent (and closer to how other repositories handle data loaders).

The 'validation split' part of my proposal (inside the function 'get_data_loaders') is based on your code from 'BaseDataLoader._split_sampler'.

I actually wrote the proposed handling of data loaders to make it closer to the standard PyTorch way:

If users are already used to PyTorch, then allowing them to directly use the standard data loader class is good.
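
A minimal sketch of what this can look like using only standard PyTorch pieces (the signature is illustrative, not the exact one in the PR):

import numpy as np
from torch.utils.data import DataLoader
from torch.utils.data.sampler import SubsetRandomSampler

def get_data_loaders(dataset, batch_size, validation_split=0.1, shuffle=True):
    # Split a dataset into train/validation loaders, outside any loader subclass.
    indices = np.arange(len(dataset))
    if shuffle:
        np.random.shuffle(indices)
    split = int(len(dataset) * validation_split)
    train_sampler = SubsetRandomSampler(indices[split:])
    valid_sampler = SubsetRandomSampler(indices[:split])

    # Both loaders are plain torch.utils.data.DataLoader instances.
    train_loader = DataLoader(dataset, batch_size=batch_size, sampler=train_sampler)
    valid_loader = DataLoader(dataset, batch_size=batch_size, sampler=valid_sampler)
    return train_loader, valid_loader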

Separation of dataset from data loader

I believe it is easier for users to implement their own dataset classes when the dataset class is handled separately from the data loader class.

Especially since most (all?) users are likely to write their own dataset class, while many (most?) can likely make do with the standard data loader.

'get-xyz'-functions:

I do understand your concern. There are two reasons why I propose the 'get_xyz'-functions:

1) It becomes very easy to read the logic structure of 'train.py'
2) It gives users flexibility, for instance if they wish to test the same dataset on two different models, or even test one model on two different datasets
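
To illustrate the first point, the top-level flow of 'train.py' could then read like this (the factory bodies below are trivial stand-ins, just to show the flow):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the real factories in datasets.py / data_loader.py:
def get_dataset(config):
    return TensorDataset(torch.randn(8, 3), torch.zeros(8))

def get_data_loaders(dataset, batch_size, **kwargs):
    return DataLoader(dataset, batch_size=batch_size)

def main(config):
    dataset = get_dataset(config['dataset'])
    loader = get_data_loaders(dataset, **config['data_loader']['args'])
    for batch in loader:
        pass  # the Trainer would consume the loader(s) here

main({'dataset': {}, 'data_loader': {'args': {'batch_size': 4}}})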

I think we can mitigate this by adding FAQs to the repository ("How do I implement my data loader?", "How do I implement my network?") or a small tutorial, including references to the PyTorch documentation.

Overall folder structure:

Yes, I agree with your proposal. If we can organize the files into fewer folders while keeping a logical structure, that would be nice.

Summary

I hope my response gives you an understanding of my proposals (and the reason I have made them). My plan is to use this template myself in my own research projects going forward.

Look forward to your response :)

Communication proposal:

Apologies again for the long post. Quite a few things to reply to.

To ease discussion concerning topics such as this one, could it be an idea to create for instance a Slack channel associated with this repository?

SunQpark commented 6 years ago

config file

The ability to change something in the config file does not always mean it is easier to configure; sometimes it just ends up restricting the functionality. I think one example is the way lr_scheduler is implemented in this repository. It is convenient for simple scheduler instances like step or exponential, but it broke when I tried using ReduceLROnPlateau, since the logic that scheduler relies on is somewhat different from the simple ones. I think you will at least agree that we should be very careful doing things like this.

I am concerned that your implementation controlling transforms (or other things) from the config file will end up in a similar place. Moreover, configuring transforms in the json file does nothing to make the setup easier, since it is not python. Why should we bother initializing python objects like transforms in a json file, abandoning the help of the IDE?

I guess your main point is having the experiment configuration saved automatically. I suggest we approach that problem by saving information like the transform instances into a separate log file alongside the checkpoints (in my past projects, I used to write it into the training name).
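
A minimal sketch of that suggestion (the file name and fields are hypothetical):

import os

def log_experiment_setup(save_dir, transform, optimizer):
    # Write the repr of the experiment's python objects next to the checkpoints.
    os.makedirs(save_dir, exist_ok=True)
    with open(os.path.join(save_dir, 'setup.log'), 'w') as f:
        f.write('transform: {!r}\n'.format(transform))
        f.write('optimizer: {!r}\n'.format(optimizer))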

data loader

I don't think I completely understood what the standard pytorch way you mentioned is, but encapsulating tested code via class inheritance is at least the standard way of OOP. If my data_loader breaks while training a model, I would suspect the base_data_loader only after I had examined everything in the data_loader and dataset classes. Suspecting the pytorch data handling interface would come even after that. This hierarchy really helps. If the so-called 'standard pytorch way' denies it, I would just ignore it.

Those who want to use the pytorch native DataLoader instance can do that easily in the current setup too, without any problem. Why do you think they would still want to use the config file setup and the train-valid split you provide, while not wanting to use the abstract base class? Because it is not transparent? Anyone can open and understand base_data_loader.py. The only thing we need to do to make it clear is to write clear code in the base, as you suggested in the related issue. Moreover, the hierarchy provides a better interface for those who don't want to know about the complex things behind the scenes.

By the way, I have sometimes felt that I needed a template for the dataset as well, but I didn't make one, for the following reasons:

  1. Implementation of dataset changes a lot.
  2. 'pytorch official tutorial' already provides that.

Sometimes the dataset instance can be built in one line, if it is supported by pytorch, but sometimes it takes hundreds of lines of complex pre-processing. I would prefer building the dataset inside data_loader.py for the former, but prefer having a separate dataset.py for the latter. Even the train/test distinction does not apply to every project. My personal thinking is that we do not need to make the dataset a separate file. Even if we do, it should be in a form that can be deleted at any time. The get_dataset function you proposed prevents users from doing that, in this case.

Thank you for the suggestion of using a Slack channel, but I prefer having the conversation on this repo, since someone interested in this project may be curious about the reasoning behind how this template looks. If you agree, I will put some commits on this PR. I feel sorry because the commits I want to add from now on would be mainly destructive ones.

borgesa commented 6 years ago

General

My proposal follows the configuration setup of @victoresque; quoting the readme file:

- .json config file support for more convenient parameter tuning.

What I propose makes the parameter tuning even more flexible. The way I see it, it improves one of the main features of this repository.

As an example, my proposal also aligns quite well with how configuration is handled in pytorch-CycleGAN-and-pix2pix*. For instance, they take 'lr_policy' (including ReduceLROnPlateau) as one of their training arguments. Afterwards, they save the arguments that were used in 'opt.txt', together with the model. It is a tidy approach...

*: Note that I am not arguing that the rest of this repository should be modelled the same way, since the repository I mention is way too heavy for a template.

config file

I split my answers below:

I think one example is the way lr_scheduler is implemented in this repository. [...] it broke when I tried using ReduceLROnPlateau [...].

With the current implementation of 'lr_scheduler', it does break. However, if the configuration file passes kwargs through to the 'lr_scheduler', you remain fully flexible. Further, the applied scheduler is saved with the model, and reproduction becomes easy.
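
For instance (a sketch, not code from this PR), the scheduler type and its kwargs could both live in the config, which keeps ReduceLROnPlateau usable:

import torch.optim.lr_scheduler as lr_scheduler

# Hypothetical config entries:
#   {"lr_scheduler": {"type": "StepLR", "args": {"step_size": 50, "gamma": 0.1}}}
#   {"lr_scheduler": {"type": "ReduceLROnPlateau", "args": {"mode": "min", "factor": 0.5}}}

def get_lr_scheduler(optimizer, config):
    # Build whichever scheduler the config names, forwarding its kwargs.
    cfg = config['lr_scheduler']
    scheduler_class = getattr(lr_scheduler, cfg['type'])
    return scheduler_class(optimizer, **cfg['args'])

# Note: the training loop still has to treat ReduceLROnPlateau specially,
# since its step() expects a metric value, e.g. scheduler.step(val_loss).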

I think you will at least agree that we should be very careful doing things like this.

[...] Moreover, configuring transforms in the json file does nothing to make the setup easier, since it is not python. Why should we bother initializing python objects like transforms in a json file, abandoning the help of the IDE?

I agree that we should be careful. In particular, I agree with regard to the transforms; another flexible solution could/should be used there.

[...] I suggest we approach that problem by saving information like the transform instances into a separate log file alongside the checkpoints (in my past projects, I used to write it into the training name).

Some parameters can of course be saved in the training name, but only a few (or the file naming convention will quickly get out of hand).

However, saving separate log files (and file names) with this information would require the user to write custom code to gather all the parameters from different hard-coded places in the code. Here is a scenario that hopefully illustrates why this is inconvenient:

- I train eight different parameter variations of a model, with the parameters hard coded, as per the way I understand you want it
- Between starting each of these trainings, I have to make manual changes in several different files
- Upon reviewing the results, I decide to continue training three of these
- For each of these three, I need to:
    - Individually go into the checkpoint/save folder to get the custom parameters
    - Manually change the hard-coded parameters back to what they were
    - Resume the training
    - Repeat the manual task for the next one...

Doing this means both added work and an added risk of putting in the wrong parameters.

With my proposal, hyperparameter training can easily be automated, and a simple "--resume /save/model_x" is enough to continue the training of promising candidates...

data loader

I don't think I completely understood what the standard pytorch way you mentioned is [...].

I will try to explain by commenting on the BaseDataLoader below.

Currently, there are two layers of encapsulation around 'torch.utils.data.DataLoader'. The added functionality of the first (performing the train/validation split) is normally done outside the data loader. The second only initializes the dataset, which is at least as tidy to do outside the data loader.

Comments on the first encapsulation, BaseDataLoader:

class BaseDataLoader(DataLoader):
    def __init__(self, dataset, config, collate_fn=default_collate):
        self.dataset = dataset          # Comment: Already handled by DataLoader.__init__
        self.config = config            # Comment: The data loader does not need to save the entire config
        self.collate_fn = collate_fn    # Comment: Already handled by DataLoader.__init__

        self.batch_size = [...]         # Comment: Already handled by DataLoader.__init__
        self.validation_split = [...]   # Comment: Should in my opinion not be done inside the data loader
        self.shuffle = [...]            # Comment: Already handled by DataLoader.__init__

        self.batch_idx = 0              # Comment: This is not used elsewhere?
        self.n_samples = [...]          # Comment: 'len(self.sampler)' covers this; externally, the
                                        #          property is accessible via 'len(data_loader.sampler)'

        self.sampler, self.valid_sampler = [...] # Comment: Splitting into training/validation inside the
                                                 #          dataloader is not standard PyTorch way

        self.init_kwargs = {                     # Comment: These comments relate to passing this dict to super().__init__()
            'dataset': self.dataset,             # Comment: Overrides 'self.dataset' from above
            'batch_size': self.batch_size,       # Comment: Overrides 'self.batch_size' from above
            'shuffle': self.shuffle,             # Comment: Handles shuffling, 'self.shuffle' is not needed
            'collate_fn': self.collate_fn        # Comment: Overrides 'self.collate_fn' from above
            }
        super(BaseDataLoader, self).__init__(sampler=self.sampler, **self.init_kwargs)

    def __len__(self):                  # Comment: This overrides the __len__ method of the standard PyTorch
        [...]                           #   data loader, which normally returns __len__ of the batch sampler. Currently
                                        #   they do the same, so I do not think it is needed to include this.

    def _split_sampler(self, split):    # Comment: As I wrote above, I think that separation of the data into
        [...]                           #   training and validation is not the data loaders job. It should be
                                        #   done outside of this class.
    def split_validation(self):         # Comment: Same as above
        [...]

To summarize: the BaseDataLoader to some degree reinvents the wheel.

My current preference is to remove the dependency on it and use the default PyTorch data loader as the standard (deleting lines 45-58 of 'data_loader.py' in my pull request, and maybe even deleting 'BaseDataLoader.py').

Comments on the second encapsulation, the example "MnistDataLoader":

Composing the transforms and loading the dataset can, in the Mnist case, easily be done outside the data loader. In my view it is tidier to do so.

Other

Those who want to use the pytorch native DataLoader instance can do that easily in the current setup too, without any problem.

No, they cannot easily do that. The template does not currently support using the native 'DataLoader' without users manually implementing a training/validation split (since Trainer assumes there is a data loader for validation, which there really also should be).

I am not saying that this is hard to do, but most users would expect general training/validation splitting to be covered by the template.

The splitting would be equivalent to what is currently done inside 'BaseDataLoader'. Pulling it out (as I have done in my pull request) lets it be used with both 'BaseDataLoader' and the native PyTorch 'DataLoader'. This is another argument for pulling this functionality out of 'BaseDataLoader'.

Why do you think they would still want to use the config file setup and the train-valid split you provide, while not wanting to use the abstract base class? Because it is not transparent?

As I have mentioned above: I do not think that the added functionality belongs inside a data loader, and without this functionality it has the same function as the native DataLoader (quite literally).

By the way, I have sometimes felt that I needed a template for the dataset as well, but I didn't make one, for the following reasons:

  1. Implementation of dataset changes a lot.
  2. 'pytorch official tutorial' already provides that.

Yes, and the dataset class skeleton I added should of course refer to the official tutorial. The location of the "CustomDataset" shows template users where to place their custom dataset classes, and the reference to the tutorial shows them what they need to do.

[...] My personal thinking is that we do not need to make the dataset a separate file. Even if we do, it should be in a form that can be deleted at any time. The get_dataset function you proposed prevents users from doing that, in this case.

Less experienced users who are getting to know PyTorch (or deep learning) will typically only need to implement a couple of things: the dataset class and the model class. This enables us to write easy tutorials with several examples using only these two classes (the template handles the rest).

Having a designated place where they implement the dataset class and model class is thus natural.

If users want to change the template and delete 'dataset.py', they can just as easily delete or modify 'get_dataset'...

Slack

Thank you for the suggestion of using a Slack channel, but I prefer having the conversation on this repo, since someone interested in this project may be curious about the reasoning behind how this template looks.

My intention was of course not to make discussions unavailable, but to have a Slack channel connected to this repository. Integrations exist to connect the two.

Regarding commits

If you agree, I will put some commits on this PR. I feel sorry because the commits I want to add from now on would be mainly destructive ones.

I would prefer that you not do this yet; I am at least interested in hearing @victoresque's opinion first.

Besides, if it is decided not to go in this direction, it will be tidier to close this pull request.

SunQpark commented 6 years ago

Thank you for the detailed response, and for pointing out some points I had missed. It seems that the main conflicts between us lie in the following two points.

  1. configuration of data_utils in the config.json
  2. where the validation split of data_loader should be

I will keep this comment short, concentrating on these two most important points. Tell me if you think there are other points as important as these. Let's talk about the points where we differ most, and fine-tune the other details later.

config file

As I understand it, what you propose for the first is to add them to the config for the flexibility of hyper-parameter search. My point is not about whether to do this, but about how much should be configurable. In my opinion, selecting which dataset or data_loader to use is clearly not part of hyper-parameter tuning, but part of the basic setup of a project, since changing the data completely resets the experimental process. I think you will agree on this.

About the transforms, it is less obvious. I agree that adding them to the config file gives us more flexibility, but my opinion is still that the disadvantages are larger. The disadvantages lie in:

  1. our template becoming heavy for the sake of that functionality
  2. the inconvenience of setting up a python module in the json file
  3. making the config file look complex

Moreover, I think the setup for data augmentation should be done independently of the other tuning processes. That's because the purpose of the other processes lies in optimizing the validation metric, while that of augmentation lies in finding appropriate levels of modification of the input data, without making it follow a distribution different from the test set's.

data_loader

As I understand it, what you want is to do split_validation outside of the data_loader instance, since that is the standard pytorch way. But I still don't get why we should follow that standard if it hurts the readability of our code. The previous way of handling it may not follow the standard, but it clearly provided a better interface in data_loader.py. Thank you for pointing out some things I had missed in the base_data_loader, but I still don't see why it should be deprecated. Why should reinventing the wheel be prohibited if the new wheel does its job better than the old one?

others

I perfectly agree with you that we need someone else's opinion at this point. I feel sorry for bothering you with this lengthy conversation, but what do you think about this, @victoresque?

I didn't know slack could do that. Would you please tell me how to set up the slack channel, @borgesa?

borgesa commented 6 years ago

Github + Slack

I have never set up a "github + slack" pairing myself, just seen that some projects use one. I believe that, for instance, the developers of the Atom IDE use it (I have not been on it myself, so I am not sure how it works). I think it's a combination of an open slack channel and the github integration.

It might be too much setup for this project, and issue/pull request comments are likely sufficient for most discussions.

Links: https://slack.github.com/ https://get.slack.help/hc/en-us/articles/232289568-GitHub-for-Slack

Further development

For info: I will likely continue moving a copy of this repository in the direction I explained above, since I need it for my work. For now, I am doing so on a branch of my fork (current branch).

borgesa commented 6 years ago

Since @victoresque has not responded and @SunQpark has updated most of the repository, disregarding the ideas I brought up, I am closing this pull request.

yjlolo commented 4 years ago

My comment and the small modifications that follow enable specifying a list of transforms in config.json, in order to configure the dataset wrapped in the data_loader:

In config.json, add transform under data_loader:

I can input a list of transforms (and the corresponding arguments) in the args of data_loader. [screenshot of config.json]

A helper function in util.py, to be used for the customized data_loader:

[screenshot of the util.py helper]

In data_loaders.py, construct the customized dataset with the parsed transform:

[screenshot of data_loaders.py]
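
Since the screenshots are not reproduced here, the following sketch approximates the three steps (the config layout and the helper are my reconstruction, not necessarily the exact code in those screenshots):

import torchvision.transforms as transforms

# Step 1 -- a config entry such as:
#   "data_loader": {"args": {"transform": [
#       {"type": "Resize", "args": {"size": 64}},
#       {"type": "ToTensor"}]}}

# Step 2 -- a helper like the one in util.py: turn that list into a Compose.
def parse_transform(transform_cfg):
    return transforms.Compose([
        getattr(transforms, t['type'])(**t.get('args', {}))  # e.g. transforms.Resize(size=64)
        for t in transform_cfg
    ])

# Step 3 -- in data_loaders.py, pass the parsed transform to the dataset:
# dataset = CustomDataset(data_dir, transform=parse_transform(cfg['transform']))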

Why we need this

First, we might want to include some transforms for our loaded data. Second, and more importantly, we might want different sets of parameters for those transforms, e.g., the sampling rate for audio samples, the number of frequency bins for audio spectrograms, the value range for input normalization, etc. It is therefore more convenient if all the specifications can be done in config.json, which also enhances reproducibility, in the sense that all parameters (now including those for the input representations, in addition to those for models and trainers) are in one place.