[API Proposal] Suggestions for a sensible Dataset API

PavlosMelissinos commented 7 years ago

Hey there, so I've been silently working on a few ideas I've had about a sensible dataset format and I'm currently looking for some feedback before I open a PR here. Could someone check out my version of the MS-COCO dataset and tell me what I would need to add/change in order to make it worthy of a merge?

For instance:

It would make more sense if there were separate files for each dataset, like keras-contrib already is.
I'm in favor of a base, abstract dataset class with methods, like flow, and attributes like IDS/CLASS_NAMES that should be overriden. I could use some help with that as well though, I'm not sure how that should work.
I'd like to remove MSCOCOReduced as a separate class and introduce the possibility of dataset subsets instead, as a feature of the class itself.
Transformations of the data should be separate from the definition of the dataset itself. As soon as a common format is finalized, applying transformations to diverse datasets should be a piece of cake.
Maybe turn the Dataset class (or the output of "flow") into a Sequence? That way it could be combined with keras-transform

I'm especially interested in your opinions regarding my use of a configuration file instead of feeding raw parameter values through code. I prefer my way visually and I think it's more readable but on the other hand it might be harder to debug.

Thanks a lot, I'm looking forward to your feedback!

EDIT: My code is for semantic segmentation but I've intentionally kept the title vague, this discussion can accommodate any kind of task.

ahundt commented 7 years ago

I think there are some valuable discussion points and pain points discussed in the tensorflow datasets api https://github.com/tensorflow/tensorflow/issues/7951, and interesting details on the kinds of transformations people have needed.

ahundt commented 7 years ago

the _init_( config) parameter would probably make the most sense simply being kwargs, which is how python handles all parameters to any function as a dictionary anyway.

PavlosMelissinos commented 7 years ago

Hey there, glad to see you so active! 👍

That tf thread seems handy, I could use the perspective.

the init( config) parameter would probably make the most sense simply being kwargs, which is how python handles all parameters to any function as a dictionary anyway.

Well duh, indeed it makes perfect sense. I don't know why that didn't even cross my mind...

Thanks for the suggestions!

ahundt commented 7 years ago

@PavlosMelissinos It might be best to email the keras mailing list or post this in keras itself for more feedback

ahundt commented 6 years ago

@PavlosMelissinos is your version of the COCO dataset in a condition that can be merged here? My version isn't that great so it would be nice to pull it in!

PavlosMelissinos commented 6 years ago

@ahundt I've improved it a bit but it still needs a lot of cleaning up and I'm not actively working on it anymore. I want to make it more modular and there are a few things that I don't like. I've also pledged to contribute to a keras dataset project with @DEKHTIARJonathan which I've neglected for months (sorry Jonathan!). You can follow that repo if you want and even offer your own implementation if you have one.

The actual implementation for the time being is still on my enet repo, along with some alternative network architectures but they all have to go.

ahundt commented 6 years ago

Ok I actually think I will need to use coco for some stuff and I already use keras-contrib for that so I'd like to just replace the current coco loading script with yours, rather than any of the models or anything. Plus, the coco data format supports other datasets too so it will be super convenient!

PavlosMelissinos commented 6 years ago

In that case feel free to take anything you want from my repo and modify it to your needs, I don't think I'll be doing any work on it for the next few days/weeks.

PavlosMelissinos commented 6 years ago

Oh, on a side note, I originally preferred using json files to define dataset parameters but not anymore. One of the first things I'd change now is turn it into code ("hard-coded" object constants) to remove a layer of (unnecessary) abstraction & complexity.

ahundt commented 6 years ago

Yeah I've found that "hard coding" is actually the way to go in python, as long as you separate the APIs so that you can change the settings programmatically too. Essentially python becomes your programming language and your config file language, then your commits naturally keep track of exactly what you ran. :-)

keras-team / keras-contrib

[API Proposal] Suggestions for a sensible Dataset API #153