Proposal: Unbundling Keras modules into separate repositories

fchollet commented 6 years ago

We should consider moving the modules preprocessing and applications to separate repositories under keras-team: keras-team/preprocessing and keras-team/applications.

They would be listed as a dependency of keras, and would still be importable from e.g. keras.preprocessing.

Why?

They're not part of "the core Keras API" (which is about model building and training)
They're not really coupled to the rest of the codebase, so unbundling is safe
Faster CI runs, better test coverage
Unbundling makes it easier to develop in an open-source setting
Easier to sync across keras-team/keras and tf.keras (shared dependencies instead of code duplication)
Other projects might have need of Numpy preprocessing utilities

Any comments or concerns? Who would be interested in taking greater ownership of these repos, should they materialize?

fchollet commented 6 years ago

We could also do the same with dataset. Less important though, since dataset is for debugging/demo purposes, and not for doing any real work.

Dref360 commented 6 years ago

Minor concern : Since preprocessing is where we push all of your data augmentation API, should Data augmentation Layers (proposed https://github.com/keras-team/keras/pull/9056/files) be in the keras/preprocessing or since it's a layer, we should push it on Keras?

We should define clear boundaries for each repos. Also, will others repos be allowed to add dependencies?

And most examples in keras/examples will rely on all 3 repos. Should we add a separate setup.py for this folder to avoid circular dependencies?

fchollet commented 6 years ago

And most examples in keras/examples will rely on all 3 repos

They would just depend on Keras. Preprocessing/etc should still be importable from Keras. This change would be a pure refactoring, where the code is moved to a different place but the APIs don't change.

should Data augmentation Layers (proposed https://github.com/keras-team/keras/pull/9056/files) be in the keras/preprocessing or since it's a layer, we should push it on Keras?

For now I would suggest Keras contrib. But conceptually a layer would definitely be in Keras, not keras-processing, regardless of what it does. The question is thus whether we want data augmentation layers as part of the core Keras API.

Dref360 commented 6 years ago

Okay, all fine then!

srjoglekar246 commented 6 years ago

If anybody hasn't begun working on this, I would like to give it a go!

fchollet commented 6 years ago

The two new repositories will need "owners". An owner is someone who has write rights and bears responsibility for reviewing PRs (and merging if they think it should be merged), and to a large extent, for answering issues. I will still give guidance on API choices, but that will be limited to user-facing APIs decisions.

@srjoglekar246 which repo are you interested in working on, keras-processing, or keras-applications?

@taehoonlee would you be interested in becoming owner of the keras-applications repo, since you have done a lot of work on applications so far?

fchollet commented 6 years ago

@farizrahman4u @ozabluda @ahundt do you have any interest in getting involved too (see comment above)?

srjoglekar246 commented 6 years ago

keras-processing sounds good then!

taehoonlee commented 6 years ago

@fchollet, I'd love to! I wonder if you intend to separate tests for the cores and the applications (as described in "Faster CI runs, better test coverage"). My minor concern is that the application tests will not be triggered whenever the cores are updated if the applications are unbundled.

fchollet commented 6 years ago

My minor concern is that the application tests will not be triggered whenever the cores are updated if the applications are unbundled.

Testing applications when changing layers would fall into "integration tests", not unit tests. Issues with the layers should be caught by the layers unit tests. We could add one application-related integration test to fix this.

taehoonlee commented 6 years ago

OK, I understand.

srjoglekar246 commented 6 years ago

@fchollet Does it make sense to start work on this with a new personal repo and then move it all to one owned by keras-team?

farizrahman4u commented 6 years ago

@fchollet

They would be listed as a dependency of keras, and would still be importable from e.g. keras.preprocessing.

For applications, this would be the other way around right? keras would be a dependency of applications.

There is some "preprocessing" code in applications as well, (imagenet_utils). Would that be moved to applications repo or preprocessing repo?

What about the examples? Separate repo?

Overall I think this is great. I use the numpy-only stuff in projects that do not actually use keras, refactoring it into a seperate repo would be really useful.

srjoglekar246 commented 6 years ago

@farizrahman4u I see that you maintain one of Keras' spin-offs. I will probably look over the commits in your repo to get an idea of TODOs :-P

fchollet commented 6 years ago

For applications, this would be the other way around right? keras would be a dependency of applications.

The split modules need to be importable from Keras, so they would be a dependency of Keras. We're not changing any API.

That means that keras-applications will need to dynamically load Keras at runtime. This is required anyway since we'll want the same applications module to work for tf.keras too!

There is some "preprocessing" code in applications as well, (imagenet_utils). Would that be moved to applications repo or preprocessing repo?

That's application-specific, so it would stay in the keras-applications repo.

What about the examples? Separate repo?

I think the examples are fine as they are, we don't have the same constraints there. They're already well-separated from the codebase itself (and the API), and they're not tested so they don't inflict additional load on CI. They're also not replicated in tf.keras so we don't need a new "single source of truth" mechanism for them -- they're already in one place.

farizrahman4u commented 6 years ago

It's time for...

Dref360 commented 6 years ago

That's application-specific, so it would stay in the keras-applications repo.

A lot of people still use this function for their preprocessing.

fchollet commented 6 years ago

In terms of timeline, I'll do the split, and it will take place in late March or early April.

ahundt commented 6 years ago

This process can be managed fairly smoothly if you'd like to do it, but I'd like to provide some food for thought based on lessons I've taken from seeing this process unfold in several other widely used projects (boost, homebrew, and ROS are examples). Here are some questions I'd ask and answer:

What concrete benefits are there from a user focused perspective?
Are these benefits a significant proportion of users actually care about?
Do these benefits to users offset the cost and complexity of added install and versioning steps?
Is there is a risk of some unforeseen problems + extra user issues?
How upset will users be about having to "fix" stuff when they update?

At this time, the effects of items like "easier to sync with tf.keras" seem murky with respect to users and might benefit from better definition and/or alternative solutions. What if CI problems were solved with backend infrastructure changes that would be invisible to users? Could "owners" simply be assigned keras subdirectories? Everything might be totally worthwhile in the end, I'm just hoping to give perspective on alternative options.

The project might also consider taking the following steps, if appropriate. I believe they turned out to be enormously helpful on other projects:

Establish a clear versioning + dependency updating process in advance
Responsibilities for between repository breakage should be clearly defined
- Updates to one repository will very frequently break others in tricky ways.
Avoid assuming "you should always track either master / the latest release" will be feasible for everyone.
- The homebrew project split into a bunch of "modular" repositories and two years later merged back down to two repositories (the "brew" app and "recipe" installer scripts) to better manage the manifold dependency problems that emerged.
Make a clear policy on how breakage between versions will be handled (or not).
Think about how additional policy changes will be made if the process doesn't turn out as expected.
Provide clear messages and automatic detection / acquisition of the correct dependencies with a way to override it.

Some of the above can probably come from pip so perhaps it isn't as big deal in this case when compared to others I'm familiar with.

Example issues that might come up include:

Import errors mysterious to users who aren't very experienced with python.
Someone will inevitably want to use their preprocessing workflow on a past version with the current Keras version and vice versa, and then post about the API conflicts they encounter.

Hopefully this information is helpful with navigating the process for any approach you decide to take!

fchollet commented 6 years ago

@ahundt thanks for the detailed thoughts. These are all very good points.

the effects of items like "easier to sync with tf.keras" seem murky with respect to users

The idea is that this is a refactoring change (refactoring to get less code duplication): it makes life easier for devs (which ultimately had indirect user benefits) while not affecting the experience at all for end users.

Specifically, we're trying to:

avoid code duplication across tf.keras and keras-team/keras (duplication ultimately could create issues for end users of either tf.keras or keras-team/keras, since the same versioning problems you mention still exist in this setup and are arguably worse)
make CI faster, which helps contributors

I agree that we need a clear plan to make sure that the user experience is not affected. I think automatic dependency management (pip) solves most issues.

Establish a clear versioning + dependency updating process in advance

We should push out new releases of preprocessing and applications every time we do a Keras release. Inversely, having the new modules do independent releases doesn't seem to be an issue, assuming that the new releases keep backwards compatibility (which they should).

Let's say current Keras is 2.1.8 and current preprocessing is 1.0.1. We want to update Keras. Then we:

1) release preprocessing 1.0.2 which lists keras>=2.1.8 as a dependency 2) (same for applications) 3) release keras 2.1.9 which lists preprocessing 1.0.2 as a dependency

This implies that the bleeding edge version of preprocessing and applications is always compatible both with the latest Keras release and with the bleeding edge of Keras.

Responsibilities for between repository breakage should be clearly defined

Thankfully we're in a situation where this will not happen. If this was a risk we would not do it. In our case preprocessing and applications are very well scoped and separate from core Keras. Keras doesn't depend on them at all, and they're just consumers of a stable subset of the Keras API (Sequence and layers/models respectively). Any change in Keras that would break the new modules would be in itself a bug, since Keras isn't supposed to make breaking changes to these APIs. In particular the new modules do not and should not consume any private/unstable Keras API (which is not an issue at all given how little coupling there is between the modules).

What we should do is CI-test the new modules both against the last release of Keras, and against the bleeding edge version of Keras. That way we ensure that changes in bleeding-edge modules don't break any existing users, and that they are ready to be used with bleeding-edge Keras. This guarantees an additional layer of stability.

(the reverse is not a problem since the Keras logic does not depend on the new modules, the pip dependency will only be to keep the existing namespace keras.preprocessing/etc).

Avoid assuming "you should always track either master / the latest release" will be feasible for everyone.

I think this is solved by having the new modules be compatible with both the bleeding edge and the latest release.

Think about how additional policy changes will be made if the process doesn't turn out as expected.

If we have intractable issues, we can always revert to the current state of affairs at any time, since:

What we're doing, and the reverse op, would be 100% API-compatible with the past
Doing so is just a matter of copy/pasting code, so can be done quickly and without issues

Provide clear messages and automatic detection / acquisition of the correct dependencies with a way to override it.

pip + clear, actionable error messages in case of ImportError should suffice here.

ahundt commented 6 years ago

This sounds like a pretty reasonable plan to me. I'll just mention a couple additional potential pain points to consider based on your comments:

In particular the new modules do not and should not consume any private/unstable Keras API (which is not an issue at all given how little coupling there is between the modules).

This would definitely mitigate many potential issues. I think to make this happen the following steps might be worthwhile:

Perhaps some currently private APIs might be worth making public in this case to avoid code duplication
Have travis check for and prevent external calls with leading underscores.
Have something in the readme on how to request an internal API be made external, this may simply be one sentence that goes in with the existing API change process.

Responsibilities for between repository breakage should be clearly defined

Thankfully we're in a situation where this will not happen. If this was a risk we would not do it.

I'll mention a couple specific examples of conflicts that showed up later in travis. keras-contrib actually broke quite a few times over the first few months due to internal keras changes for which internal calls were not easily avoidable without duplication. Examples include:

imagenet_utils was one major source when the _obtain_input_shape() API changed
_preprocess_conv2d_input() was removed
travis changes to CI itself
keras changes to .travis.yml
dylib breakage during travis setup

Everything can be resolved if there are cycles available to fix it, I just wanted to give a couple of specific examples of pain point categories that will multiply as the number of repositories increases.

I think this is solved by having the new modules be compatible with both the bleeding edge and the latest release.

Someone, or many individuals might deploy code in production and not be able to update for one valid reason or another. One option is to have periodic LTS releases are the typical way to meet the needs of groups like that, but it isn't the only way to go.

I hope that helps, overall your proposed plan sounds reasonable! Thanks for taking my ideas into consideration.

fchollet commented 6 years ago

I've been delaying doing this because I wanted to think about it for a while and make sure this was the right thing to do. Thanks everyone for the various comments and feedback! Now, let's make a decision.

@Dref360 @taehoonlee: should we do this split of preprocessing and applications into separate repositories, and would you take ownership of the split repos? (@Dref360 for preprocessing and @taehoonlee for applications?

We'll make a decision asap based on your answers, and do the split next week if the decision is to do it.

Dref360 commented 6 years ago

We should definitely do the split for keras.applications, which would be the "official" keras-zoo.

For keras.preprocessing, I think your motivations are valid. And the community will be much more easier to manage with it. I'm kinda worried about the

That means that keras-applications will need to dynamically load Keras at runtime.

We should be very cautious with that. A user that tries to do import keras.preprocessing and gets Cannot find keras with a traceback that goes through keras-preprocessing repositery will be confused.

Also, we would need CI for multiple keras versions and deprecate the versions overtime.

So overall I would like the repo to move foward with the split and we should follow every @ahundt advices. I think all of them are valid concerns.

On the ownership, I can take care of keras.preprocessing, I would need some help with keras.preprocessing.sequence since it's not my main area of research. I am open to get some help from other contributors as needed. :)

fchollet commented 6 years ago

Cool, thanks @Dref360!

@taehoonlee, any thoughts?

Dref360 commented 6 years ago

Also, we have tons of PR waiting for those modules. When the split is done, we would have to go through it.

taehoonlee commented 6 years ago

@fchollet, It would be nice if we could do the split. Especially, I strongly agree with:

the justification, "They're not part of the core Keras API" [1],
the advantage for contributors, "Faster CI runs" [1],
the advantage for @fchollet, "Easier to sync across keras-team/keras and tf.keras" [1].

There are no benefits as @ahundt pointed out. However, in my opinion, this is not a problem because the intention of the split is to facilitate management without harming the user experience.

Of course, the potential issues may actually occur. I think that the issues are not the reasons why we should not split, but the matters we need to care about in the future.

taehoonlee commented 6 years ago

And, I'd like to take ownership of the keras-team/keras-applications.

ahundt commented 6 years ago

keras-contrib currently has a specific, clear example (travis link) of the sort of breakage I mentioned might be a concern in unbundled modules. I believe master moved some files around and removed some legacy code, and now the keras-contrib build doesn't work.

https://github.com/keras-team/keras-contrib/pull/263

I haven't figured out a fix yet, but haven't spent much time on it either. This is just for awareness that this sort of thing would become much more common, and the person who makes the change many not realize or have a chance to fix the issue downstream. This sort of thing also doesn't show up until travis automatically re-runs the dependency, which in the case of keras-contrib is up to a 24 hour delay.

taehoonlee commented 6 years ago

@ahundt, The first one arose due to a dependency to legacy codes and seems to be already resolved by you. The preprocessing and applications exploit public and stable APIs and do not depend to legacy codes.

The second one is orthogonal to the split. It even fails to import keras (AttributeError: module 'pandas' has no attribute 'core'). Why don't you try pip install tensorflow==1.7 or python: 3.6 or export LD_LIBRARY_PATH based on the master CI config?

ahundt commented 6 years ago

@taehoonlee yes, thanks. The purpose of my post was to provide a clear example so tradeoffs are clear in a more practical capacity than vague what-ifs. This is an admittedly very solvable problems that will need to be resolved around n times for n repositories.

fchollet commented 6 years ago

We'll do it then.

There's a technical problem we'll need to solve:

We need to be able to set what Keras module to use when importing preprocessing (or applications) from inside keras (or from inside tf.keras), so that things like keras.backend.set_image_data_format() get reflected in keras.preprocessing.
This is doable via a method like preprocessing.set_keras_module(sys[__name__])
This implies that we can, in this order: 1) import preprocessing without any reference to Keras 2) set the Keras module right after the import
However Iterator(Sequence) leverages a Keras symbol in the class definition (Sequence), which we would need at import time (so before we would have the Keras module available).

A possible solution is to ship Sequence with preprocessing. @Dref360, what do you think?

fchollet commented 6 years ago

For the record, I'm thinking of something like:

Contents of keras.preprocessing.image:

try:
   import keras_preprocessing  # Note: what should we name these modules?
   _check_version(keras_preprocessing)
   keras_preprocessing.set_keras_module(sys[__name__])
except ImportError:
   raise ImportError('...')

Iterator = keras_preprocessing.image.Iterator
# etc...

fchollet commented 6 years ago

Nevermind, we can structure the code in a way such that we can import the submodule and call set_keras_module before we enter the files where we use Sequence. However, as a result users won't be able to import standalone symbols from image, e.g. from keras_preprocessing.image import Iterator, rather you'd have to do from keras_preprocessing import image; image.Iterator or keras_preprocessing.image.Iterator.

I don't think it matters, because presumably the module would get used from Keras almost every time (from keras.preprocessing.image import Iterator would still work), and also because from keras_preprocessing import image is the best practice in terms of Python style.

fchollet commented 6 years ago

Side note: after consideration I think it would have been a bad idea to package Sequence into preprocessing, because:

We have the preprocessing.sequence submodule already, big semantic collision that would cause confusion
Sequence is used in training.py so it is more tightly coupled with the rest of the codebase, and it could be dangerous to unbundle it

fchollet commented 6 years ago

I've implemented this initial plan, but I think we're going to have to go with a different design to make it possible to use tensorflow (which will set the keras_preprocessing keras module to tf.keras) and keras in the same session without subtle problems.

fchollet commented 6 years ago

I've tried a couple different designs, but things are turning out to be incredibly complicated in practice when you start converting Keras to use keras_preprocessing which would refer to keras (in a way that would work coherently with tf.keras...).

I'm currently leaning towards a more verbose / advanced design, that would be hopefully safe:

in keras.preprocessing.__init__, import keras submodules utils, backend, then import keras_preprocessing
call keras_preprocessing.set_submodules(utils=utils, backend=backend) (defined in keras_preprocessing.__init__
in the keras_preprocessing.__init__ namespace, set keras_utils = utils, keras_backend = backend
in keras.preprocessing.image, do from . import keras_backend and use that

fchollet commented 6 years ago

Implemented that last design (more or less) and I believe it's safe.

One side-effect is that keras_preprocessing isn't usable in a standalone way (otherwise we'd have a circular import once Keras relies on it). There are 2 ways to use it:

import keras 
from keras_preprocessing import image  # for instance

Or (preferred):

from keras import preprocessing  # this is literally keras_preprocessing

fchollet commented 6 years ago

The first stage of the split is executed (creating independent modules for preprocessing and applications). The next stage is to get Keras and tf.keras to import them.

@taehoonlee in the process I've run into a NASNet bug -- error wrt incorrect number of layers when loading ImageNet weights. I've verified that the weights files are up to date.

taehoonlee commented 6 years ago

@fchollet, The error has been resolved in keras-team/keras-applications#1.

fchollet commented 6 years ago

Thanks for the PR, merged!

@Dref360 @taehoonlee as you have noticed the repos are created, and you have write rights on them! Here are a few TODOs I haven't done yet (true for both repos), that you can work on if you like:

Run CI against both the latest Keras release and Keras at HEAD. Currently we only run against the release. This should just be a small change to the travis.yml file.
Remove the unit tests for these modules from Keras tests, and instead add a couple high-level integration tests that should run quickly (e.g. for applications we could try loading a single model).

And please check out and review any PRs that our contributors send out 👍

taehoonlee commented 6 years ago

Maybe it is time to release 2.1.7, @fchollet?

fchollet commented 6 years ago

@taehoonlee yes, I will release this week. It will be 2.2.0 due to a large amounts of changes in this release.

ahundt commented 6 years ago

@fchollet Before release please consider checking on https://github.com/keras-team/keras/pull/7766

keras-team / keras

Proposal: Unbundling Keras modules into separate repositories #9256