fchollet closed this issue 6 years ago.
We could also do the same with `dataset`. Less important though, since `dataset` is for debugging/demo purposes, and not for doing any real work.
Minor concern: since `preprocessing` is where we push all of our data augmentation API, should data augmentation layers (proposed in https://github.com/keras-team/keras/pull/9056/files) be in keras/preprocessing, or, since they're layers, should we push them to Keras?

We should define clear boundaries for each repo. Also, will other repos be allowed to add dependencies?
And most examples in keras/examples will rely on all 3 repos. Should we add a separate setup.py for this folder to avoid circular dependencies?
And most examples in keras/examples will rely on all 3 repos
They would just depend on Keras. Preprocessing/etc should still be importable from Keras. This change would be a pure refactoring, where the code is moved to a different place but the APIs don't change.
should Data augmentation Layers (proposed https://github.com/keras-team/keras/pull/9056/files) be in the keras/preprocessing or since it's a layer, we should push it on Keras?
For now I would suggest Keras contrib. But conceptually a layer would definitely be in Keras, not keras-processing, regardless of what it does. The question is thus whether we want data augmentation layers as part of the core Keras API.
Okay, all fine then!
If anybody hasn't begun working on this, I would like to give it a go!
The two new repositories will need "owners". An owner is someone who has write rights and bears responsibility for reviewing PRs (and merging if they think it should be merged), and to a large extent, for answering issues. I will still give guidance on API choices, but that will be limited to user-facing APIs decisions.
@srjoglekar246 which repo are you interested in working on, keras-processing or keras-applications?
@taehoonlee would you be interested in becoming owner of the keras-applications repo, since you have done a lot of work on applications so far?
@farizrahman4u @ozabluda @ahundt do you have any interest in getting involved too (see comment above)?
keras-processing sounds good then!
@fchollet, I'd love to! I wonder if you intend to separate tests for the cores and the applications (as described in "Faster CI runs, better test coverage"). My minor concern is that the application tests will not be triggered whenever the cores are updated if the applications are unbundled.
My minor concern is that the application tests will not be triggered whenever the cores are updated if the applications are unbundled.
Testing applications when changing layers would fall into "integration tests", not unit tests. Issues with the layers should be caught by the layers unit tests. We could add one application-related integration test to fix this.
OK, I understand.
@fchollet Does it make sense to start work on this with a new personal repo and then move it all to one owned by keras-team?
@fchollet
They would be listed as a dependency of keras, and would still be importable from e.g. keras.preprocessing.
For applications, this would be the other way around right? keras would be a dependency of applications.
There is some "preprocessing" code in applications as well (imagenet_utils). Would that be moved to the applications repo or the preprocessing repo?
What about the examples? Separate repo?
Overall I think this is great. I use the numpy-only stuff in projects that do not actually use Keras; refactoring it into a separate repo would be really useful.
@farizrahman4u I see that you maintain one of Keras' spin-offs. I will probably look over the commits in your repo to get an idea of TODOs :-P
For applications, this would be the other way around right? keras would be a dependency of applications.
The split modules need to be importable from Keras, so they would be a dependency of Keras. We're not changing any API.
That means that keras-applications will need to dynamically load Keras at runtime. This is required anyway, since we'll want the same applications module to work for tf.keras too!
There is some "preprocessing" code in applications as well, (imagenet_utils). Would that be moved to applications repo or preprocessing repo?
That's application-specific, so it would stay in the keras-applications repo.
What about the examples? Separate repo?
I think the examples are fine as they are; we don't have the same constraints there. They're already well-separated from the codebase itself (and the API), and they're not tested, so they don't inflict additional load on CI. They're also not replicated in tf.keras, so we don't need a new "single source of truth" mechanism for them -- they're already in one place.
That's application-specific, so it would stay in the keras-applications repo.
A lot of people still use this function for their preprocessing.
In terms of timeline, I'll do the split, and it will take place in late March or early April.
This process can be managed fairly smoothly if you'd like to do it, but I'd like to provide some food for thought based on lessons I've taken from seeing this process unfold in several other widely used projects (boost, homebrew, and ROS are examples). Here are some questions I'd ask and answer:
At this time, the effects of items like "easier to sync with tf.keras" seem murky with respect to users and might benefit from better definition and/or alternative solutions. What if CI problems were solved with backend infrastructure changes that would be invisible to users? Could "owners" simply be assigned keras subdirectories? Everything might be totally worthwhile in the end, I'm just hoping to give perspective on alternative options.
The project might also consider taking the following steps, if appropriate. I believe they turned out to be enormously helpful on other projects:
Some of the above can probably come from pip, so perhaps it isn't as big a deal in this case as in others I'm familiar with.
Example issues that might come up include:
Hopefully this information is helpful with navigating the process for any approach you decide to take!
@ahundt thanks for the detailed thoughts. These are all very good points.
the effects of items like "easier to sync with tf.keras" seem murky with respect to users
The idea is that this is a refactoring change (refactoring to get less code duplication): it makes life easier for devs (which ultimately has indirect user benefits) while not affecting the experience at all for end users.
Specifically, we're trying to:
I agree that we need a clear plan to make sure that the user experience is not affected. I think automatic dependency management (pip) solves most issues.
Establish a clear versioning + dependency updating process in advance
We should push out new releases of `preprocessing` and `applications` every time we do a Keras release. Inversely, having the new modules do independent releases doesn't seem to be an issue, assuming that the new releases keep backwards compatibility (which they should).

Let's say current Keras is 2.1.8 and current `preprocessing` is 1.0.1. We want to update Keras. Then we:

1) release `preprocessing` 1.0.2, which lists `keras>=2.1.8` as a dependency
2) (same for `applications`)
3) release `keras` 2.1.9, which lists `preprocessing` 1.0.2 as a dependency

This implies that the bleeding edge version of `preprocessing` and `applications` is always compatible both with the latest Keras release and with the bleeding edge of Keras.
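As an illustration of step 1), the dependency declaration could live in the split package's `setup.py`. This is only a sketch: the package name, version numbers, and pin are taken from the hypothetical example above, not from any actual release.

```python
# Hypothetical setup.py for the split-out preprocessing package,
# mirroring step 1) above (names and versions are illustrative).
from setuptools import setup, find_packages

setup(
    name='keras-preprocessing',
    version='1.0.2',
    packages=find_packages(),
    # Step 1): declare the minimum Keras release this version of the
    # package is compatible with.
    install_requires=['keras>=2.1.8'],
)
```

With this in place, pip would pull in a compatible Keras automatically when installing the preprocessing package on its own.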
Responsibilities for between repository breakage should be clearly defined
Thankfully we're in a situation where this will not happen. If this was a risk we would not do it. In our case, `preprocessing` and `applications` are very well scoped and separate from core Keras. Keras doesn't depend on them at all, and they're just consumers of a stable subset of the Keras API (`Sequence` and layers/models, respectively). Any change in Keras that would break the new modules would be in itself a bug, since Keras isn't supposed to make breaking changes to these APIs. In particular, the new modules do not and should not consume any private/unstable Keras API (which is not an issue at all given how little coupling there is between the modules).
What we should do is CI-test the new modules both against the last release of Keras, and against the bleeding edge version of Keras. That way we ensure that changes in bleeding-edge modules don't break any existing users, and that they are ready to be used with bleeding-edge Keras. This guarantees an additional layer of stability.
(the reverse is not a problem since the Keras logic does not depend on the new modules; the pip dependency will only be there to keep the existing namespace `keras.preprocessing`/etc).
Avoid assuming "you should always track either master / the latest release" will be feasible for everyone.
I think this is solved by having the new modules be compatible with both the bleeding edge and the latest release.
Think about how additional policy changes will be made if the process doesn't turn out as expected.
If we have intractable issues, we can always revert to the current state of affairs at any time, since:
Provide clear messages and automatic detection / acquisition of the correct dependencies with a way to override it.
`pip` + clear, actionable error messages in case of `ImportError` should suffice here.
This sounds like a pretty reasonable plan to me. I'll just mention a couple additional potential pain points to consider based on your comments:
In particular the new modules do not and should not consume any private/unstable Keras API (which is not an issue at all given how little coupling there is between the modules).
This would definitely mitigate many potential issues. I think to make this happen the following steps might be worthwhile:
Responsibilities for between repository breakage should be clearly defined
Thankfully we're in a situation where this will not happen. If this was a risk we would not do it.
I'll mention a couple of specific examples of conflicts that showed up later in Travis. keras-contrib actually broke quite a few times over the first few months due to internal Keras changes for which internal calls were not easily avoidable without duplication. Examples include:

- `_obtain_input_shape()` API changed

Everything can be resolved if there are cycles available to fix it; I just wanted to give a couple of specific examples of pain point categories that will multiply as the number of repositories increases.
I think this is solved by having the new modules be compatible with both the bleeding edge and the latest release.
Someone, or many individuals, might deploy code in production and not be able to update for one valid reason or another. Periodic LTS releases are the typical way to meet the needs of groups like that, but they aren't the only way to go.
I hope that helps, overall your proposed plan sounds reasonable! Thanks for taking my ideas into consideration.
I've been delaying doing this because I wanted to think about it for a while and make sure this was the right thing to do. Thanks everyone for the various comments and feedback! Now, let's make a decision.
@Dref360 @taehoonlee: should we do this split of `preprocessing` and `applications` into separate repositories, and would you take ownership of the split repos (@Dref360 for `preprocessing` and @taehoonlee for `applications`)?
We'll make a decision asap based on your answers, and do the split next week if the decision is to do it.
We should definitely do the split for keras.applications, which would be the "official" keras-zoo.
For keras.preprocessing, I think your motivations are valid. And the community will be much easier to manage with it. I'm kinda worried about this:
That means that keras-applications will need to dynamically load Keras at runtime.
We should be very cautious with that.
A user that tries to do `import keras.preprocessing` and gets `Cannot find keras` with a traceback that goes through the keras-preprocessing repository will be confused.
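A small sketch of how such an import guard could produce an actionable message instead of a raw traceback. The helper name and message here are hypothetical, not actual Keras code:

```python
import importlib

def import_with_hint(module_name, install_hint):
    """Import a module, raising an actionable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Re-raise with a message that tells the user what to do,
        # rather than exposing internals of the dependent package.
        raise ImportError('%s is required but could not be imported. %s'
                          % (module_name, install_hint))

# Succeeds for an installed module:
math_module = import_with_hint('math', 'It ships with Python.')
```

In the real split, `keras.preprocessing` could guard its import of the standalone package this way, so the failure clearly names the missing dependency instead of ending in a confusing traceback.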
Also, we would need CI for multiple keras versions and deprecate the versions overtime.
So overall I would like the repo to move forward with the split, and we should follow all of @ahundt's advice. I think all of his points are valid concerns.
On the ownership side, I can take care of keras.preprocessing; I would need some help with keras.preprocessing.sequence since it's not my main area of research. I am open to getting help from other contributors as needed. :)
Cool, thanks @Dref360!
@taehoonlee, any thoughts?
Also, we have tons of PRs waiting on those modules. When the split is done, we will have to go through them.
@fchollet, It would be nice if we could do the split. Especially, I strongly agree with having a single source of truth for `keras-team/keras` and `tf.keras` [1].

There are no direct user benefits, as @ahundt pointed out. However, in my opinion, this is not a problem, because the intention of the split is to facilitate management without harming the user experience. Of course, the potential issues may actually occur. I think that those issues are not reasons why we should not split, but matters we need to take care of in the future.

And, I'd like to take ownership of `keras-team/keras-applications`.
keras-contrib currently has a specific, clear example (travis link) of the sort of breakage I mentioned might be a concern in unbundled modules. I believe master moved some files around and removed some legacy code, and now the keras-contrib build doesn't work.
https://github.com/keras-team/keras-contrib/pull/263
I haven't figured out a fix yet, but I haven't spent much time on it either. This is just for awareness that this sort of thing would become much more common, and the person who makes the change may not realize, or may not have a chance to fix, the issue downstream. This sort of thing also doesn't show up until Travis automatically re-runs the dependency, which in the case of keras-contrib means up to a 24-hour delay.
@ahundt, The first one arose due to a dependency on legacy code and seems to have already been resolved by you. The preprocessing and applications modules exploit public and stable APIs and do not depend on legacy code.

The second one is orthogonal to the split. It even fails to import keras (`AttributeError: module 'pandas' has no attribute 'core'`). Why don't you try `pip install tensorflow==1.7`, or `python: 3.6`, or `export LD_LIBRARY_PATH` based on the master CI config?
@taehoonlee yes, thanks. The purpose of my post was to provide a clear example, so the tradeoffs are clear in a more practical capacity than vague what-ifs. This is an admittedly very solvable problem that will need to be resolved around *n* times for *n* repositories.
We'll do it then.
There's a technical problem we'll need to solve: we need to be able to import `preprocessing` (or `applications`) from inside `keras` (or from inside `tf.keras`), so that things like `keras.backend.set_image_data_format()` get reflected in `keras.preprocessing`.

One possible solution is a `preprocessing.set_keras_module(sys.modules[__name__])` hook: 1) import `preprocessing` without any reference to Keras, 2) set the Keras module right after the import. The problem is that `Iterator(Sequence)` leverages a Keras symbol in the class definition (`Sequence`), which we would need at import time (so before we would have the Keras module available). A possible solution is to ship `Sequence` with `preprocessing`. @Dref360, what do you think?
For the record, I'm thinking of something like the following for the contents of `keras.preprocessing.image`:

```python
try:
    import keras_preprocessing  # Note: what should we name these modules?
    _check_version(keras_preprocessing)
    keras_preprocessing.set_keras_module(sys.modules[__name__])
except ImportError:
    raise ImportError('...')

Iterator = keras_preprocessing.image.Iterator
# etc...
```
Never mind, we can structure the code in a way such that we can import the submodule and call `set_keras_module` before we enter the files where we use `Sequence`. However, as a result users won't be able to import standalone symbols from `image`, e.g. `from keras_preprocessing.image import Iterator`; rather, you'd have to do `from keras_preprocessing import image; image.Iterator` or `keras_preprocessing.image.Iterator`.

I don't think it matters, because presumably the module would get used from Keras almost every time (`from keras.preprocessing.image import Iterator` would still work), and also because `from keras_preprocessing import image` is the best practice in terms of Python style.
Side note: after consideration I think it would have been a bad idea to package `Sequence` into `preprocessing`, because:

1) there is a `preprocessing.sequence` submodule already, a big semantic collision that would cause confusion
2) `Sequence` is used in `training.py`, so it is more tightly coupled with the rest of the codebase, and it could be dangerous to unbundle it

I've implemented this initial plan, but I think we're going to have to go with a different design to make it possible to use `tensorflow` (which will set the `keras_preprocessing` keras module to `tf.keras`) and `keras` in the same session without subtle problems.

I've tried a couple of different designs, but things are turning out to be incredibly complicated in practice when you start converting Keras to use `keras_preprocessing`, which would refer to `keras` (in a way that would work coherently with `tf.keras`...).
I'm currently leaning towards a more verbose / advanced design that would hopefully be safe:

1) In `keras.preprocessing.__init__`, import the keras submodules `utils` and `backend`, then import `keras_preprocessing`
2) Call `keras_preprocessing.set_submodules(utils=utils, backend=backend)` (defined in `keras_preprocessing.__init__`)
3) In the `keras_preprocessing.__init__` namespace, set `keras_utils = utils` and `keras_backend = backend`
4) In `keras_preprocessing.image`, do `from . import keras_backend` and use that

Implemented that last design (more or less) and I believe it's safe.
One side-effect is that `keras_preprocessing` isn't usable in a standalone way (otherwise we'd have a circular import once Keras relies on it). There are 2 ways to use it:

```python
import keras
from keras_preprocessing import image  # for instance
```

Or (preferred):

```python
from keras import preprocessing  # this is literally keras_preprocessing
```
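A runnable toy sketch of the `set_submodules` design, with plain module objects standing in for the real packages (nothing here is actual Keras code):

```python
import types

# Stand-in for the standalone keras_preprocessing package.
keras_preprocessing = types.ModuleType('keras_preprocessing')

def set_submodules(utils=None, backend=None):
    # Stash the host framework's submodules in the package namespace,
    # as keras.preprocessing.__init__ (or tf.keras's equivalent) would.
    keras_preprocessing.keras_utils = utils
    keras_preprocessing.keras_backend = backend

keras_preprocessing.set_submodules = set_submodules

# The host injects its submodules right after importing the package;
# here a fake backend stands in for keras.backend.
fake_backend = types.SimpleNamespace(
    image_data_format=lambda: 'channels_last')
keras_preprocessing.set_submodules(utils=None, backend=fake_backend)

# Code inside keras_preprocessing then reads the injected backend:
print(keras_preprocessing.keras_backend.image_data_format())
```

Because each host injects its own submodules, the standalone package never imports Keras itself, which avoids the circular-import problem described above.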
The first stage of the split is executed (creating independent modules for `preprocessing` and `applications`). The next stage is to get Keras and `tf.keras` to import them.
@taehoonlee in the process I've run into a NASNet bug -- an error about an incorrect number of layers when loading ImageNet weights. I've verified that the weights files are up to date.
@fchollet, The error has been resolved in keras-team/keras-applications#1.
Thanks for the PR, merged!
@Dref360 @taehoonlee as you have noticed, the repos are created and you have write rights on them! Here are a few TODOs I haven't done yet (true for both repos) that you can work on if you like:
And please check out and review any PRs that our contributors send out 👍
Maybe it is time to release 2.1.7, @fchollet?
@taehoonlee yes, I will release this week. It will be 2.2.0 due to the large amount of changes in this release.
@fchollet Before release please consider checking on https://github.com/keras-team/keras/pull/7766
We should consider moving the modules `preprocessing` and `applications` to separate repositories under `keras-team`: `keras-team/preprocessing` and `keras-team/applications`. They would be listed as a dependency of `keras`, and would still be importable from e.g. `keras.preprocessing`.

Why?

- single source of truth for `keras-team/keras` and `tf.keras` (shared dependencies instead of code duplication)

Any comments or concerns? Who would be interested in taking greater ownership of these repos, should they materialize?