Oh, and talking about Blocks development, I think that we also need to shift focus a bit to refactoring instead of writing new code. We've been adding lines at a nearly linear rate, and we can't keep doing that forever if we want to keep the framework "lean and mean" as it was initially intended.
@bartvm, shifting focus to refactoring requires `blocks-contrib`, because people need a place to share the code.
Will comment on the rest soon.
Sounds good to me. I can lend a hand in unit test cleaning/refactoring.
@EderSantana That's pretty much exactly the workflow I had in mind, thanks for the reference!
@rizar Yes, I'll try and find the time to set it up. It's a pretty time-intensive thing to do because it requires a lot of file renaming, import changes, etc. but I'll do it this weekend.
@vdumoulin Great, thanks :)
Is there a reason that we wouldn't use `master` as `dev`, and instead have a `maintenance-0.1` branch? With most projects I expect to check out the master branch and have the absolute latest and greatest, and have to do otherwise if I want something older.
That works for me, so I guess we could basically follow the workflow that @EderSantana linked to, but with `master` and `dev` swapped to `stable` and `master`?
The two-branch system is a good idea, I think. Also, having `master` as the branch to which one should merge sounds good to me.
I am not sure I see why we should freeze the `master` branch before conferences when there is a stable one. I would rather fork a new stable branch 4-6 weeks before a conference and suggest that people use it.
As for important issues before 0.1 is released, I would like to link #514. That's quite a nasty and fundamental one.
As for `blocks-contrib`: you don't have to move everything there in one go, especially since tensions about what should stay and what should go will start immediately. Move some old crap first, e.g. `markov_chain.py`, set up Travis to test it, and we can already redirect people to it and gradually clean up Blocks. In addition, huge PRs are just hell to review, as you know.
By the way, I would like to raise the question of the reviewing policy in `blocks-contrib`. It would be very nice to have code owners review contributions to their own code themselves. We can keep an eye on whether the repo contains reasonable things, but there is no way we can review each PR the way we do now and still keep contributing to `blocks-contrib` easy and fast. I wish there were a way to give merge rights with directory restrictions, so that people could merge only PRs that touch their own code.
As long as someone else does the review, I'm fine with pressing the merge button at the end, so I guess living without complicated access rights in `blocks-contrib` is okay.
The feature freeze is important IMHO, because you can't just merge `master` back into `stable` and actually expect it to be stable. It needs a few weeks to cool down and for problems with the new features to show up and be fixed. I'd rather do this in a feature-frozen development branch than in the stable branch, because it would actually not be stable during that time. You could also do it in a separate branch, but then you end up with 3 branches: a stable branch, a development branch, and a stabilizing feature-frozen development branch. Hotfixes might need to be merged into 3 instead of 2 branches then, and fixes to the feature-frozen branch need to be merged into 2 instead of 1. That complicates things a lot.
I don't think that feature freezing is a big issue. You can still work on something in your own fork/branch, you just need to wait a few weeks before rebasing and merging into the main repo, which isn't that much to ask. One of the main reasons I like it though is because it sends a clear signal to people that we would prefer them to focus on bug squashing and documentation writing for a bit, instead of coming up with more and more features.
TL;DR: `master` for the latest WORKING version, `dev` for a centralized place where new features are baked, Zenodo DOIs for historical stable (and citable) versions.
@bartvm and @dwf, indeed, git-flow is based on just a few extra git commands to keep `master` as the latest release and move the working branch to `dev`. Note that the approach in scikit-learn is simply having a master and a bunch of older stable versions.
The approach in pylearn2, on the other hand, is having everything in master and new stuff in the sandbox folder. But note that this is only possible for features that don't touch other stuff. The lack of an actual `dev` branch can cause issues like what happened recently with the dataset iterators, where not everybody liked the changes.
So basically, I would suggest having `0.X` branches for historical releases, `master` for the latest WORKING branch, and a `dev` branch for experimental changes that do not necessarily work everywhere. Here is why I think this is important:

- Making a latest and hottest working `master` branch is pretty good, and that's what most people are expecting in research repos, like @dwf said.
- `dev` can be replaced by individual forks, but a centralized `dev` branch is a nice reference for people like me, who don't work in the same lab as you guys, to be able to help with new stuff. In some repos, PRs can only be submitted to `dev`, for instance, and only bug fixes are accepted directly to `master`.

Anyway, there are lots of ways to do that; I'm sure you guys will pick a nice solution for everybody.
It seems to me that @dwf actually wants the `master` branch to be the one where development happens. @dwf, can you clarify?
Random question: do we want to have `dev`, `master`, etc. for blocks-contrib? On one hand, people might be upset when it breaks a week before a conference deadline; on the other hand, it is an additional pain in the ass.
@rizar I was definitely thinking that `master` should be the hard-hat area. That's how Theano does it, that's how lots of other projects do it (numpy, scipy, matplotlib, IPython, etc.). We were talking about a separate `dev` branch for pylearn2 mainly because of long-standing policies regarding `master`; I think in the absence of that baggage, making `master` the focus of all the action is perfectly alright.
I am a little unsure if blocks-contrib is a good idea. I'd actually say that a Blocks GitHub organization with lots of distinct projects would be the sensible way to implement the access control that we want.
Note that there are multiple kinds of use cases for blocks-contrib that I think we need to distinguish between:
It seems to me like we want 1) to always be up to date with the core Blocks repository and try and verify that frequently, and treat it as "extended family" to the stuff in core. With 2) and 3) we can afford to simply tell people the commit hash or release version of Blocks or Fuel that these last worked with, and leave it with the author/maintainer to make sure they still work. You might even be able to automate that as part of a Travis build (recording the commit hashes, I mean).
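Something like the following could do that recording during a Travis run (a rough sketch only; the clone locations and output file name are made up here):

```python
# Rough sketch: record the commit hashes of the Blocks and Fuel checkouts
# that a Travis run tested against, so an addon can advertise its last
# known-good versions. Paths and the output file name are illustrative.
import json
import subprocess


def git_head(repo_path):
    """Return the HEAD commit hash of a local git checkout."""
    return subprocess.check_output(
        ['git', 'rev-parse', 'HEAD'], cwd=repo_path).decode().strip()


if __name__ == '__main__':
    known_good = {'blocks': git_head('blocks'), 'fuel': git_head('fuel')}
    with open('known_good.json', 'w') as f:
        json.dump(known_good, f, indent=2, sort_keys=True)
```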
I think 2) and 3) are good candidates for being separate repositories under the blocks organization umbrella, with stuff meeting the criteria of 2) moving from a separate repo to blocks-contrib when it a) is felt to be widely and generally useful and b) meets code quality and test coverage standards such that it will not impose a gigantic maintenance burden. It remains an open question in my mind whether Blocks Travis builds should run tests from blocks-contrib, and I'd be curious what you guys think. I'm leaning towards yes.
Miscellany: we should make anyone with Blocks merge rights have merge rights on riff-raff projects of 2) and 3) so we can accept pull requests that bring them back into working order, if the authors go MIA. We could also try and set up a less frequent, perhaps weekly build-bot for these addons.
@dwf I'm happy with having `master` be the development branch.
tl;dr I'm proposing three repositories, slightly different from @dwf:

- A `blocks-extras` repo with drop-in extensions to the core, relatively well-tested and easy to install.
- A `blocks-contrib` repo with bits and pieces that can't go in `blocks-extras` but are still useful enough to keep around. Untested.
- A `blocks-examples` repo with runnable examples and reproductions of papers. These are tested selectively.

Long answer:
I agree that throwing everything into a single `blocks-contrib` repository might not be a good idea. I'm not sure about the distinction between 'generally useful extensions' and 'machine learning bits and pieces' though. I'm worried the distinction between the two can sometimes be a bit unclear. I also think there could be machine learning tidbits in (2) that we would want to monitor for compatibility with the core.
Perhaps we could make a split more based on the level of support? Although that is arguably also a bit arbitrary. Your (1) could be `blocks-extras`, which I would propose making a namespace package, mirroring the Blocks package structure within this namespace, so we'd have `blocks.extras.bricks`, `blocks.extras.extensions`, etc. This will make it trivial to move things between Blocks and `blocks-extras` (a matter of copy-paste followed by `s/.extras//g`). The criterion for inclusion is then basically "could potentially fit right into the Blocks core, but isn't general/well-tested enough (yet)". If things have stabilized for a few weeks/months and are used widely, we copy-paste them into the core. If something in the core falls into disuse, we copy-paste it into `blocks-extras`. This repo would be tested along with the core, and we'll try to make sure we don't break stuff (although we reserve the right to do so in case of a big overhaul of the core, in which case we might need a bit of time before bringing `blocks-extras` back up to date).
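To make that concrete, here's a rough sketch of how the blocks-extras repo could be laid out with setuptools namespace packages (package names and version are placeholders, and note the caveat that the core's `blocks/__init__.py` would then also have to be namespace-friendly, which is a real constraint to weigh):

```python
# Sketch of a possible layout for the blocks-extras repo as a setuptools
# namespace package (names and version are placeholders):
#
#   blocks-extras/
#       setup.py
#       blocks/
#           __init__.py   <- only: __import__('pkg_resources').declare_namespace(__name__)
#           extras/
#               __init__.py
#               bricks/
#               extensions/
#
# Assumption: the core `blocks` package would also have to declare the
# namespace for both distributions to coexist on one Python path.
from setuptools import setup, find_packages

setup(
    name='blocks-extras',
    version='0.0.1',
    namespace_packages=['blocks'],
    packages=find_packages(),
    install_requires=['blocks'],
)
```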
Then there's a lot of stuff that people might want to contribute which just isn't general enough, needs cleaning and/or tests, isn't maintained by anyone, etc. I think this is probably closest to your (2). It's still important to have this stuff lying around somewhere so that people don't need to reinvent the wheel (and maybe one day someone will clean it up and move it to `blocks-extras`). We could call this `blocks-experimental`, or maybe this could become `blocks-contrib`. We don't test this as part of the core, and people are expected to tag their code with the commit that things last worked with.
A third repository we should have is one where we have examples and reproductions of papers. This is where we could move the current `examples` directory. This won't be a Python package at all, just a collection of folders with Python files that people can learn from or use as starting points for their own projects. For this repository, I would test selectively: some important examples (like our current `mnist.py`) should be made sure to work and be tested as part of the core's Travis runs. However, larger examples (e.g. machine translation models) might take too long to compile to even test regularly, might need data that is not available on Travis, etc. These we can either leave untested (but tag with a commit) or test on a buildbot (if the lab ever gets Jenkins up and running). We could call this `blocks-projects` or `blocks-examples`.
@rizar I'm responding to https://github.com/bartvm/blocks/issues/520#issuecomment-84706834 here, because it's more relevant to this issue. You said:
> In general I should say I do not always like the line of code purity that you are trying to enforce these days. Blocks has a couple of holes because of which it does not deserve the nice 0.1 name, and this is one of them, and the transition from coding fast and trying things to slowly polishing what we have should be gradual, taking into account user needs and opinions. As I apparently fail to convince you to take a more flexible stance, I think it's time to define a Blocks management scheme.
I am by no means an expert in managing open source projects, but from the little experience I have with Pylearn2, I think that this line of "code purity" is necessary to make Blocks sustainable in the long-term.
I receive many suggestions and feature requests from people in the lab each week, and it's tempting to reply to all these requests with "Yes, a few lines here and there and we'll have something that will work well enough". What happens is that within weeks many people and other parts of the library have grown to rely on this functionality. Editing the code later on then requires changing dozens of examples, fixing tests, breaking people's code, etc. Sometimes it is better to say "No, this is a feature that will make it to Blocks eventually, but you're going to have to wait." It saves significant development time in the end.
The point of a 0.1 release, after all, is not that it is feature-complete. In fact, for me 0.1 means by definition that it is not feature-complete. A 0.1 release should do the basics, but do those basics well, and right now we don't, yet. Fundamental issues such as lazy initialization (#337, #384, #39, #31), graph cloning (roles of replacements, #514), the `Model` abstraction (#444), the inability to access the log during training (#460, #470, #476), lack of documentation, and subpar unit tests remain.
The reason I am adamant about taking development slowly is that we have such limited development power. As time goes on, I'm spending more time on patching things as we go along than on addressing what I feel are still fundamental issues. Although part of this is probably unavoidable as a framework matures, I do believe that in the long term Blocks would benefit from sticking to the basics for now.
The three-repository approach sounds pretty reasonable, provided we keep standards pretty high for `blocks-extras` and are pretty strict about recording the last working commit for `blocks-examples`. (I think it would be good practice to run it in a minimal conda environment and then include the package list as well.)
I'm also partial to a more "look before you leap" approach to library development -- regarding pylearn2, I was in the minority, and the resulting impact on my stress levels has been significant.
In order to make this discussion more concrete, do you have a sampling of the sorts of things that a) are in Blocks now but you think should be moved to blocks-extras, and b) features that people have asked for/suggested that you believe are better deferred until the basics are more fleshed out?
Things that should move to `blocks-extras`:

- Live plotting (`bin/blocks-plot` and the Bokeh plotting extension)

To `blocks-contrib`:

Things that should move to `blocks-projects`:

- The `blocks/examples` folder

That alone moves 15% of lines out of the core.

Features that have been suggested and/or had PRs made for them that should either be postponed or go into `blocks-extras`/`blocks-contrib`:

- `Checkpoint` and `Dump`
- The `Model` abstraction (#444)

I'm in favour of switching to the extras/contrib/projects organization.
I think we should clarify whose responsibility it is to maintain code in `blocks-projects`. We clearly can't be responsible for making sure things don't break as people pour more and more models into it, and we should not force ourselves to make sure changes in the core library don't break examples in `blocks-projects`.
Should we go as far as requiring that all examples in `blocks-projects` have a clearly defined maintainer, otherwise they may be removed from the repo?
Maybe that's too strict, and it's okay that certain examples aren't maintained as long as there exists a commit in `blocks` for which they work and this commit is clearly indicated.
@bartvm Those all seem like reasonable moves re: blocks-extras.
- Subclassing `Printing` in blocks-extras with something that has more bells and whistles. That said, I agree with @rizar that it would be nice to have it canonically addressed in core, at least eventually.
- Lazy initialization and `Model` seem like incredibly convoluted discussions. Is there consensus? If so, I think a roadmap is in order. If not, perhaps a roadmap is still in order, indexing the forks in that road...

@dwf Ah, misunderstanding: the tickets I'm highlighting aren't tickets that I necessarily think should be fixed; they're tickets that I believe are symptomatic of core functionality needing work, or tickets that shouldn't be addressed until certain parts of the core functionality have been refactored. That is, I don't think it makes any sense to put work into #464 and #467 considering the current state of serialization; we should be spending time on merging `Checkpoint` and `Dump` instead of trying to patch up the current, semi-broken implementation.
The lazy initialization discussion isn't too bad. I've actually started some work on a branch where bricks are expected to specify which argument is needed for which stage, as discussed, and that works nicely. How and where to initialize exactly is really just a matter of preference after that.
The `Model` discussion is indeed lengthy and convoluted. I'm not sure where we currently are regarding consensus. I have promised a counterproposal to #444, but haven't had the time to actually implement it.
A roadmap is pretty much what I hoped this issue would turn into :)
@vdumoulin Not sure about requiring maintainers. I'm scared it'll make people hesitant to contribute, because they could be signing up for a lifetime of bug fixing. I'd encourage it, but have a listed, working commit as the only real requirement.
-1 on maintainers for examples. +1 on being strict about reproducibility instructions (conda env spec, commit hashes of relevant packages).
First of all, nice that you guys rethought blocks-contrib; I agree that the concept was very vague and needed clarification. I have a few questions regarding the current split proposal:

- ... `blocks-extras` are allowed. I believe there is no intermediate answer here, only "yes" or "no". If the answer is yes, the difference between `blocks-extras` and `blocks` has to be clarified. Is it about perfection and reliability, like `theano.sandbox` and `pylearn2.sandbox`? Is it about generality versus being very specific to Blocks? I am not sure that beam search should go to extras, because it is a very essential part of all sequence generation experiments. It can be called core code, because without it Blocks is just useless for quite a broad range of applications. As opposed to live plotting: one can definitely use Blocks without it. Maybe this is roughly where the line is.
@dwf, I like your idea of subclassing `Printing`, but for this it has to be refactored a bit. A few simple modifications (in fact I have only one in mind) and all the rest can be done in `blocks.extras`. And I do not think this should wait, because it's isolated from the rest of Blocks and does not require a very advanced coder.
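To illustrate the kind of refactoring I mean (purely a sketch: the hook name and the way the log row is accessed are assumptions, not the current `Printing` implementation):

```python
# Purely illustrative -- the hook name and the log-row access are
# assumptions, not the current Printing implementation.
from blocks.extensions import SimpleExtension


class RefactoredPrinting(SimpleExtension):
    """Like Printing, but with the row formatting factored into a hook."""
    def _print_row(self, row):
        # Default behaviour: dump all records of the current log row.
        for name, value in sorted(row.items()):
            print('{}: {}'.format(name, value))

    def do(self, which_callback, *args):
        # Assumption: the current log row can be viewed as a dict of records.
        self._print_row(dict(self.main_loop.log.current_row))


# The bells and whistles could then live in blocks.extras, e.g.:
class CompactPrinting(RefactoredPrinting):
    """Only print a whitelist of records."""
    def __init__(self, record_names, **kwargs):
        self.record_names = record_names
        super(CompactPrinting, self).__init__(**kwargs)

    def _print_row(self, row):
        for name in self.record_names:
            if name in row:
                print('{}: {}'.format(name, row[name]))
```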
The `Model` discussion is indeed very convoluted, and it would be nice to hear your opinion, @dwf, as well as others'.
FYI, Theano people are thinking in the same direction: https://github.com/Theano/Theano/issues/2839
Original suggestion:

> After a certain development time, we go into feature freeze, and only work on cleaning up and stabilizing the `dev` branch. After that, the `dev` becomes the `master` branch (e.g. 0.2) and the cycle restarts.
I'd like to suggest that:
As a downstream developer, what is frustrating is that there is no consensus on what version of Blocks projects should be based on, because everyone is installing from `master` and `stable` is now so old that it is bordering on irrelevant. For a project this young and dynamic, more frequent releases will help grow the developer community by offering more stable jumping-off points for projects and will reduce the chances of incompatibility between projects.
@dribnet, I definitely agree with you that we need to release more often. The problem is that Blocks is currently developed by researchers only, who have to combine this with their main activity, namely doing research. For this reason I don't believe that we can currently manage a 6-week release cycle. 12 weeks sounds more realistic, but in fact over 12 weeks the codebase changes quite significantly.
Hopefully, by keeping the Blocks codebase small, we will reach the point where the interface stops changing so rapidly and releases correspond to individual interface changes, not to huge batches of them. For now, quite a lot of refactoring is needed, which I am happy to have started with #759.
Thanks @rizar, appreciate the response and certainly understand that stabilizing a release can take time. I'll suggest that a good in-between solution at this stage would be to do "minor checkpoint releases" once a month or so. Ideally (but optionally), for a few days no one would check in large new features to master, but bugfixes would be welcome. Then the master branch could be tagged with a dev checkpoint (e.g. 0.2.0-b1) before more speculative refactorings are merged in.
The issue with not having any recent tagged versions is that in practice each commit becomes its own up-for-grabs dependency, and switching between any two Blocks projects also entails re-installing that arbitrarily chosen Blocks and Fuel dependency. This is what is evolving as the de facto practice for both external and internal projects. Even completely arbitrary tagging of the repo once a month could help mitigate these dependency conflicts and hopefully better encourage Blocks adoption.
Actually, you convinced me. The situation where different projects point to different Blocks commits is a shame. And there is no point in having even a very stable branch when it is as outdated as ours.
I vaguely remember Bart telling me that there are some issues with pip when a version identifier contains suffixes. So I would rather stick to `major.minor.micro` as https://www.python.org/dev/peps/pep-0440/#public-version-identifiers suggests. Therefore we should choose between 0.0.2 and 0.1.0. I vaguely remember (though I could not find it this time) that the standard is to only use the micro segment for changes that affect the interface, which means the next release should be 0.1.0.
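For reference, this is how PEP 440 orders the candidates being discussed, assuming a setuptools version recent enough to implement PEP 440:

```python
# Quick illustration of PEP 440 ordering, using setuptools' parser
# (assumes a PEP 440-aware setuptools).
from pkg_resources import parse_version

assert parse_version('0.0.2') < parse_version('0.1.0')
# A beta checkpoint like the suggested 0.2.0-b1 normalizes to 0.2.0b1 and
# sorts before the final release:
assert parse_version('0.2.0-b1') == parse_version('0.2.0b1')
assert parse_version('0.2.0b1') < parse_version('0.2.0')
```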
@dwf, @vdumoulin, @dmitriy-serdyuk, what do you think about quickly making a release now?
Do you mean "only use micro segment for changes that do not affect the interface?"
On the Fuel side I'd really like to see #201 #202 #208 fixed before we cut a release. These are all pretty big bugs.
@dwf, exactly: changes that do not affect the interface. But I can't remember where I saw this, in fact.
Only https://github.com/mila-udem/fuel/issues/208 is left from your list, and it should be fixed soon by https://github.com/mila-udem/fuel/pull/223/files. Should we declare a development freeze before #223 is merged, so that we can release as soon as it is merged?
Then there's the fact that RTD doesn't build for Fuel right now, which I also think we should fix.
Other than that, @rizar you have my support.
I feel this has been kind of lost out of sight, so maybe we should have a look at what really still needs to be done before we can argue we have something that could be a 0.1 release. I'm also going to propose a release cycle as I suggested for Pylearn2.
So for 0.1 we have the non-Theano monitoring (#349), which should be almost done. The bug in #159 needs to be fixed, and `blocks-contrib` finally needs to be set up. There is the `NaN` functionality (i.e. using Theano's `assert` op to actually warn when `NaN` was encountered and avoided) talked about in #162, but maybe we can leave that for later? The rest seems to mostly be documentation, which we could arguably postpone till after 0.1. I would also like to finish up the new logging backends (#470/#476), and maybe we should spend some CCW workpower on cleaning up and refactoring the unit tests (many of them aren't really "unit" tests).

As soon as we release 0.1, I propose we switch to the development workflow I suggested for Pylearn2, similar to e.g. Debian's. Quick recap:
- `master` gets branched into a `master` and `dev` branch
- New features are merged into the `dev` branch
- After a certain development time, we go into feature freeze, and only work on cleaning up and stabilizing the `dev` branch. After that, the `dev` becomes the `master` branch (e.g. 0.2) and the cycle restarts.

Since people don't like their branches to break leading up to conference deadlines, and because this is the time that most people don't have a lot of time to write new features, I would suggest roughly following a schedule that looks like:
If we want to, this means we can finish up the last new features (e.g. there is some Fuel work to be done) throughout April, clean up in May and release Blocks 0.1 then?