lisa-lab / pylearn2

Warning: This project does not have any current developer. See bellow.
BSD 3-Clause "New" or "Revised" License
2.75k stars 1.09k forks source link

standardize on ML-specific code conventions #13

Open dwf opened 13 years ago

dwf commented 13 years ago

The "coding style" document produced out of the committee stuff did a pretty good job of things (save for the brewing disagreement over module renaming) but in order that we don't end up with think we need to discuss and standardize on things like

There's a fair amount of disagreement about a number of these issues, but I wanted to assemble therm in one place, along with comments for/against. I've tried to keep this post neutral in town and not colour it with my own opinion, but please add comments both expressing your view on a certain issue as well as correcting anything you feel I've unfairly represented.

Method/function arguments

odelalleau commented 13 years ago

Having coding conventions that everyone follows is a gain of time in the long term, due to making it easier to understand and update the code base.

dwf commented 13 years ago

I think it's a bit fascist to make style rules beyond PEP8 / Google Python Style. Please remember that we are all busy scientists. Time spent arguing about underscores (or should we actually make these rules, time enforcing adherence of underscore guidelines) is time not spent doing research or improving our tools.

That entirely depends on our reasons for putting together a library. If the point is to be one or two people's personal playground, then yes, conventions be damned.

If the point is for the library to grow into a well-documented library of readable code that

then yes, all this stuff matters, and I'm annoyed that you try to trivialize it as "arguing about underscores". As far as I can tell, time spent on code quality is time spent improving our tools.

I don't care what anyone does in their own sandbox for their own experiments. They can program in obfuscated Fortran for all I care. I certainly have my share of code that I wouldn't be proud of sharing. If you're going to put it in a library that's intended to be written by more than a couple of people and have it be useful for more than a couple of people, then you should care that it's as accessible as possible at the source level and a "finished product".

NumPy made these mistakes and spent hundreds of man hours and a bunch of University of Florida grant money correcting them. SciPy made these mistakes and is still correcting them, subpackage by painful gruelling subpackage. Theano made these mistakes and is in the awkward situation of growing beyond the lab in its use with mountains of untested, often unreadable and thoroughly undocumented code. All three of these packages are and have been extremely useful to many people but were written by "busy scientists" who refused to see what they were doing as engineering, worth doing right the first time.

Good, library-worthy code is at least, in part, self documenting. That implies code style, that implies convention. That implies the use of intermediate variables so as not to cram too much on one line. That includes descriptive identifiers, or at least a clear convention that when we say X we are talking about a design matrix of dimension 2. That implies reviewed code, where reviews bring up everything from style to semantics to interface to documentation to testing. We haven't been using pull requests to review code for pylearn yet, but we should start, partly because it's important and leads to higher quality code, and partly because it will involve more people in the development process -- others will have to read code anyway in order to contribute in a sensible way, they might as well be reviewing in the process.

Many, many projects have made the decision to accumulate code and "features" at the expense of a high quality codebase, Theano included, and it makes things harder in the long run. I'm not saying that any one of my opinions on the above subjects is the right one but I do think they're all of some importance if we want pylearn2 to be anything more than a pet project full of partially complete, hard to read implementations that are hard to even find because one doesn't know what name to look for. @GaelVaroquaux has been a taskmaster about code style on scikits-learn, and it has paid off -- the code quality of the vast majority of the project is extremely high and contains many implementations I would happily point a colleague to as reference implementations of a particular algorithm, something to which pylearn2 should aspire.

jaberg commented 13 years ago

Interesting thread. Hits at the heart of the challenge of developing reusable code, and doing productive research. It's hard to balance and I think they are often conflicting goals. Policies and recommendations are great to have around, but I think judgement on a case by case basis is required.

I think that the following things go together: shared code, code review, good quality, and slow movement.

Sometimes that's good, but not every project should be like that.

If you're trying to do an experiment. You want to be fast, just hack something and try it. You need fast turnaround, so you have your own forks & projects for that. They are comprehensible only to you, and you don't intend to keep them long term.

Then, for whatever reason, you want someone else to use that code... then you have to explain, document, and the other person has to read, and document. I think in practice, that this second stage of code life is the first one where you want to start 'caring about underscores'. It's not a mistake not to have the style right in the first phase... that's normal. In the second stage you should expect to rewrite all your phase 1 code. Don't feel bad about it. It's a hell of a lot faster to rewrite and document code than it is to find an algorithm & design worth sharing in the first place. The result of the phase 2 rewrite should be the sort of thing to get pull-requested to pylearn trunk. No one wants to be stuck doing code review on someone's unintelligible phase 1 hacking, and probably no one will ever need to.

GaelVaroquaux commented 13 years ago

I think it's a bit fascist to make style rules beyond PEP8 / Google Python Style.

Nice, Godwin's law

http://en.wikipedia.org/wiki/Godwin%27s_law

Please remember that we are all busy scientists. Time spent arguing about underscores (or should we actually make these rules, time enforcing adherence of underscore guidelines) is time not spent doing research or improving our tools.

In my opinion, this is a flawed argument: I am a busy scientist, the time I spent trying to memorize where you have put underscores and where you have not is time I spend not doing research.

IMHO, the real reasons for strict coding rules is because:

Finally, the reason to choose to underscores is because it makes names more readable. This stroke me the day I was look at a function called mkufunc. I couldn't parse this name, until I realized that it was mk_ufunc, and then it started making sens.

Underscores are just an example. In the scikit, we have found out that the more conventions, the easier it is to have shared ownership of code.