Open dwf opened 13 years ago
Having coding conventions that everyone follows is a gain of time in the long term, due to making it easier to understand and update the code base.
I think it's a bit fascist to make style rules beyond PEP8 / Google Python Style. Please remember that we are all busy scientists. Time spent arguing about underscores (or should we actually make these rules, time enforcing adherence of underscore guidelines) is time not spent doing research or improving our tools.
That entirely depends on our reasons for putting together a library. If the point is to be one or two people's personal playground, then yes, conventions be damned.
If the point is for the library to grow into a well-documented library of readable code that
then yes, all this stuff matters, and I'm annoyed that you try to trivialize it as "arguing about underscores". As far as I can tell, time spent on code quality is time spent improving our tools.
I don't care what anyone does in their own sandbox for their own experiments. They can program in obfuscated Fortran for all I care. I certainly have my share of code that I wouldn't be proud of sharing. If you're going to put it in a library that's intended to be written by more than a couple of people and have it be useful for more than a couple of people, then you should care that it's as accessible as possible at the source level and a "finished product".
NumPy made these mistakes and spent hundreds of man hours and a bunch of University of Florida grant money correcting them. SciPy made these mistakes and is still correcting them, subpackage by painful gruelling subpackage. Theano made these mistakes and is in the awkward situation of growing beyond the lab in its use with mountains of untested, often unreadable and thoroughly undocumented code. All three of these packages are and have been extremely useful to many people but were written by "busy scientists" who refused to see what they were doing as engineering, worth doing right the first time.
Good, library-worthy code is at least, in part, self documenting. That implies code style, that implies convention. That implies the use of intermediate variables so as not to cram too much on one line. That includes descriptive identifiers, or at least a clear convention that when we say X
we are talking about a design matrix of dimension 2. That implies reviewed code, where reviews bring up everything from style to semantics to interface to documentation to testing. We haven't been using pull requests to review code for pylearn yet, but we should start, partly because it's important and leads to higher quality code, and partly because it will involve more people in the development process -- others will have to read code anyway in order to contribute in a sensible way, they might as well be reviewing in the process.
Many, many projects have made the decision to accumulate code and "features" at the expense of a high quality codebase, Theano included, and it makes things harder in the long run. I'm not saying that any one of my opinions on the above subjects is the right one but I do think they're all of some importance if we want pylearn2 to be anything more than a pet project full of partially complete, hard to read implementations that are hard to even find because one doesn't know what name to look for. @GaelVaroquaux has been a taskmaster about code style on scikits-learn, and it has paid off -- the code quality of the vast majority of the project is extremely high and contains many implementations I would happily point a colleague to as reference implementations of a particular algorithm, something to which pylearn2 should aspire.
Interesting thread. Hits at the heart of the challenge of developing reusable code, and doing productive research. It's hard to balance and I think they are often conflicting goals. Policies and recommendations are great to have around, but I think judgement on a case by case basis is required.
I think that the following things go together: shared code, code review, good quality, and slow movement.
Sometimes that's good, but not every project should be like that.
If you're trying to do an experiment. You want to be fast, just hack something and try it. You need fast turnaround, so you have your own forks & projects for that. They are comprehensible only to you, and you don't intend to keep them long term.
Then, for whatever reason, you want someone else to use that code... then you have to explain, document, and the other person has to read, and document. I think in practice, that this second stage of code life is the first one where you want to start 'caring about underscores'. It's not a mistake not to have the style right in the first phase... that's normal. In the second stage you should expect to rewrite all your phase 1 code. Don't feel bad about it. It's a hell of a lot faster to rewrite and document code than it is to find an algorithm & design worth sharing in the first place. The result of the phase 2 rewrite should be the sort of thing to get pull-requested to pylearn trunk. No one wants to be stuck doing code review on someone's unintelligible phase 1 hacking, and probably no one will ever need to.
I think it's a bit fascist to make style rules beyond PEP8 / Google Python Style.
Nice, Godwin's law
http://en.wikipedia.org/wiki/Godwin%27s_law
Please remember that we are all busy scientists. Time spent arguing about underscores (or should we actually make these rules, time enforcing adherence of underscore guidelines) is time not spent doing research or improving our tools.
In my opinion, this is a flawed argument: I am a busy scientist, the time I spent trying to memorize where you have put underscores and where you have not is time I spend not doing research.
IMHO, the real reasons for strict coding rules is because:
numpy.logaddexp
, but numpy.logical_and
. This gives an extra
cognitive load that gives no added value to the package, and makes it
harder for me to learn. And you should remember that the biggest
barrier to sharing code is that people hate learning other's
packages. That a source of a lot of time wasted and a lot of science
not done.Finally, the reason to choose to underscores is because it makes names
more readable. This stroke me the day I was look at a function called
mkufunc
. I couldn't parse this name, until I realized that it was
mk_ufunc
, and then it started making sens.
Underscores are just an example. In the scikit, we have found out that the more conventions, the easier it is to have shared ownership of code.
The "coding style" document produced out of the committee stuff did a pretty good job of things (save for the brewing disagreement over module renaming) but in order that we don't end up with think we need to discuss and standardize on things like
There's a fair amount of disagreement about a number of these issues, but I wanted to assemble therm in one place, along with comments for/against. I've tried to keep this post neutral in town and not colour it with my own opinion, but please add comments both expressing your view on a certain issue as well as correcting anything you feel I've unfairly represented.
Method/function arguments
X
,y
), something longer and more descriptive (inputs
,labels
), something else entirely?X
,y
etc.: makes complicated mathematical expressions easier to read, might correspond with usage in paper describing algorithminputs
,labels
, etc.: perhaps clearer to novices, complicated expressions could be broken up in terms of temporariesLearnable parameters
_params
attribute? e.g., scikits-learn uses a convention of suffixing all learnable parameter object members with a single underscore, e.g.self.weights_
. Of course pylearn2 parameters will be shared variables rather than numpy arrays, so perhaps this particular convention should be avoided to prevent confusion, but should we adopt something similar?Hyperparameters
sparsity_penalty
), by the name assigned to them in a particular paper (eta
,epsilon
), by some other convention? One complicating factor is thatlambda
is a reserved word in Python, so you can't name an object member lambda (just like you can't name it 'for' or 'else').Other variables
Class Names
Acronyms