Consistent naming of steps/commands

vineetbansal commented 5 months ago

At some point, we can remove all mentions of "inner_" in the command/step names, and go with the most logical name (it's probably still best to keep names like write_nn_tc so Michael can cross-reference his original implementation with the latest ones).

vineetbansal commented 3 months ago

@anushka255 , @skinnider - one of the problems I've had these past couple of months is not being able to retain in my head for any extended period of time what each step of the pipeline does at a high level. Part of this is because the names of the steps don't make it very sticky (plus the inconsistent naming like inner-* or inclusion of implementation details like RNN..) and might be a barrier for future contributors working on this.

I'm leaning towards single verbs, but also feel like the name should intuitively tell us the intention of the step including the paths that should go in and come out of it. So we can use compound words if it helps clear the matter. (I'm not particularly thrilled with collect/combine, for example). Here's my first stab at it. Can you let me know your thoughts?

dag

Note: the name of the step is on the first line, with the wildcard values, if any, on following lines. I'm using 2 instances of the steps wherever I can (e.g. train seeds/sample seeds/min freq) to make the branching representative of what might happen. I'm also not including the evaluation steps here (which we can handle a bit later).

skinnider commented 3 months ago

This makes a lot of sense to me. A couple suggestions:

I think a longer name for the "generate" step would help with clarity (perhaps "generate-training-sets" or "generate-splits"?)
for "collect/combine", maybe clearer could be "merge-across-seeds" vs. "merge-across-folds"?

skinniderlab / CLM

Consistent naming of steps/commands #121