kboyd / Roc

Everything ROC and Precision-Recall curves.
BSD 2-Clause "Simplified" License
23 stars 7 forks source link

Merge work on main program into master #22

Closed afbarnard closed 10 years ago

afbarnard commented 10 years ago

I think it is time to merge my work on the main program into master. @kboyd, what do you think?

I can merge master into dev/afbarnard first if that would help you see and inspect the differences. (I'm not really sure what best practices are in this regard, so I guess we can set a precedent.)

This will close #12.

kboyd commented 10 years ago

@afbarnard, yes, I think you should merge master into dev/afbarnard first. I'm not sure if this should be precedent, but with how much the makefile has changed in master, I think it would be good to be able to test things out in dev/afbarnard at least for this pull request.

afbarnard commented 10 years ago

What do you think now, @kboyd?

I also think that merging master into a heavily-developed branch and then merging that branch into master is a good way to combine the lineages.

kboyd commented 10 years ago

A few thoughts, but in general it looks great.

afbarnard commented 10 years ago
  1. Agreed.
  2. The current (and intended at the time) behavior is --report and --to must be given in pairs. And, yes, I was using --to - when trying it out. I agree that it would be more convenient to be able to leave off --to if intending standard out; I am just not sure how best to do that. Your proposal of none or paired --tos sounds good. We could also pair as many as are available and then have the last --to carry on for the rest. We could also have all reports go to the next --to with an implicit --to - at the end if there wasn't a last --to. That seems the most flexible while still being intuitive, but it also assigns semantics to the order of the command line arguments which can be confusing (but I guess the order already matters in certain ways, specifically the reports are produced in order).
  3. I'm not entirely sure what you mean. You get what you ask for and nothing more. In any case, I think a warning would be distracting because --report rocarea --to - --report prarea --to - has legitimate use cases, i.e. I know I can parse the ROC area from the first line and the PR area from the second line. I realize it's a bit "use at your own risk" behavior, but I think it is well-defined.
  4. At first I didn't get what you meant, but I think you mean that the help should be printed on standard output instead of printed on standard error. Is that correct? If so, and if that seems to be common Gnu behavior, I will go ahead and copy the Gnu behavior.
afbarnard commented 10 years ago

With regard to what should go to standard output and what should go to standard error, I originally put --about and --license on standard error so that they would function as log-style messages, i.e. reports could be cleanly captured on standard output even when --about or --license were given on the command line. I could switch the output of those commands to standard output but then it seems like the program should immediately terminate after printing that information so that standard output doesn't contain extraneous information in a regular run. What do you think, @kboyd?

kboyd commented 10 years ago

I think --about and --license would behave the same way as --help: printing to stdout and immediately terminating. To me it doesn't make much sense to query the license and process data with the same invocation of the program.

kboyd commented 10 years ago

I'm not entirely sure what you mean. You get what you ask for and nothing more. In any case, I think a warning would be distracting because --report rocarea --to - --report prarea --to - has legitimate use cases, i.e. I know I can parse the ROC area from the first line and the PR area from the second line. I realize it's a bit "use at your own risk" behavior, but I think it is well-defined.

My concern is that

java ...Main --scores-labels in.csv --report rocarea --to - --report prarea --to - > out.txt

and

java ...Main --scores-labels in.csv --report rocarea --to out.txt --report prarea --to out.txt 

do not result in the same contents of out.txt. In the 2nd case, out.txt only contains the PR area.

kboyd commented 10 years ago

(2) I think paired --tos or none (and outputing to stdout) is the way to go as it seems the simplest and I don't think it's good to encourage putting multiple, unlabeled reports into a single file.

(4)Yes, I meant putting --help to stdout instead of stderr.

Edit: Markdown is zealous about numbered lists, so trying parens to keep my numbers correct with the point they address.

afbarnard commented 10 years ago

4.(Informational commands and stdout vs. stderr). Grep handles things the way you describe, i.e. treating the options as a single command and exiting thereafter despite other stuff on the command line. So I guess that is fine. We aren't logging anything anyway so having --about go to a log doesn't make much sense.

5.(Informational commands). Should --license be dropped in favor of incorporating more information into --about? If so, then what about --version? Should it produce a machine-parseable version string?

2.a.(Multiple tos bug). The difference between --to - ... --to - > out and --to out ... --to out should probably be considered a bug. I can reimplement things to only open files once.

2.b.(Multiple tos same file). As I said before, I don't see any reason to prevent reusing a file for multiple reports. It is well-defined and easy to understand what is happending even if the result might be confusing. In particular, I had in mind that one could also choose the report format and depending on the format it makes perfect sense even in the result to have multiple reports go to the same file. Consider the following reports of 'prArea' and 'rocArea' to the same file in config format and in YAML format. (Output is in "raw" format by default right now.)

Raw:

0.17001299
0.73259281

Config:

prArea: 0.17001299
rocArea: 0.73259281

YAML:

%YAML 1.1
---
prArea: 0.17001299
rocArea: 0.73259281
...

2.c.(Multiple tos CLI syntax). After a little mental digestion, I sort of like the in-order, sort-it-all-out-by-position syntax. Consider --report rocArea --report prArea --to aucs.yaml --report rocPts --format gnuplotscript --to - | gnuplot which saves the areas for later reference and makes a plot. Such a syntax is very flexible, avoids repetition, and should be intuitive. (It just doesn't satisfy CLI purists that believe options are inherently unordered.) I also just thought of this variant: --report rocArea,prArea --to aucs.yaml --report rocPts [--to -]. Eric thought that maybe it should be enforced that --to follows at least one --report.

Hmm. Seems things aren't ready for release after all.

kboyd commented 10 years ago

5.(Information commands). I like putting license info into --about and having --version be very simple like Roc v0.1.0. So we would have three informational commands, --help, --about, and --version, that print to standard output and immediately exit.

2.b.(Multiple tos same file). I agree now that reusing a file for multiple reports (provided files are only opened once as mentioned in 2.a) is reasonable and no warning or error is necessary.

2.c.(Multiple tos CLI syntax). I like the simplicity of matched --report and --to or no --tos, and allowing comma separated report names (e.g., --report rocArea,prArea) would still make it convenient to put multiple reports in a single output file. I would also be ok with your 'in-order, sort-it-all-out-by-position' idea, provided we can come up with a fairly succinct explanation of how it works. So something like "--report and --to options are processed in order, with reports being collected in a buffer until a --to is encountered, at which point the current report buffer is dumped to that file, and at the end of the command line arguments, if any buffered reports remain, they are dumped to standard out." I don't think we can (easily) avoid the CLI options needing to be order-dependent unless we restrict every invocation to have a single output file ... though that would certainly make incorporation of Roc into a build system more convenient. At the end of the day, the CLI behavior should be easily understood, both in the "how do I do X" and "what does this set of arguments do" sense. And we should avoid going even a little down the road to 'tikz', where the syntax and semantics are so overwhelming that the only recourse is to try to grok some set of related examples in order to figure out how to do some particular task (not to say we're anywhere near that, just that the more complex and difficult to grok the CLI is, the less likely anyone is to use the software).

afbarnard commented 10 years ago

@kboyd, I think I've fixed/addressed everything except 2.a. which would be good to fix but probably can wait.

kboyd commented 10 years ago

Sorry for the delay. I think you should go ahead and merge this to master.

afbarnard commented 10 years ago

2.a. is now issue #30 so everything should be ready for this merge.