natarajan-chidambaram / RABBIT

Apache License 2.0
2 stars 4 forks source link

Output results progressively #30

Open AlexandreDecan opened 5 months ago

AlexandreDecan commented 5 months ago

Hello,

Related to #29, the current behaviour of RABBIT is to retrieve all data (for all accounts passed as input), to process them all and then to display the results of all of them, at once. What about processing accounts "one by one" so that we don't have to wait for all the accounts to be processed to get the first results?

natarajan-chidambaram commented 5 months ago

This is already part of RABBIT as an input option https://github.com/natarajan-chidambaram/RABBIT#:~:text=%2D%2Dincremental, which the user can specify if they want the results as soon as it is predicted.

AlexandreDecan commented 5 months ago

Indeed, sorry ^^

AlexandreDecan commented 5 months ago

Are you sure that it's working correctly for a text-based output? Looking at the code, it seems that the "full" (up to the current contributor) dataframe is displayed within the loop... I guess that headers are also displayed each time a contributor is processed. Also, it does not really make sense to have this "incremental" mode for file-based output (you can't read a file if there are repeated write access to it, so what's the value of doing this?)

Please reopen if I'm right :-p

natarajan-chidambaram commented 5 months ago

(i) Yes, headers are displayed each time a contributor is processed in --incremental mode. Further, all the contributors that have been processed till the last processed contributor is displayed on the screen. (ii) In --incremental model, the file will be updated with the predictions and the user can already start to read and process it. For updated results they can just read the file again. Python did not complain that the file cannot be updated as it is open in another application. Please correct me if this is not what you meant.

I am reopening the issue as more discussion is required.

AlexandreDecan commented 5 months ago

(1) This makes is barely usable by users. Why not displaying the headers only once, then each time an account is processed, add a line to the output? (2) Incremental mode shouldn't apply for files. It does not really make sense. Moreover, you can quite easily turn into issues since reading the file implies a R-lock (from user's code) and writing to it (from rabbit) implies a RW-lock.

Another point is that, in the current version of RABBIT, it seems we cannot output json or csv on STDOUT (which can be really convenient when calling rabbit from code). I think we should give more control on the output to the users, to fit their needs without restricting the use-cases. One possibility would be to have three "groups" of options: one to control where the results are provided (STDOUT or a file), one to control how rabbit processes the output (incremental mode or "full output" mode), one to control the format of the output (text, csv, json, ...). These options could be --output (default to STDOUT, otherwise is a filepath), --incremental (default to true, since it gives a more "reactive" impression, but in that case perhaps we should revert the option and probably have a --process-all-like option), and --format defaulting to "text", allowing for "json" or "csv". I also suggest to add another "format" which is "jsonl" for "JSON line" (see https://jsonlines.org/) where each line is a json file (allowing to process each line separately, without having to wait for the process to wait to get a valid json file).

To make it even more user-friendly, without adding much to the complexity, we can also provide a --no-headers option (only applies for text and csv).

Basically, this implies some rewriting of the code (not much, that said), and could be very beneficial since it is likely we can drop "pandas" as a dependency if we proceed that way!

AlexandreDecan commented 5 months ago

Note that I do not think we need --output since it is easy to redirect the output of a CLI tool to a file (but I know that Tom prefers to have this option though). Also, by implementing the above suggested changes, it makes #29 easy to address: we can set that --incremental is the default (and only) behaviour (it will lower memory footprint, and there's no disadvantage of having it compared to the current behaviour!) and just add a --no-wait option (name should be discussed) that simply changes the loop condition (if not provided, and quota's reached, the loop waits, if provided and quota's reached, the loop ends).