Outputs revamp - Githubissues

hdoupe commented 6 years ago

Implements the prototype in #916. A Tax-Calculator PR will be opened and worked on in tandem with this PR in order to prepare Tax-Calculator for the new outputs standards.

Goals:

create simple HTML page to render the tables created by Tax-Calculator
simplify compute.py logic corresponding to @lucassz's work in OpenSourcePolicyCenter/pb_deploy#1 *
move new outputs logic and worker-node facing code to a new app core **
implement simpler database schema
keep backwards compatibility (can still view old results)

* Once this is merged, there will only be one AWS endpoint per worker node cluster to query instead of one for each worker node. Thus, there is no need to implement our own round robin logic--this is done by celery. When a job is posted, it is placed on celery's task queue (which is run via redis) and celery pulls those jobs in a LIFO manner. I think we should implement this simplifying compute.py logic and the ability to display the new tables simultaneously because these sections of the code base are tightly coupled. This means that changing one requires significant changes to the other. So, changing them at the same time will be a lot of work but not much more work than changing them sequentially. I'm happy to discuss further if others have differing opinions. ** To accomplish the goal of turning PolicyBrain into a platform (#906), more code will need to be shared between apps. Moving code that can be used by multiple apps into this new core app is a step in that direction. Also, see #816.

hdoupe commented 6 years ago

@lucassz I left core/models.py for you to fill out. One way to do this is to open a PR on my branch outupts. An alternative is to create an ouptuts-dev branch that we can work off until we are ready to push the new outputs into production. I'm partial to the latter approach as this leaves the master branch in a production ready state.

hdoupe commented 6 years ago

@lucassz do we still need the taxbrain/compute.py? It looks like Compute is now in the core app. Next, we could move all of the mock compute classes to test_assets or in the taxbrain tests directory.

lucassz commented 6 years ago

@hdoupe I only temporarily kept normalize and taxbrain/compute.py for the purpose of not breaking things in B-Tax for now. Otherwise, do you know of an easy way to deactivate the B-Tax app, so we can get rid of those things and still have the webserver launch?

I will also remove the not_avail_context part.

I agree with moving the URL routing and error storage into CoreRun, I will do that next (along with the outputs).

hdoupe commented 6 years ago

@hdoupe I only temporarily kept normalize and taxbrain/compute.py for the purpose of not breaking things in B-Tax for now. Otherwise, do you know of an easy way to deactivate the B-Tax app, so we can get rid of those things and still have the webserver launch?

Ok, I see that makes sense. Nope, I'm not sure how to do that.

I will also remove the not_avail_context part.

I agree with moving the URL routing and error storage into CoreRun, I will do that next (along with the outputs).

Great, thanks.

lucassz commented 6 years ago

@hdoupe and I have settled on an API for the core app which will be completely abstract and composed of two models to be subclassed by individual applications, CoreRun and CoreInputs, which are linked by a OneToOneField on the former (this needs to be set up in sub-classes). The end game for this would be to eventually have a single outputs module that is plugged into each webapp's URL router with the proper models as parameters. Ideally, it would intelligently query the right URL on the distributed server according to the webapp's name, such as http://ospcapi.org/<webapp>/get_results whereas we currently use /dropq_get_results for TaxBrain.

One design challenge we've been dealing with is how to structure things so that tables can be returned from the distributed servers and displayed in a way that's generalizable. The approach I would currently favor is for upstream software to simply return a list of tables, so the result from a dropq_get_results query could be something like:

{'outputs': [{'title': 'Distribution table', 'renderable': '…', 'downloadable': '…'}, …], 'upstream_vers': '1.2', …}

I've been working on displaying these tables in a straightforward way, to hopefully reduce the dependency on complex BackboneJS-based logic currently in taxbrain-tablebuilder.js. I'm still planning on maintaining the current use of the DataTables front-end package, although there doesn't seem to be any strict need for it in the simplest iteration of this PR.

While the upstream packages would provide HTML tables with content, there would most likely be a function on the PolicyBrain side that would inject the supplied title as a caption to the table.

hdoupe commented 6 years ago

@lucassz thanks for summarizing our thought process so far.

I'm going to outline the schema that I have in mind to catch people up and to make sure that we are on the same page:

CoreRun

renderable_outputs - html tables (or other content to be embedded)
- download_only_outputs - raw data like CSV files that is meant to be downloaded
uuid - unique hash for URL lookups
error_text - stacktrace if error is thrown by upstream package
user - user who submits run
exp_comp_datetime - time for expected completion; used for status page with loading bar
job_ids - list of ID's corresponding to the tasks submitted for this model run
upstream_vers - version of upstream package used for simulation
webapp_vers - version of webapp used to generate the results

CoreInputs

raw_input_field - data as submitted by inputs page
input_fields - data that has been parsed to have the correct data types
json_text - JSON object with data as it has been reformatted to be submitted to upstream package

By OneToOneField, we mean that for each row in both tables there is exactly one corresponding row in the other table.

Each app (e.g. TaxBrin, CCC, OG-USA App) implements both of these classes, adds fields as needed (ideally this is avoided) and adds data processing methods (avoid these too). Some implications for this are:

there will be an input table and output table for each app
there is an option to customize the schema and data processing logic for apps if it is necessary

@lucassz is this what you are thinking?

hdoupe commented 6 years ago

Ideally, it would intelligently query the right URL on the distributed server according to the webapp's name, such as http://ospcapi.org/<webapp>/get_results whereas we currently use /dropq_get_results for TaxBrain.

This sounds good to me. One modification that could be made is to add a v1 somewhere like http://ospcapi.org/v1/<webapp>/get_results. That could make it easier to make breaking changes to the inputs or outputs schema.

Does that make sense?

hdoupe commented 6 years ago

@martinholmer @jdebacker you may be interested in this:

One design challenge we've been dealing with is how to structure things so that tables can be returned from the distributed servers and displayed in a way that's generalizable. The approach I would currently favor is for upstream software to simply return a list of tables, so the result from a dropq_get_results query could be something like: {'outputs': [{'title': 'Distribution table', 'renderable': '…', 'downloadable': '…'}, …], 'upstream_vers': '1.2', …}

title - title of output item
renderable - if present, should be some type of data that can be embedded, e.g. an HTML table or bokeh plot (like the CCC bubble plot)
- this could be empty if the output is simply a data dump, e.g. CSV with MTR's for the CPS file
downloadable - raw output corresponding to renderable data, e.g. a CSV file corresponding to an HTML table
- can be empty (if the renderable item is just meant to be rendered and not downloaded)
- can be present while renderable is empty (see MTR's for CPS file example above)

As upstream package maintainers, do you think an outputs structure like this makes sense? Do you have any ideas for how to improve this structure? Can you think of any disadvantages to using this structure?

@lucassz is this in line with what you are thinking?

lucassz commented 6 years ago

I'm going to outline the schema that I have in mind to catch people up and to make sure that we are on the same page:

@hdoupe What you wrote mostly matches up with my understanding of it. Some notes:

I am now thinking outputs would be a single JSON field within CoreRun, with renderable, download_only and title top-level attributes.
So far I have made CoreInputs totally abstract (with no fields at all), but I like your idea of including raw_input_fields, input_fields and json_text as a base. This would allow for the establishment of a convention that a well-implemented app shouldn’t need to extend the model at all, since those 3 fields could be enough to implement e.g. TaxBrain (although this is far from the case right now).
As you mentioned, there would be a one-to-one correspondence between input rows and output/run rows; in theory, it only means that there is at most one input for each output and vice-versa, which allows us to create just inputs first, for instance. We can then easily get somerun.inputs or someinput.corerun

One modification that could be made is to add a v1 somewhere like http://ospcapi.org/v1/<webapp>/get_results.

Good call. As we had previously discussed, we can use the Amazon ELB set up in OpenSourcePolicyCenter/pb_deploy#1 to direct traffic to different nodes based on the desired version if needed.

As upstream package maintainers, do you think an outputs structure like this makes sense? […] @lucassz is this in line with what you are thinking?

Yes, that sums it up very well.

hdoupe commented 6 years ago

@lucassz noted:

I am now thinking outputs would be a single JSON field within CoreRun, with renderable, download_only and title top-level attributes.

Sounds good to me

So far I have made CoreInputs totally abstract (with no fields at all), but I like your idea of including raw_input_fields, input_fields and json_text as a base. This would allow for the establishment of a convention that a well-implemented app shouldn’t need to extend the model at all, since those 3 fields could be enough to implement e.g. TaxBrain (although this is far from the case right now).

Great, that's a good way to think about it.

As you mentioned, there would be a one-to-one correspondence between input rows and output/run rows; in theory, it only means that there is at most one input for each output and vice-versa, which allows us to create just inputs first, for instance. We can then easily get somerun.inputs or someinput.corerun

Ah, I see. Thanks for adding a more nuanced take.

hdoupe commented 6 years ago

@lucassz when you fee like the table schema is stable, can you check the migration files in?

lucassz commented 6 years ago

@martinholmer @jdebacker

A few updates: the state of the new outputs feature has now evolved to where PB is able to display all the relevant tables, both by year and aggregate, and allow for selecting relevant tables and columns in a way equivalent to the old implementation.

Upstream computation Particularly, the aggregation of the data from different years has been moved to the upstream project. The way this is done is that a function such as run_nth_year_taxcalc_model returns a tuple of two when it is called per year: the first is the new output JSON containing tables for rendering and downloading, and the second is some raw data that it wants to be passed back for aggregation.

In the case of our prototype for Tax-Calculator, the year-by-year hook returns as the second element a dictionary mapping the desired aggregate table to a list of data that will be needed for it. Concretely, this means the second element of the result of the year-by-year hook function is something like {'aggr_d': [(0, '<DataFrame JSON for year 0>')], 'aggr_1': …, …} Once all the year-by-year jobs have been completed, the worker calls run_taxcalc_years_aggregation with that same data structure (combining the lists key-by-key), and the upstream projects generates the renderable and downlodable outputs.

The interface for the downloadable outputs, both year-by-year and aggregate, has changed to being a list of files per output, thus allowing for files in different formats corresponding to a single HTML table. The overall data structure returned by run_nth_year_taxcalc_model (in the first element of the tuple) is thus {'outputs': [{'downloadable': [{'filename': 'xx.csv', 'text': '<raw CSV>'}, **…**], …}, …]}. In difference with what Hank had previously outlined, notice that we also add the specification of a filename by the upstream project.

Downloading The outputs are no longer downloaded table-by-table — previously, they were generated from the table HTML via JavaScript — but rather as a single ZIP archive containing all the output files, which is generated server-side on each request to /<project>/<id>/download, which is linked to by the output display page as well.

Computation job management Another difference is that, thanks to some changes in OpenSourcePolicyCenter/pb_deploy#1, it is no longer necessary or useful for the computation jobs to be split among a number of destination servers, and it is thus also not useful for PB to keep track of more than one job ID. As such, there is now just one job ID per model run. On the Celery side, this is implemented as a chord which has the aggregation feature described above as a callback.

HTML output display This is probably one of the more complex recent changes, and it has to do with the way that tables are selected on the outputs page. As far as TaxBrain goes, the previous implementation essentially hardcoded a way to filter the available tables as to select one at a time, along several user-friendly dimensions. The new approach is "tags," which is a generalization of this approach with table categories that are not tied to the table's title (as they were in the previous implementation) and are fully customizable, for both aggregate and yearly tables, by upstream projects. A "tag" is essentially a toggle between different options, and a key feature is that these options can have "children" tags (sets of options). This feature was key in order to maintain the functionality in TaxBrain, which allows for choosing either distribution or difference yearly tables, and shows different options based on the first selection. For illustration, here is the interface and the way it's used in TaxBrain.

The configuration by individual applications for these is provisionally in the output model, but will probably be in a class-based view soon. The point is that these settings are used to build up the table selection UI, with some JavaScript code then displaying the tables accordingly. Labeling the tables is done by the upstream project and is exceedingly simple, with each output having a 'tags' attribute that is a dictionary mapping tags to the tag option that corresponds to the table. The relevant client-side code is relatively simple jQuery, and the entire solution is much simpler and yet more flexible and reusable than the previous 500+-line taxbrain-tablebuilder.js.

Please let me know if there's anything else I should clarify.

martinholmer commented 6 years ago

@lucassz said:

The way this is done is that a function such as run_nth_year_taxcalc_model returns a tuple of two when it is called per year: the first is the new output JSON containing tables for rendering and downloading, and the second is some raw data that it wants to be passed back for aggregation.

Never return a tuple! @hdoupe and I spent hours moving away from this error-prone approach. Always return a dictionary.

lucassz commented 6 years ago

@martinholmer Using a dictionary for a predetermined data structure isn't quite preferred either, but I can move to that if you think it would be better. I guess options are limited when it needs to be serialized to JSON.

martinholmer commented 6 years ago

@lucassz said:

Using a dictionary for a predetermined data structure isn't quite preferred either

What is "preferred" and why is it better than a dictionary?

lucassz commented 6 years ago

What is "preferred" and why is it better than a dictionary?

What might be preferred would be to have separate functions to return the individual tables for each year and the tables to be aggregated for each year. Would that solution be good with you?

hdoupe commented 6 years ago

I'm OK with the current solution but with @martinholmer's input that tuples can be confusing once they get longer. The background for this is that we were passing around a tuple of like 15 tables and it became very difficult to work with.

I think a dictionary is just fine for now. A dictionary is easier to work with and modify. Further, it gives us the flexibility to order the outputs by multiple parameters. Its advantage over namedtuples is that it can easily be converted to JSON and back. Consider this somewhat contrived case:

Suppose we allow the user to select a set of reform parameters and run those over the year parameter space and a data_source parameter space. Then, the intermediate results could be stored as:

{'aggr_d': [
     {'year': 0, 'data_source': 'puf', 'data': '<DataFrame JSON for year 0>'}, ], 
'aggr_1': …, …}

@lucassz @martinholmer does this seem sensible?

hdoupe commented 6 years ago

@lucassz said:

What might be preferred would be to have separate functions to return the individual tables for each year and the tables to be aggregated for each year. Would that solution work for you?

I'm not sure if I follow. What do you mean by different functions?

lucassz commented 6 years ago

@hdoupe Yes, I agree with using a dictionary for the internal structure of the list of DFs. I think the less clear-cut question is what to return from the function as a whole, which is what @martinholmer thought should be returned as a tuple. What you described is homologous to the second element currently returned by run_nth_year_taxcalc_model (which stores DFs for future processing into HTML/CSV/etc. tables), whereas the first element is the already processed non-aggregate tables. The question is whether to return these as processed_tables, dfs_to_aggregate or {'processed_tables': processed_tables, 'dfs_to_aggregate': dfs_to_aggregate}. I will switch over the internal structure of the PDFs-to-aggregate object to what you mentioned, but that question still remains.

martinholmer commented 6 years ago

@lucassz said:

Yes, I agree with using a dictionary for the internal structure of the list of DFs. I think the less clear-cut question is what to return from the function as a whole, which is what @martinholmer thought should be returned as a tuple. What you described is homologous to the second element currently returned by run_nth_year_taxcalc_model (which stores DFs for future processing into HTML/CSV/etc. tables), whereas the first element is the already processed non-aggregate tables. The question is whether to return these as processed_tables, dfs_to_aggregate or {'processed_tables': processed_tables, 'dfs_to_aggregate': dfs_to_aggregate}. I will switch over the internal structure of the PDFs-to-aggregate object to what you mentioned, but that question still remains.

I've never advocated returning a tuple from a function (as @hdoupe explained). Dictionaries are better when there are more than a couple of returned objects.

The run_nth_year_taxcalc_model function is T.J.'s creation, not mine. The original design called for returning a 15-element tupl. All I've done is return the 15 elements in a dictionary so that they are named and programmers don't have to have memorized what table is stored in res_tuple[13].

I'm not going to spend any time redesigning the logic of the run_nth_year_taxcalc_model function. And I would strongly advise pull request #2043 not to attempt such a redesign.

lucassz commented 6 years ago

@martinholmer said:

I've never advocated returning a tuple from a function (as @hdoupe explained). Dictionaries are better when there are more than a couple of returned objects. […] All I've done is return the 15 elements in a dictionary so that they are named and programmers don't have to have memorized what table is stored in res_tuple[13].

I'm not sure if we are referring to the same thing -- I thought we were discussing whether it was appropriate to return a 2-tuple (and always a 2-tuple) from run_nth_year_taxcalc_model or its homolog in another upstream project, which seems much less prone to such issues as a 15-tuple understandably would. Are you referring to something else, or are you saying you find this issue applies categorically even to 2-tuples?

hdoupe commented 6 years ago

@lucassz commented/asked:

The question is whether to return these as processed_tables, dfs_to_aggregate or {'processed_tables': processed_tables, 'dfs_to_aggregate': dfs_to_aggregate}. I will switch over the internal structure of the PDFs-to-aggregate object to what you mentioned, but that question still remains.

My opinion is that run_nth_year_taxcalc_model should return serialized dataframes and not the processed_tables. That way there would be some format/aggregation function that is called with the results from the finished jobs. That cuts out some of the logic on the PolicyBrain side of figuring out which outputs are processed or not. It would reduce the data structure complexity too. The logic would be something like:

submit N jobs and store results as metadata + serialized dataframes (or other types of data formats)
upon completion, deserialize data and call upstream_project.aggregate_and_format(data)

@lucassz does this seem sensible?

lucassz commented 6 years ago

@hdoupe Thank you, that seems like a great idea to me. There will still be a distinction made between aggregate and yearly tables for display purposes, but this way they can be treated the same way in the logic. I will implement this.

hdoupe commented 6 years ago

Sounds great. Thanks @lucassz

hdoupe commented 6 years ago

@lucassz I'm not sure about including reform_inputs_file, assumption_inputs_file, reform_parameters, and assumption_parameters in the core model. I would rather it be more generic. It seems like there are a couple ways to do this:

store in inputs_file and parameters fields
- which are JSON objects with keys corresponding to reform and assumptions
have generic fields, inputs_file_1, inputs_file_2, ..., inputs_file_5, parameters_1, parameters_2, ..., parameters_5
- use getter/setter methods in the implemented model classes to interact with these fields in a more readable way

What do you think?

lucassz commented 6 years ago

@hdoupe I agree, my preference was not to have the reform/assumption distinction in the core model, nor have more than 2 fields (inputs_file and parameters). I like the first approach — making them both single JSON objects — I was just trying to change one thing at a time in the TaxBrain logic. A good next step would be for me to do the integration of the two in TB and in the core model.

hdoupe commented 6 years ago

@lucassz that sounds good to me.

lucassz commented 6 years ago

After discussing with @hdoupe, operating under the assumption that backward compatibility internal to the new outputs version with the old, un-migrated format would not be needed, we have removed explicit handling for unrenderable/non-conforming data related to not_avail.html. We also discussed that the most comprehensive way to handle such issues, if we were to add this functionality later, would be to add Django middleware to catch exceptions in the BaseHandler's rendering of template responses.

lucassz commented 6 years ago

@hdoupe said:

I'm not sure about including reform_inputs_file, assumption_inputs_file, reform_parameters, and assumption_parameters in the core model. I would rather it be more generic

I just added a commit consolidating reform and assumption inputs as well as parameters into a single JSON field, so that such internal implementation details can be handled by upstream packages while keeping the PolicyBrain-side structure general.

hdoupe commented 6 years ago

The most recent commit moves the Tax-Calculator behavior parameters onto the static inputs page. This was a goal identified in issue #906 (1a-Model Inputs).

It turns out, the easiest way to integrate the outputs changes that have been done in pr #921 with the behavior app is to simply include the behavior parameters in the static page. Most of the parameter processing code in the webapp is shared by the static and dynamic-behavior apps, the parameters are submitted to the same Tax-Calculator endpoint, and the same results structure is returned.

There are ways that the implementation in the referenced commit could be improved.

The behavior parameters are identified by their prefix '_BE_'. A better solution would have the inputs submitted from the front-end as {'policy': # raw static input data here, 'behavior': # raw behavioral input data here}.
The inputs page could be separated into two main sections, "Policy" and "Behavior" (or similar), and within those sections, there would be the subsections that are displayed now.

The implementation of both of these items would be needed to add more sections such as "Elasticity", "Period", or "Data".

@lucassz and I are working to push these changes to the test app today so that others can review the vast improvements @lucassz has made to the webapp this summer.

hdoupe commented 6 years ago

Given the complexity of these changes, the merge and deployment will occur in stages. Below is a tentative outline of what each stage should be:

This PR should be merged once tests pass and with no assumptions of backwards compatibility
A pre-release should be issued and deployed to the test app for review and to gather feedback. It should use the Tax-Calculator branch in open-source-economics/Tax-Calculator#2037. It should also use the AWS infrastructure from OpenSourcePolicyCenter/pb_deploy#1 (which should then be merged and a pre-release should be issued)
Incorporate feedback and do another pre-release, repeat (2)
open-source-economics/Tax-Calculator#2037 needs to be reviewed more thoroughly and tests should be added
backwards compatibility issues need to be addressed: a. how do we deal with prior behavioral and static simulations b. how do we display the outputs from prior runs (a serious effort should be made to display the outputs from all prior runs, including those that are not available right now)

Once these steps are completed, a full release should be issued.

@lucassz @MattHJensen @martinholmer

hdoupe commented 6 years ago

All tests are passing in PR #921. @lucassz thank you for your hard work in developing and implementing the new approach to displaying outputs. @martinholmer thank you for your feedback as this feature was developed. The changes made in this PR open up many possibilities for how outputs can be rendered in the future. Further, many patterns developed here such as using a core app and Class-Based-Views will be very beneficial for development in the future.

All of the TaxBrain apps (static, behavioral, and macro-elasticity) are now running with the new outputs implementation. Once it has been polished up, there should not be any noticeable difference in how the outputs look. This fall, I plan to transition the Cost-of-Capital calculator to this new approach.

A lot of work is still required to clean up after this major change. There needs to be some of editorial work to indicate that the behavioral app parameters are now on the main TaxBrain page and that the macro elasticity app is available if no behavioral parameters are specified on the main taxbrain page. A comprehensive solution will be applied to this problem once the outputs work started in this PR is complete. I don't anticipate finishing up this work taking more than a week or two.

ospc-org / ospc.org

Outputs revamp #921