Closed nicholas-denis closed 2 months ago
the old code used two dictionaries because the way the old code worked was that it would parse the primary_metrics dictionary then create the graphs based off that. This was done so that I didn't have to label my outputs as to plotted or not. With the new code, this should no longer be necessary.
Also, as a suggestion in this case, should there just be a new plotting.py file every time a new experiment is run. Plots should vary pretty heavily from experiment to experiment anyways, and it shouldn't be too hard to write down new plotting functions, so long as we have a base/reference to refer to.
New dataframe structure code completed
Things are looking good. Given how little supervision and feedback that I have provided to you, you have implemented quite a bit and done a great job. I wish I had more time to interact more with you and work on this. Unfortunately, that hasn't been the case. With that said, I have looked at the code and have noticed a few locations where we should make some changes. I will do my best to list them here:
PPI CI
ppi_theta = ppi_py.ppi_mean_pointestimate(y_gold, y_gold_fitted, y_fitted) ppi_theta_ci = ppi_py.ppi_mean_ci(y_gold, y_gold_fitted, y_fitted)
becomes: ppi_theta, ppi_theta_ci = do_ppi_estimate_and_ci(y_gold, y_gold_fitted, y_fitted)
where
def do_ppi_estimate_and_ci(y_gold, y_gold_fitted, y_fitted, total_estimate=False): if total_estimate: pass # TODO: when we want to estimate for totals /proprtions else: ppi_theta = ppi_py.ppi_mean_pointestimate(y_gold, y_gold_fitted, y_fitted) ppi_theta_ci = ppi_py.ppi_mean_ci(y_gold, y_gold_fitted, y_fitted) return ppi_theta, ppi_theta_ci
It is unclear to me why we have "primary" and "secondary" metrics?
The dataframe (and csv) file. Right now, these are the columns: ppi_widths,naive_widths,classical_widths,ppi_preds,ppi_lowers,ppi_uppers,naive_preds,naive_lowers,naive_uppers,classical_preds,classical_lowers,classical_uppers,ppi_coverages,naive_coverages,classical_coverages,rho,iteration -My thoughts: I think we can simplify the table, but at the expense/overhead of having to change how the dataframe is being created/grown, and therefore likely how the metrics are represented over the course of the experiment
Explanation: technique will be a string in {naive, classical, ppi, ppi++} that tells us what this row represents in terms of technique model will be a string (or None) in {linear_regression, decision_tree, random_forest, .... } etc. This tells us the model class used for PPI. Repeat is an integer, if num_its is 100, then this will be integers in 0,1,...,99 desired_coverage comes from config file empirical_coverage, ci stuff and estimate is obvious true_parameter is what we expect the estimate to be (the population parameter we are trying to estimate). noise here is the rho value... but I made the column name more general, for when we do tabular data, we won't use rho, ....
My suggestion on how to implement this:
So later when we do plots, everything is in the dataframe, we just go get all the the values for a particular technique and get the appropriate column means
PLOTS: