Finalize graphic representation of data

dman7 commented 9 years ago

The current time-series unfortunately obfuscates how many locations were positive or potential before being eliminated. This commonly happens when a location is eliminated the same day it's identified. In my correspondence with Marco:

"I'm starting to understand what Harold wants, and I think we're converging on the same solution. To make sure we're on the same page, I see the following problem:

Harold's team is quick to eliminate inspected locations. In fact, they eliminate most of them within a day.
Harold wants to know how many inspected locations are positive and potential on a given day, even if they are eliminated the same day.

My initial thought is that what we need here is to track the time of the day that a location was identified positive, potential, and time of day it was eliminated. Doing so will help us because:

We know, down to the second of the day, if the location is positive, potential, or eliminated.
This extreme granularity allows us to say things like "this location was potential in the morning, but was eliminated by dinner time".

This changes our time series of locations slightly: suppose a location was identified POSITIVE on day T. There are three scenarios:

Location was not eliminated: In this case, the location remains POSITIVE for [T, T+7], including the day T+7.
Location was eliminated same day (day T): In this case, the location remains POSITIVE for day T, and then becomes ELIMINATED starting day T+1.
Location was eliminated on day T+7: In this case, the location remains POSITIVE for [T, T+7). Note that the location is labeled ELIMINATED starting day T+7. This may be a point of contention: should it remain POSITIVE on day T+7 since the visit took place on day T+7?"

The solution is to introduce a time element to location_statuses that

tracks the identification type (e.g. positive, potential or negative) with identification_type
tracks the identification time with identified_at
tracks the time of elimination (if applicable) with cleaned_at

We should also deprecate the status column as the "status" will now be calculated by comparing identification_type, identified_at, and cleaned_at.
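As a rough illustration of how a status could be derived from those three columns instead of being stored, here is a minimal sketch. `StatusRow` and `status_at` are hypothetical names used only for this example; they are not the app's actual model or API, and the rule that elimination takes effect at the `cleaned_at` timestamp is an assumption.

```ruby
require "time"

# Hypothetical stand-in for a location_statuses row; only the three columns
# discussed above are modelled.
StatusRow = Struct.new(:identification_type, :identified_at, :cleaned_at) do
  # Returns the status at the given moment, or nil if the location had not
  # been identified yet. Assumption: elimination applies from cleaned_at onward.
  def status_at(time)
    return nil if time < identified_at
    return :eliminated if cleaned_at && time >= cleaned_at

    identification_type
  end
end

row = StatusRow.new(
  :potential,
  Time.parse("2015-01-27 09:00:00 UTC"),
  Time.parse("2015-01-27 18:00:00 UTC")
)

row.status_at(Time.parse("2015-01-27 12:00:00 UTC")) # still potential at noon
row.status_at(Time.parse("2015-01-27 20:00:00 UTC")) # eliminated by the evening
```

With timestamps rather than dates, a location that is identified and eliminated on the same day is still visible as potential (or positive) for part of that day.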

dman7 commented 9 years ago

What's left to do:

[x] Add a rake task to backfill identification_type, identified_at and cleaned_at of existing LocationStatus,
[x] Add callback to Report model to delete LocationStatus whenever a Report is deleted.
[x] Add tests,
[x] Rename location_statuses to visits
[x] Update the time series algorithm to dynamically calculate a Visit's status

dman7 commented 9 years ago

More work needs to be done per this Saturday's meeting. See issue #425 for talking points.

dman7 commented 9 years ago

Based on #425, here is the proposed algorithm update:

[x] Add visit_type to visits table. This allows us to separate "inspection" visits from "follow-up" visits. These two types of visits serve different purposes.
[x] Change time-series algorithm to add points only for dates that we have data for. No need to iterate over every single day between two dates.
[x] Deprecate status column,
[x] Deprecate identified_at and cleaned_at columns. These should be replaced by visited_at column,
[x] Replace reference to identified_at and cleaned_at. We're no longer using these columns to identify the status of a visit.
[x] Update csv_reports/index.html page with new statistics conventions,
[x] The rake backfill cannot use the existing Report methods because they rely on the report's current status. This is obviously incorrect because it should reflect what the status was on the report's creation/elimination date. For that reason, when creating a new inspection visit, we need to manually define the associated report as positive/potential based on the larvae and protected columns.
[x] See why the timeframes don't work (e.g. 3 months)
[x] See what we can do with regard to x-axis label appearing.
[x] Add 6 months as an option.
[x] Add Spanish/Portuguese translation to timeframe filters
- 1 mes, 3 meses, 6 meses, Todo el tiempo
- 1 mês, 3 meses, 6 meses, ???
[x] Change sequence of visit type filter: group positives together; group potentials together,
[x] Update CSV parsing to identify visits based on "identification date" and "elimination date",
[x] Update test suite to test Visit model,
[x] Fix after_filter and before_filter for calculating statistics.
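
The "add points only for dates that we have data for" item in the list above can be sketched roughly as follows. `VisitRecord` and `time_series` are hypothetical names for this example only, and the shape of each point is an assumption rather than the app's real output.

```ruby
require "date"

# Hypothetical visit record: the visit date and what was identified.
VisitRecord = Struct.new(:visited_at, :identification_type)

# Builds one point per date that actually has visits, in date order,
# instead of iterating over every day between the first and last date.
def time_series(visits)
  visits
    .group_by { |visit| visit.visited_at }
    .sort_by { |date, _| date }
    .map do |date, on_date|
      {
        date: date,
        positive: on_date.count { |v| v.identification_type == :positive },
        potential: on_date.count { |v| v.identification_type == :potential }
      }
    end
end

visits = [
  VisitRecord.new(Date.new(2015, 1, 27), :positive),
  VisitRecord.new(Date.new(2015, 1, 11), :positive),
  VisitRecord.new(Date.new(2015, 1, 11), :potential)
]

series = time_series(visits)
# Two points (2015-01-11 and 2015-01-27); the days in between produce nothing.
```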

The last bullet point raises an interesting question: do the brigade members record a follow-up visit in the "identification date" column or only in the "elimination date" column? @brujonildo, can you shed light on this issue? If an identification visit occurs on "2015-01-27", and 3 sites are identified, then there will be 3 row entries in the CSV report (that's agreed upon). Now suppose a follow-up visit takes place on "2015-01-28". Is there a new row entry with "2015-01-28" in "inspection date", or does that visit just update "elimination date" for the existing 3 rows?

dman7 commented 9 years ago

Consider the following chart of Francisco Meza on 2015-02-02:

[Screenshot, 2015-02-02 at 12:00:48 pm]

The 2% on 2015-01-11 is the number of houses that were identified as positive on the inspection date relative to the total number of houses in the neighborhood. This is a misleading metric: it only tells us how that day's work compares to the total. Instead, we should add a toggle with these metrics:

[x] Daily metric: Positivity is the percent of houses identified as positive on the visit date relative to the total number of houses visited on that date.
[x] Cumulative metric: Positivity is the percent of houses identified as positive on or before the visit date relative to the total number of houses in the neighborhood.

For instance, consider the following example: Suppose we have 100 houses. On 2015-01-11, you find 5 positive houses. On 2015-01-12, you find 5 houses with positivity (different houses). Naturally, we expect 2015-01-11 to be 5% positive and 2015-01-12 to be 10% positive. The above graph and the underlying algorithm return 5% on 2015-01-11 and 5% on 2015-01-12. Confusing...
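
To make the two metrics concrete, here is a small sketch that runs the 100-house example. `DayTally`, `percent`, `daily_percentages` and `cumulative_percentages` are hypothetical names, and the numbers of houses visited per day (20 and 25) are made up for illustration since the example above does not state them. The cumulative version also assumes each day's positive houses are distinct, as in the example.

```ruby
# Hypothetical per-day tallies: houses visited and houses found positive.
DayTally = Struct.new(:date, :visited, :positive)

# Percentage helper that avoids dividing by zero.
def percent(part, whole)
  whole.zero? ? 0.0 : (100.0 * part / whole).round(1)
end

# Daily metric: positives on a date relative to houses visited on that date.
def daily_percentages(tallies)
  tallies.map { |t| [t.date, percent(t.positive, t.visited)] }.to_h
end

# Cumulative metric: positives found on or before a date relative to all
# houses in the neighborhood.
def cumulative_percentages(tallies, total_houses)
  running = 0
  tallies.sort_by(&:date).map do |t|
    running += t.positive
    [t.date, percent(running, total_houses)]
  end.to_h
end

tallies = [
  DayTally.new("2015-01-11", 20, 5),
  DayTally.new("2015-01-12", 25, 5)
]

daily_percentages(tallies)
# => {"2015-01-11"=>25.0, "2015-01-12"=>20.0}
cumulative_percentages(tallies, 100)
# => {"2015-01-11"=>5.0, "2015-01-12"=>10.0}
```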

And we should add the following labels:

[x] Cumulative percentage relative to all houses visited - Porcentaje acumulado relativo a todos los lugares visitados
[x] Daily percentage relative to houses visited on a date - Porcentaje relativo a los lugares visitados en un dia

Finally, I also want to test the different filters when we don't have cookies set:

[x] Ensure that the checks for filter settings in the cookies work when no cookies are set.
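
One way to keep the filter checks safe when nothing is stored yet is to fall back to defaults for any missing or blank value. The sketch below uses a plain Hash in place of the Rails cookie jar; the cookie keys and default values are assumptions made for this example, not the app's actual names.

```ruby
# Assumed defaults; the real cookie names and values are not given in this thread.
DEFAULT_FILTERS = { "timeframe" => "1", "metric" => "daily" }.freeze

# `cookies` is anything hash-like (a plain Hash here). Missing or blank
# entries fall back to the defaults, so an empty cookie jar is safe.
def chart_filters(cookies)
  DEFAULT_FILTERS.each_with_object({}) do |(key, default), filters|
    value = cookies[key]
    blank = value.nil? || value.to_s.strip.empty?
    filters[key] = blank ? default : value.to_s
  end
end

chart_filters({})                   # every filter takes its default
chart_filters("timeframe" => "6")   # timeframe from the cookie, metric defaulted
```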

dman7 commented 9 years ago

There are a couple of final things to make sure:

[x] Ensure that the checks for filter settings in the cookies work when no cookies are set.
[x] When setting an identification type for a location, you should only look at today's reports rather than all history. This is because when we ask what was the identification for a location on that day, we can only draw from that day's data... the problem with this is that it may mess up how we calculate cumulative percentages. Worth a thought (or does it mess up? Perhaps we can have a tracking algorithm that updates the status selectively when iterating in the calculate_cumulative method).

dman7 commented 9 years ago

After receiving Harold's data, I think it's best to proceed by removing most filters and only keeping "Positive", "Potential" and "Daily metric". Why? Harold's graphs contain only those, with no distinction between initial and follow-up visits. The graphs also do not contain a cumulative metric.

[x] Update the filters to display only positive/potential and daily metric (keep timeframe),
[x] Ensure that the filters work by checking that the cookie is updated.

dman7 commented 9 years ago

[x] Add green bar with legend "Lugares sin criaderos"
[x] Count "potential" houses even if the location is "positive"
[x] Add a disclaimer to charts: Nota: Un lugar puede ser contado doble si tiene criaderos positivos y también potenciales
Stylistic changes to bar chart:
- [x] Add numbers above the bars
- [x] Define dates vertically in the chart
- [x] Make the bars wider
- [x] Make the bars change with the timeframe
- [x] Remove "Porcentaje de" from all labels
[x] Change to line graph for 6 or more month timeframe.
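
The counting rule behind the bars (a location with both positive and potential breeding sites counts in both categories, hence the disclaimer) can be sketched as below. `LocationSites` and `bar_counts` are hypothetical names for this example, and the per-location site counts are made up.

```ruby
# Hypothetical summary of one location's breeding sites on a given date.
LocationSites = Struct.new(:positive_sites, :potential_sites)

# Counts locations per bar. A location with both kinds of sites is counted
# under :positive and :potential, so the bars can add up to more than the
# number of locations. :clean corresponds to "Lugares sin criaderos".
def bar_counts(locations)
  {
    positive: locations.count { |l| l.positive_sites > 0 },
    potential: locations.count { |l| l.potential_sites > 0 },
    clean: locations.count { |l| l.positive_sites.zero? && l.potential_sites.zero? }
  }
end

locations = [
  LocationSites.new(1, 2), # counted twice: positive and potential
  LocationSites.new(0, 1),
  LocationSites.new(0, 0)
]

bar_counts(locations)
# => {:positive=>1, :potential=>2, :clean=>1}
```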

dman7 commented 9 years ago

We've reached a satisfying checkpoint with the charts. Here is the 1-month view:

[Screenshot, 2015-02-13 at 5:01:03 pm]

And here is the 6-month view:

[Screenshot, 2015-02-13 at 5:01:14 pm]

socialappslab / denguechat

Finalize graphic representation of data #416