precisely / curious2

Interpret units for graphing purposes #231

Open syntheticzero opened 10 years ago

syntheticzero commented 10 years ago

Plot data should be renormalized according to its units. We will make best guesses at the units based on the range of units in use.

cc @thirdreplicator @heatherannehalpert @visheshd @kimdavis

syntheticzero commented 10 years ago

Design notes:

We may need to use tags as hints for preferred units. This may also require a preliminary form of semantic interpretation of tags.

The basic design is unit groups, such as:

m, meter, feet, ft, yard, yards, miles, km, kilometers, kilometer, kilometre

and so forth. We collect the units used for a given tag and match them against the unit groups to guess the most likely units. Tag identification will also be involved in selecting preferred units.

syntheticzero commented 10 years ago

Right now, the way this works is that the graphing subsystem ignores units when it comes to graphing.

Instead, what we should do is create "unit groups" with relative conversion ratios. So for instance, we have a unit group we can call "distance". It could have members like this:

m, meter, foot, feet, ft, yard, yards, miles, km, kilometers, kilometer, kilometre, kilometres

Each one would have a double constant giving its relative ratio. For instance, foot = 1.0d, yard = 3.0d, and so on.
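A minimal sketch of such a unit group, assuming a plain dictionary from unit string to its ratio relative to one foot. The names DISTANCE_RATIOS, FEET_PER_METER, and convert are hypothetical, and the ratios beyond foot and yard are standard conversion factors rather than values from this ticket.

```python
# Hypothetical "distance" unit group: each unit string maps to its size
# relative to one foot (foot = 1.0, yard = 3.0, as in the example above).
FEET_PER_METER = 1.0 / 0.3048  # one foot is defined as exactly 0.3048 m

DISTANCE_RATIOS = {
    "foot": 1.0, "feet": 1.0, "ft": 1.0,
    "yard": 3.0, "yards": 3.0,
    "miles": 5280.0,
    "m": FEET_PER_METER, "meter": FEET_PER_METER,
    "km": 1000.0 * FEET_PER_METER,
    "kilometer": 1000.0 * FEET_PER_METER,
    "kilometers": 1000.0 * FEET_PER_METER,
    "kilometre": 1000.0 * FEET_PER_METER,
    "kilometres": 1000.0 * FEET_PER_METER,
}


def convert(value, from_unit, to_unit, ratios=DISTANCE_RATIOS):
    """Convert a value between two units that belong to the same group."""
    return value * ratios[from_unit] / ratios[to_unit]
```

With this layout, adding a new spelling of a unit is a one-line change, and any pair of units in the group can be converted without a pairwise table.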

We can create a TagUnitStats class that is updated dynamically whenever an entry is created, deleted, or updated. It holds counts of all the unit strings (including the empty string for no units). Let's make the algorithm this:

TagUnitStats contains userId, tagId, and lastUpdated date, as well as a map from strings to counts. There is also a method to return the total of all the counts (including for the empty string).

If the total count for a given TagUnitStats is less than 10, dynamically update the TagUnitStats on every entry creation, deletion, or update.
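As a rough sketch of the shape described above, assuming an in-memory object rather than a persisted domain class. The field names, the record and needs_dynamic_update methods, and the DYNAMIC_UPDATE_LIMIT constant are illustrative names, not taken from the code base.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Below this total, stats are updated on every entry change (from the ticket).
DYNAMIC_UPDATE_LIMIT = 10


@dataclass
class TagUnitStats:
    user_id: int
    tag_id: int
    last_updated: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    # Maps unit string -> number of uses; "" stands for "no units".
    unit_counts: Counter = field(default_factory=Counter)

    def total(self):
        """Total of all counts, including the count for the empty string."""
        return sum(self.unit_counts.values())

    def needs_dynamic_update(self):
        """True while the tag has too little history to wait for the batch job."""
        return self.total() < DYNAMIC_UPDATE_LIMIT

    def record(self, unit, delta=1):
        """Add (delta=1) or remove (delta=-1) one use of a unit string."""
        self.unit_counts[unit] += delta
        if self.unit_counts[unit] <= 0:
            del self.unit_counts[unit]
        self.last_updated = datetime.now(timezone.utc)
```

An entry update that changes the unit would be recorded as one removal of the old unit followed by one addition of the new one.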

Run a daily batch job to update the other TagUnitStats objects. The batch job should be robust with respect to optimistic locking errors (if one occurs, go on to the next tag). It should also run each query in its own transaction to avoid tying up the database server.

The batch update will go through all the entries for each tag and tally how many times each unit is used.
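The control flow of such a batch job might look like the sketch below. Everything here is a stand-in: OptimisticLockError represents whatever stale-version error the persistence layer raises, and recount_in_transaction is a hypothetical callable that opens its own transaction, recounts the units for one tag, and saves the result.

```python
class OptimisticLockError(Exception):
    """Hypothetical stand-in for the persistence layer's stale-version error."""


def run_batch_update(tag_keys, recount_in_transaction):
    """Recount units for each (user_id, tag_id) key, one transaction per key.

    A lock conflict on one tag does not abort the run: the tag is skipped
    and the job moves on to the next one. Returns the skipped keys.
    """
    skipped = []
    for key in tag_keys:
        try:
            recount_in_transaction(key)
        except OptimisticLockError:
            skipped.append(key)  # go on to the next tag
    return skipped
```

Skipped tags are simply picked up again by the next daily run, which is acceptable because the stats are only used for best guesses.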

We will interpret units as follows:

The most-used units for a given tag will be the assumed unit for entries without units. Thus:

"aspirin 200mg" and "aspirin 200" will mean the same thing.

If a unit isn't found in the TagUnitMap, assume it is the most-used unit:

"aspirin 200mg", "aspirin 200", and "aspirin 200millig" will mean the same thing ("millig" is a misspelling).
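The two rules above can be sketched as a single lookup. This assumes the per-tag counts are available as a Counter and that known_units is the set of all unit strings in the unit groups; resolve_unit is a hypothetical helper name. One assumption goes beyond the ticket: the empty string is skipped when picking the most-used unit, since "no units" cannot itself be the assumed unit.

```python
from collections import Counter


def resolve_unit(unit, unit_counts, known_units):
    """Return the unit to assume for an entry of a given tag.

    A recognised unit is kept as is. A missing or unrecognised unit
    (for example a misspelling) falls back to the most-used non-empty
    unit recorded for the tag. With no usable history, the unit is
    returned unchanged.
    """
    if unit and unit in known_units:
        return unit
    for candidate, _count in unit_counts.most_common():
        if candidate:  # skip the "" bucket for entries without units
            return candidate
    return unit
```

For the aspirin example, counts of {"mg": 5, "": 3} make "aspirin 200", "aspirin 200mg", and "aspirin 200millig" all resolve to "mg".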

Finally, when sending query data back to the client, normalize it to the most-used unit, using the ratios in the unit group.

Note that two unit groups may share some of the same unit strings. The unit group that matches the stats most closely (the greatest number of matches) wins.
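One way to read "most number of matches" is to count how many of the distinct unit strings recorded for a tag appear in each group. The sketch below takes that reading as an assumption; UNIT_GROUPS is a trimmed, illustrative subset of the groups in this ticket and best_group is a hypothetical name.

```python
# Trimmed, illustrative unit groups; note that "m" appears in both.
UNIT_GROUPS = {
    "distance": {"m", "meter", "foot", "feet", "ft", "yard", "miles", "km"},
    "duration": {"m", "min", "mins", "h", "hrs", "hour", "day", "week"},
}


def best_group(units, groups=UNIT_GROUPS):
    """Return the group containing the most of the given unit strings.

    Returns None when no group matches any unit at all. On a tie, the
    first group in iteration order is returned.
    """
    matches = {name: len(set(units) & members)
               for name, members in groups.items()}
    name = max(matches, key=matches.get)
    return name if matches[name] > 0 else None
```

A tag whose entries use "m" and "km" lands in "distance", while one using "m" and "hrs" lands in "duration", even though "m" alone is ambiguous.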

Sample unit groups to start with:

distance: m, meter, foot, feet, ft, yard, yards, miles, km, kilometers, kilometer, kilometre, kilometres

duration: m, min, mins, minute, minutes, h, hours, hrs, hour, day, days, d, week, weeks, wks, wk, month, months, mnths, year, years, y, century, centuries, ms, millisecond, milliseconds, microsecond, microseconds, picosecond, picoseconds

(For month, just assume it is 30.4375 days, i.e. 365.25 / 12.)

weight: g, grams, pound, pounds, lbs, kg, kilograms, (etc.... add on to this list)

And so on. We can just start with these three; it's easy to add more once we have this implemented.

visheshd commented 10 years ago

Note that two unit groups may have the same unit strings shared between them. The units group that matches the stats most closely (most number of matches) wins.

Initially, when there are no stats, how do we resolve the shared unit strings?

syntheticzero commented 10 years ago

What we want to do eventually is use the tag itself as a hint. So there should be a method that computes the appropriate tag unit group for a given tag and all the units associated with the tag, and eventually we'll use the tag itself to give hints about the units.

However, for now, for any given unit string we should have a prioritization of which unit group is most likely to be the correct interpretation. For example, "m" should have a stronger association with "meters" and a weaker one with "minutes". I guess what we could do is associate a numeric score with each unit string in each unit group, to indicate how "strongly" we think the unit belongs in that group. So instead of just a list of strings, each group is a list of (string, number) pairs. We can then score a group by the sum of the numbers for all matching strings, not just by the number of matching strings.

If, even after all this, the scores are the same for two unit groups, just pick one; it doesn't matter which.

syntheticzero commented 10 years ago

Let's make the association scores, say, 1-10. Go ahead and assign whatever you think is appropriate for now and we'll go over these later and tweak them.
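A small sketch of the scored variant, assuming each group maps unit strings to an association score on the 1-10 scale. The specific scores, the GROUP_SCORES table, and the best_group_by_score name are placeholders for illustration, not values agreed on in this ticket.

```python
# Placeholder scores (1-10): "m" is strongly "meters", weakly "minutes".
GROUP_SCORES = {
    "distance": {"m": 8, "meter": 10, "km": 10, "ft": 10, "miles": 10},
    "duration": {"m": 3, "min": 10, "mins": 10, "h": 10, "hrs": 10},
}


def best_group_by_score(units, group_scores=GROUP_SCORES):
    """Return the group with the highest summed association score.

    Each unit string seen for the tag contributes its score in that
    group (0 if absent). Ties go to the first group in iteration order,
    since the choice between equal scores does not matter. Returns None
    if no group recognises any of the units.
    """
    totals = {name: sum(scores.get(unit, 0) for unit in units)
              for name, scores in group_scores.items()}
    best = max(totals, key=totals.get)
    return best if totals[best] > 0 else None
```

With these placeholder scores, a tag that has only ever used "m" resolves to "distance" (8 against 3), which also answers the no-stats case for a shared unit string.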

visheshd commented 10 years ago

Implementation Notes

TagUnitStats

A simple domain class that tracks how many times a particular unit has been used for a particular tag by a user.

UnitGroupMap

A class where we group units into unit groups for the purposes of conversion.

Each time a user makes an entry, we check whether the unit was used before and update the count. If this unit is being used for the first time for this tag and it is not in one of our groups in UnitGroupMap, then we use the most-used unit for this tag.

If this unit is not in our unit groups and this is the first entry for this tag, we create a new TagUnitStats with whatever unit came with the entry.

Plot Data normalization

No normalization happens if there is no (tag + unit) usage history, if the unit used has not been grouped (since it can't be converted), or if the two units belong to two different groups and so can't be converted.

If the two units belong to the same group, the conversion happens. First the value is converted to the unit whose relative ratio is 1, and then it is converted to the intended unit.
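The normalization rules above could be sketched as follows, assuming each group is a dictionary of ratios relative to its ratio-1 unit. The group contents and the normalize helper are illustrative and do not reflect the actual UnitGroupMap code.

```python
# Illustrative groups: ratios are relative to the group's ratio-1 unit.
WEIGHT = {"g": 1.0, "grams": 1.0, "kg": 1000.0, "kilograms": 1000.0}
DISTANCE = {"ft": 1.0, "feet": 1.0, "yard": 3.0, "yards": 3.0}
GROUPS = [WEIGHT, DISTANCE]


def normalize(value, unit, target_unit, groups=GROUPS):
    """Return value expressed in target_unit, or unchanged if not convertible.

    The value is left alone when either unit is ungrouped or when the
    two units live in different groups.
    """
    for ratios in groups:
        if unit in ratios and target_unit in ratios:
            base_value = value * ratios[unit]        # step 1: to the ratio-1 unit
            return base_value / ratios[target_unit]  # step 2: to the target unit
    return value
```

Here target_unit would be the most-used unit for the tag, so a plot mixing "kg" and "g" entries ends up on one consistent scale.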

visheshd commented 10 years ago

Tested track interface, graph interface and community interface. 0 tests failing. Ready to be merged at https://github.com/syntheticzero/curious2/tree/tag-unit-groups

syntheticzero commented 10 years ago

Please see comments in code review #316

syntheticzero commented 10 years ago

Let's not forget this ticket. The code is not yet finished; please address the code review comments.

visheshd commented 10 years ago

Addressed the issues on the pull request. Ready to be merged.