Ranking according to non-numeric attributes

forgues commented 10 years ago

Our ranking algorithm is currently used mainly to rank numeric attributes, but users can also select non-numeric attributes (e.g. string or boolean). When a user selects such an attribute, the algorithm ranks products only by whether they have it and ignores the attribute's actual value: all products that have the attribute go to the top of the list and all products that lack it go to the bottom, with no specific ordering within each group.

It would be much better if we could score boolean/string attributes based on their value, just as we do for numeric attributes. If we think it's worth it, I can spend some time investigating how to do this. If not, we could simply remove non-numeric attributes from the attribute selection list.

forgues commented 10 years ago

I find that computing the average overall score for each string/boolean value isn't so bad. Of course, it's not perfect, but I think it should work well enough.

Here's an example where it works well (in humidifiers):

ANTIMICROBIAL EFFECTIVENESS
"false" mean score: 58.701069704761885
"true" mean score: 73.14603632

And one where it doesn't work so well (still for humidifiers):

COLOR
"Silver" mean score: 33.216752
"White/blue" mean score: 38.8064452
"Black" mean score: 46.0890982
"White" mean score: 66.06723317142857
"Pink" mean score: 81.716102

(There are several white humidifiers, but only a single silver, white/blue, black and pink humidifier.)
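For reference, the per-value averaging described above can be sketched roughly as follows. This is only an illustration: `NominalMeanScores`, `ScoredProduct` and `meanScoreByValue` are made-up names, not the actual Creco classes, and each product is reduced to one attribute value plus its overall score.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NominalMeanScores {
    /** Hypothetical stand-in for a product: one nominal value plus the product's overall score. */
    static final class ScoredProduct {
        final String value;
        final double overallScore;

        ScoredProduct(String value, double overallScore) {
            this.value = value;
            this.overallScore = overallScore;
        }
    }

    /** Returns the mean overall score of the products sharing each attribute value. */
    static Map<String, Double> meanScoreByValue(List<ScoredProduct> products) {
        // value -> {sum of scores, number of products}
        Map<String, double[]> sums = new LinkedHashMap<>();
        for (ScoredProduct p : products) {
            double[] acc = sums.computeIfAbsent(p.value, k -> new double[2]);
            acc[0] += p.overallScore;
            acc[1] += 1;
        }
        Map<String, Double> means = new LinkedHashMap<>();
        sums.forEach((value, acc) -> means.put(value, acc[0] / acc[1]));
        return means;
    }
}
```

A value held by a single product simply gets that product's score as its mean, which is why the COLOR numbers above are so noisy.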

So I think I'll use the average overall score to determine which string/boolean values are better than others. Now I want to weight the importance of the attribute. Ideally, "antimicrobial effectiveness" would be weighted more heavily than "color", because most "color" values are represented by a single product each. If anyone has a suggestion for weighting these nominal attributes, let me know.

asutcl commented 10 years ago

I am already computing the entropy for these attributes.

One idea we could play with is to set a default value for the correlation of boolean/string attributes.

Then the score could be entropy*correlation, for example. This can be very crude though. We should discuss this in greater detail this afternoon if you have time.
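A very rough sketch of what entropy*correlation could look like, assuming Shannon entropy over the number of products per value and a made-up default correlation of 0.5. The thread does not say how the entropy is actually computed, so `NominalWeightSketch`, `DEFAULT_CORRELATION` and both method names are hypothetical.

```java
import java.util.Collection;

final class NominalWeightSketch {
    /** Hypothetical default correlation assigned to every boolean/string attribute. */
    static final double DEFAULT_CORRELATION = 0.5;

    /** Shannon entropy (in bits) of a value distribution given as per-value product counts. */
    static double entropy(Collection<Integer> counts) {
        double total = 0;
        for (int c : counts) {
            total += c;
        }
        double h = 0;
        for (int c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    /** Crude attribute weight: entropy of the value distribution times the default correlation. */
    static double weight(Collection<Integer> counts) {
        return entropy(counts) * DEFAULT_CORRELATION;
    }
}
```

For example, an attribute split evenly between two values has an entropy of 1 bit, so its weight under these assumptions would be 0.5.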

forgues commented 10 years ago

Yes, we could certainly combine entropy and correlation. But before doing that, I'd like to compute an actual weight for boolean/string attributes instead of just giving them a default value.

The best way I can think of is using the average scores as cluster centroids, and then clustering products to the nearest centroid. We could use the number of correctly predicted attribute values as the attribute's weight, and perhaps normalize by the number of clusters. But there might be an easier/better way I haven't thought of.
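A minimal sketch of that idea, assuming the overall score is the only dimension used for the distance and that the weight is the fraction of correctly predicted values. `CentroidWeightSketch` and its methods are illustrative names only, and the optional normalization by the number of clusters is left out because its exact form is not settled here.

```java
import java.util.List;
import java.util.Map;

final class CentroidWeightSketch {
    /** Returns the value whose mean score (centroid) lies closest to the given score. */
    static String nearestValue(double score, Map<String, Double> centroids) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Double> e : centroids.entrySet()) {
            double distance = Math.abs(score - e.getValue());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = e.getKey();
            }
        }
        return best;
    }

    /**
     * Fraction of products whose actual value matches the nearest centroid.
     * values.get(i) and scores.get(i) describe the same product.
     */
    static double weight(List<String> values, List<Double> scores, Map<String, Double> centroids) {
        if (values.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < values.size(); i++) {
            if (values.get(i).equals(nearestValue(scores.get(i), centroids))) {
                correct++;
            }
        }
        return (double) correct / values.size();
    }
}
```

One caveat with this sketch: a value held by a single product has that product's score as its centroid, so barring ties it is always predicted correctly, which would tend to favour attributes like COLOR.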

forgues commented 10 years ago

I spoke with @asutcl and I will replace the field private List<TypedValue> aStringValues; in ScoredAttribute with a Map<TypedValue, Double>. Instead of being a list of possible values, it will map each possible value to its score. The map can then be used to tell whether a value is good or bad compared to other values, based on their relative scores.
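To illustrate how such a map could be used, here is a small sketch that places a value between the worst and best value of the attribute. It uses String keys as a stand-in for TypedValue, and `ValueScoreSketch` and `relativeScore` are hypothetical names, not part of ScoredAttribute.

```java
import java.util.Collections;
import java.util.Map;

final class ValueScoreSketch {
    /** Position of a value's score between the worst (0.0) and best (1.0) value of the attribute. */
    static double relativeScore(Map<String, Double> valueScores, String value) {
        Double score = valueScores.get(value);
        if (score == null) {
            throw new IllegalArgumentException("Unknown value: " + value);
        }
        double min = Collections.min(valueScores.values());
        double max = Collections.max(valueScores.values());
        // When every value has the same score, no value is better than another.
        return max == min ? 0.5 : (score - min) / (max - min);
    }
}
```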

forgues commented 10 years ago

Merged into master. Products are now ranked according to string and boolean attributes as well.

asutcl commented 10 years ago

How are boolean values converted to strings for the hash maps? My issue arises when you pass a boolean TypedValue to the ScoredAttribute and want to know whether true or false is better. How do I check the scores of true and false from the nominal correlator?

forgues commented 10 years ago

I convert boolean values to strings because for the nominal correlator there's really no difference between a boolean attribute and a string attribute. I didn't want to create two methods or two maps, when strings and booleans are really handled the same way.

If you want to check the score of a boolean you can simply do String.valueOf(booleanValue) and check that string in the map. If you would prefer that I make some change, just let me know.
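Something along these lines, assuming for illustration that the scores are exposed as a map keyed by the string form of each value (the real map is keyed by TypedValue, and `BooleanLookupSketch` is a made-up helper):

```java
import java.util.Map;

final class BooleanLookupSketch {
    /** Looks up the score stored under "true" or "false"; returns null if the value was never scored. */
    static Double scoreOf(Map<String, Double> valueScores, boolean value) {
        return valueScores.get(String.valueOf(value));
    }

    /** Returns true when "true" has the higher score; missing scores count as negative infinity. */
    static boolean trueIsBetter(Map<String, Double> valueScores) {
        double whenTrue = valueScores.getOrDefault(String.valueOf(true), Double.NEGATIVE_INFINITY);
        double whenFalse = valueScores.getOrDefault(String.valueOf(false), Double.NEGATIVE_INFINITY);
        return whenTrue > whenFalse;
    }
}
```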

asutcl commented 10 years ago

I just wanted to know how to find it, thanks!

forgues commented 10 years ago

Ok, closing the issue then.

prmr / Creco

Ranking according to non-numeric attributes #90