Integrate Lucene with the equivalence class search

prmr / Creco

Recommendation System for Consumer Products

Apache License 2.0


Integrate Lucene with the equivalence class search #8

Closed forgues closed 10 years ago

forgues commented 10 years ago

Create a class that initializes an index over all equivalence classes (by collapsing the text data of all products under each equivalence class).
Create a method in this class that searches the index with a string query.
Return a list of equivalence classes sorted from most to least relevant to the query.
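The three steps above can be sketched as follows. This is a minimal illustration in plain Python, not the project's Lucene code: the class name `EquivalenceClassIndex`, the `(class_name, text)` input shape, and the scoring formula (square-root term frequency times a smoothed inverse document frequency) are all assumptions chosen for the example, and the scores it produces will not match Lucene's.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class EquivalenceClassIndex:
    """Hypothetical stand-in for a Lucene index: one 'document' per equivalence class."""

    def __init__(self, products):
        # products: iterable of (equivalence_class_name, product_text) pairs.
        # Collapse the text of every product under its equivalence class.
        collapsed = defaultdict(list)
        for class_name, text in products:
            collapsed[class_name].extend(tokenize(text))

        # Term counts per equivalence class.
        self.term_counts = {name: Counter(tokens) for name, tokens in collapsed.items()}

        # Number of equivalence classes that contain each term.
        self.doc_freq = Counter()
        for counts in self.term_counts.values():
            self.doc_freq.update(counts.keys())

    def search(self, query):
        """Return (class_name, score) pairs, most relevant first."""
        n_docs = len(self.term_counts)
        query_terms = tokenize(query)
        results = []
        for name, counts in self.term_counts.items():
            score = 0.0
            for term in query_terms:
                tf = counts.get(term, 0)
                if tf:
                    idf = math.log(1 + n_docs / self.doc_freq[term])
                    score += math.sqrt(tf) * idf
            if score > 0:
                results.append((name, score))
        results.sort(key=lambda pair: pair[1], reverse=True)
        return results
```

Classes that match no query term are left out of the result list, so callers only see equivalence classes with a positive score.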

forgues commented 10 years ago

I started implementing the equivalence class search using Lucene, but the equivalence classes I'm getting are much narrower than I expected. For example, I was expecting a "smart phone" equivalence class, but instead I'm getting an equivalence class for each provider. For the query "iphone", any JACCARD_THRESHOLD above 0 gives me the following:

Found 5 results for "iphone"
Score 1.536207 - T-Mobile smart phones
Score 1.4483498 - AT&T smart phones
Score 1.2801725 - Verizon smart phones
Score 1.2801725 - Sprint Nextel smart phones

Are these really the equivalence classes we want to consider, or should I look into improving the algorithm?

enewe101 commented 10 years ago

Yup, the algorithm is too conservative, in the sense that it fails to group things together in many cases. You could definitely improve it -- I think we'll want to do that eventually if not now.

One thing to keep in mind: using Attributes (Ratings and Specs) to determine equivalence classes guarantees that the downstream processes (the feature selector and critiquing engine) get a set of products that are comparable. Presumably, smart phones aren't being grouped because they have no Attributes in common, so the feature selection and critiquing downstream would have to tangle with a universe of products that is, programmatically at least, heterogeneous.

This is just something to keep in mind.

Message me if the algorithm is hard to follow.
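For readers following along, the idea of grouping products by shared Attributes can be sketched roughly as below. This is not the project's algorithm: the greedy single-pass grouping, the function names, and the input shape (a dict mapping a product name to its set of attribute names) are assumptions made for illustration. Only the Jaccard index itself, |A ∩ B| / |A ∪ B|, is standard.

```python
def jaccard(a, b):
    """Jaccard index of two collections: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here: two empty sets are not considered similar
    return len(a & b) / len(a | b)


def group_by_attributes(products, threshold):
    """Greedily group products whose attribute sets are similar enough.

    products: dict mapping product name -> set of attribute names (hypothetical shape).
    threshold: minimum Jaccard index needed to join an existing group.
    Each product joins the first group whose founding member is similar enough,
    otherwise it starts a new group.
    """
    groups = []
    for name, attrs in products.items():
        for group in groups:
            if jaccard(attrs, group["attributes"]) >= threshold:
                group["members"].append(name)
                break
        else:
            groups.append({"attributes": set(attrs), "members": [name]})
    return [group["members"] for group in groups]
```

With a sketch like this, products that share no attributes score 0 and can never be grouped, whatever the threshold, which is the behavior described above for the smart phones.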

enewe101 commented 10 years ago

Hey, I just looked into this a bit, and there seems to be a bug. The Jaccard score should be really high for smart phones (and I could have sworn that in my Python version they were grouped!). It should actually be something near 1.0.

Although the algorithm might need improvement, let me nab this bug first so it doesn't waste your time!

enewe101 commented 10 years ago

Ok, I just fixed that bug. Smart phones now get a Jaccard score above 0.9. Pull from master to get the fix.