Closed (forgues closed 10 years ago)
I started implementing the equivalence class search using Lucene, but the equivalence classes I'm getting are much narrower than I expected. For example, I was expecting a "smart phone" equivalence class, but instead I'm getting an equivalence class for each provider. For the query "iphone", any JACCARD_THRESHOLD above 0 gives me the following:
Found 5 results for "iphone"
Score 1.536207 - T-Mobile smart phones
Score 1.4483498 - AT&T smart phones
Score 1.2801725 - Verizon smart phones
Score 1.2801725 - Sprint Nextel smart phones
Are these really the equivalence classes we want to consider, or should I look into improving the algorithm?
Yup, the algorithm is too conservative, in the sense that it fails to group things together in many cases. You could definitely improve it -- I think we'll want to do that eventually if not now.
One thing to keep in mind: using Attributes (Ratings and Specs) to determine equivalence classes guarantees that the downstream processes (the feature selector and critiquing engine) get a set of products that are comparable. Presumably, smart phones aren't being grouped because they have no Attributes in common, and therefore the feature selection and critiquing downstream would have to tangle with a universe of products that is, programmatically at least, heterogeneous.
This is just something to keep in mind.
Message me if the algorithm is hard to follow.
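For anyone following along, here is a minimal sketch of the Jaccard idea behind the Attribute-based grouping discussed above. It is not the project's actual code: the `jaccard` helper and the attribute names are made up for illustration, and treating two empty sets as dissimilar is just one possible convention.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch: two empty sets score 0
    return len(a & b) / len(a | b)


# Hypothetical attribute names, for illustration only.
tmobile_phone = {"screen size", "battery life", "camera", "T-Mobile coverage"}
att_phone = {"screen size", "battery life", "camera", "AT&T coverage"}
blender = {"wattage", "capacity"}

print(jaccard(tmobile_phone, att_phone))  # 3 shared out of 5 distinct -> 0.6
print(jaccard(tmobile_phone, blender))    # nothing in common -> 0.0
```

Products that share most of their Attributes score close to 1.0, while products with no Attributes in common score 0 and would never land in the same class at any threshold above 0.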
Hey, I just looked into this a bit, and there seems to be a bug. The Jaccard score should be really high for smart phones (and I could have sworn that in my Python version they were grouped!). It should actually be something near 1.0.
Although the algorithm might need improvement, let me nab this bug first so it doesn't waste your time!
Ok, I just fixed that bug. Smart phones now get a Jaccard score above 0.9. Pull from master to get the fix.
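One rough way to sanity-check the effect of a threshold on grouping is sketched below. This is only an illustration under assumptions, not the algorithm in the repo: `group_by_threshold`, the single-link (union-find) grouping strategy, the catalog entries, and the `JACCARD_THRESHOLD` value of 0.5 are all hypothetical.

```python
from itertools import combinations

JACCARD_THRESHOLD = 0.5  # illustrative value, not the project's setting


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_threshold(items, threshold=JACCARD_THRESHOLD):
    """Single-link grouping: any pair scoring >= threshold is merged into one class."""
    parent = {name: name for name in items}

    def find(x):
        # Follow parent links to the class representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if jaccard(items[a], items[b]) >= threshold:
            parent[find(a)] = find(b)

    classes = {}
    for name in items:
        classes.setdefault(find(name), set()).add(name)
    return sorted(classes.values(), key=lambda c: sorted(c))


# Hypothetical attribute sets, for illustration only.
catalog = {
    "T-Mobile smart phones": {"screen", "battery", "camera", "os"},
    "AT&T smart phones": {"screen", "battery", "camera", "os"},
    "Verizon smart phones": {"screen", "battery", "camera"},
    "Blenders": {"wattage", "capacity"},
}

for group in group_by_threshold(catalog):
    print(sorted(group))
```

With these made-up sets, the three phone entries merge into one class at a threshold of 0.5, while a stricter threshold of 0.9 splits the Verizon entry off because it scores only 0.75 against the other two.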