Closed overthink closed 9 years ago
Hi Mark,
I have looked at your message. I will make the next update on Sunday. Fixing the output format so that there is one result per line will not take long.
In the problem set, I assumed that a product appearing in both datasets must have the same manufacturer in each. This seemed a reasonable assumption to me, as it matches my intuition.
However, I have been thinking about ways to fix the output being 20x smaller than expected. There are a number of approaches that I would try.
1) Develop a better hash function.
2) Use a nearest neighbour classification. This would mean hashing reviews that are within a set threshold of each other to the same bucket. This is probably the best solution to your problem.
3) After perusing the algorithm, I can say with some level of certainty that preprocessing methods such as stopword removal can help us remove noise that can affect the quality of the hash generated.
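A rough sketch of how approaches 2 and 3 might look, not the code from the actual submission. The stopword list, the function names (normalize, bucket_key, jaccard, is_match), the choice of MD5 and the 0.7 threshold are all assumptions made for illustration. Jaccard similarity stands in here for the "within a set threshold" test.

```python
import hashlib
import re

# Illustrative stopword list (assumption); a real one would be larger.
STOPWORDS = frozenset({"a", "an", "and", "for", "of", "the", "with"})


def normalize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def bucket_key(text):
    """Hash the sorted set of cleaned tokens, so word order,
    punctuation and stopwords do not change the bucket."""
    tokens = sorted(set(normalize(text)))
    return hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    sa, sb = set(normalize(a)), set(normalize(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_match(a, b, threshold=0.7):
    """Treat two strings as the same item when they are similar enough.
    The threshold value is a guess and would need tuning."""
    return jaccard(a, b) >= threshold
```

With this, "The Canon PowerShot A495" and "canon powershot-a495" fall into the same exact-hash bucket, while a string with one extra word can still be matched through is_match. An exact hash alone would miss the second case, which is one way an output can end up much smaller than expected.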
See you on Sunday when I make the update.
Kenneth
Hi Mark,
I have tried to address the 20x reduction in output size. The hash function looked sound to me, so I worked on preprocessing instead, such as stop word removal and data cleaning. I will now have to switch back to my current research projects, as I won't have much time to keep studying techniques to improve the code. Thanks for the coding challenge.
Kenneth Odoh
Hi Kenneth.
Could you please ensure the format of your output file matches the one described at http://sortable.com/challenge/ -- in particular "The output your solution creates should be a text file with one Result object per line".
Also, the size of your output is about 20x smaller than I'd expect, so you may want to look into that.
Thanks for your time.
Mark (from sortable.com)
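For the "one Result object per line" format mentioned above, a minimal sketch follows, assuming each Result is serialized as JSON. The field names product_name and listings, and the helper names result_lines and write_results, are assumptions for illustration and should be checked against the challenge page.

```python
import json


def result_lines(results):
    """Yield one compact JSON string per Result.
    json.dumps escapes newlines inside strings, so each Result
    stays on a single line."""
    for result in results:
        yield json.dumps(result, ensure_ascii=False)


def write_results(results, path):
    """Write the results to a text file, one Result object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in result_lines(results):
            f.write(line + "\n")


# Hypothetical data, only to show the shape of the output.
example = [
    {"product_name": "Canon_PowerShot_A495", "listings": []},
    {"product_name": "Nikon_Coolpix_S6100", "listings": []},
]
```

Calling write_results(example, "results.txt") would produce a file with two lines, each of which parses on its own as a JSON object.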