recommeddit / labs

ML/data experiments for Recommeddit
MIT License

Recommendation cross-referencing #11

Closed Gopu2001 closed 2 years ago

Gopu2001 commented 2 years ago

There are several ways we can go about this:

  1. Use a knowledge graph (KG) API (i.e., someone else's KG) to check whether the product has some popularity signal.
     a. Does not work for unpopular products that could still be recommended.
     b. Not sure whether we would be able to edit the graph and add the new up-and-comers / less popular products in the industry.
  2. Create a KG of our own that adds new information as we get it (high customizability for runtime and flexibility).
     a. May take a lot of time, effort, and other resources to build.
     b. Can be optimized to make each subsequent related search on the Recommeddit platform faster.
     c. This should be coupled with another idea.
  3. Basic solution: Google search with quotation marks around the product name and check whether there is more than one mention on the Internet.
     a. Internet searching to verify hundreds of product recommendations (e.g., someone spamming a list of movies to watch) could take a long time no matter how fast the connection is.
  4. Check whether the same product has been mentioned in the same thread more than just that once (a sketch of this check follows the list).
     a. Super fast processing (the data is already there): no time bottleneck on the Internet and no memory bottleneck from a KG.
  5. Search for whether the product has its own subreddit.
     a. Highly dependent on the product being (1) popular and (2) old enough to have a subreddit.
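To make option 4 concrete, here is a minimal sketch of the in-thread check, assuming we already hold the thread's comment bodies as plain strings from the extraction step; the function names and the `min_mentions` cutoff are placeholders, not existing code in this repo.

```python
import re

def count_thread_mentions(candidate: str, comments: list[str]) -> int:
    """Count how many comments in the thread mention the candidate product."""
    pattern = re.compile(re.escape(candidate), re.IGNORECASE)
    return sum(1 for body in comments if pattern.search(body))

def is_cross_referenced_in_thread(candidate: str, comments: list[str],
                                  min_mentions: int = 2) -> bool:
    """Treat the candidate as cross-referenced if it shows up in at least
    min_mentions separate comments of the same thread. The data is already
    local, so there are no network round trips."""
    return count_thread_mentions(candidate, comments) >= min_mentions
```

Usage would just be `is_cross_referenced_in_thread("CLion", comment_texts)` with whatever list of comment strings the scraper already produced.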
SwiftWinds commented 2 years ago

Original issue text: Research methods of checking whether the extracted recommendation is actually a valid recommendation (e.g., if I extract "CLion", how do I check that it's a valid response to "best C++ IDEs reddit", i.e., that it is a C++ IDE?). Pointers: perhaps programmatically Google the extracted name plus "C++ IDE" after extraction and see how good the results are, or use a knowledge graph, or search the respective subreddit (e.g., "r/Logitech" for "Logitech G533") and check if the number of references surpasses the threshold, which is dynamic based on the size of the subreddit. Then discuss with the team which ones might be best to use (e.g., pros and cons of each one).
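The "threshold, which is dynamic based on the size of the subreddit" part could be prototyped with a simple scaling rule. The log-of-subscribers rule below is only an illustrative guess, and both function names are hypothetical:

```python
import math

def dynamic_mention_threshold(subscriber_count: int) -> int:
    """Hypothetical scaling rule: require roughly one mention per order of
    magnitude of subscribers, never fewer than two. Would need tuning."""
    return max(2, int(math.log10(max(subscriber_count, 10))))

def passes_subreddit_check(mention_count: int, subscriber_count: int) -> bool:
    """True if the references found in the product's subreddit exceed the
    size-dependent threshold."""
    return mention_count >= dynamic_mention_threshold(subscriber_count)
```

Under this rule a subreddit with ~500k subscribers would require 5 mentions while a very small one would require only 2; the real curve would have to be tuned against labeled examples.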

SwiftWinds commented 2 years ago

basic solution: google search with quotation marks around product and see if there is more than 1 mention on the Internet

This is a great idea! Let's talk with the team on Sunday about this idea and perhaps implement it soon.
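If we issue the quoted query through the Google Custom Search JSON API so it can run programmatically, the check could look roughly like the sketch below. The key / engine-ID constants are placeholders, and the `> 1` cutoff mirrors the "more than 1 mention" rule above:

```python
import requests

# Placeholders: a real API key and Custom Search Engine ID would be needed.
GOOGLE_API_KEY = "..."
GOOGLE_CSE_ID = "..."

def web_mention_count(product: str) -> int:
    """Query the Custom Search JSON API with the product name in quotation
    marks and return the reported number of results."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": GOOGLE_API_KEY, "cx": GOOGLE_CSE_ID, "q": f'"{product}"'},
        timeout=10,
    )
    resp.raise_for_status()
    # totalResults is returned as a string inside searchInformation.
    return int(resp.json()["searchInformation"]["totalResults"])

def passes_web_check(product: str) -> bool:
    """Basic solution: the quoted product name should show up more than once."""
    return web_mention_count(product) > 1
```

Note that each candidate costs one API call, which is exactly the latency/quota concern raised in point 3a, so this check probably shouldn't run on its own for long recommendation lists.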

Gopu2001 commented 2 years ago

Keep in mind that with each of these ideas, there are pros and cons. My belief is that, to decrease the number of false positives in our output model, we need to cross-reference using at least 2 of these ideas.

For our minimum viable product, we can simply use the CSE (Google Custom Search Engine) idea. To go a level beyond that, we should also check if the same product/recommendation has been recommended or mentioned elsewhere in the same comment thread/discussion. This MVP should suffice in the short term, but without a larger and more complex structure for our database and caching, our processing wait times could result in an early decline of the product.
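A minimal sketch of the "at least 2 of these ideas" rule, assuming the individual checks have already produced their raw counts; the signal list and cutoffs here are illustrative only:

```python
def is_valid_recommendation(web_hits: int, thread_mentions: int,
                            has_own_subreddit: bool = False) -> bool:
    """Cross-reference a candidate by requiring at least two independent
    signals to agree, to cut down on false positives."""
    signals = [
        web_hits > 1,          # quoted CSE search found more than one hit
        thread_mentions >= 2,  # mentioned more than once in the same thread
        has_own_subreddit,     # product has a dedicated subreddit
    ]
    return sum(signals) >= 2
```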

Gopu2001 commented 2 years ago

Thus far, all that has been implemented is point 3 of the checklist above. This is reflected in the changes under AnmolStuff. To be fully marked as completed, I feel that this part needs testing (i.e., someone else needs to ensure that I have not missed a test case, particularly for bad input). As such, I will label this issue with a "further testing needed" label.

Gopu2001 commented 2 years ago

Will leave this issue closed because the base code has been implemented.

There are some updates / upgrades that could be made to the code files (i.e., additional features), but I might only add them as needed or as time permits.