minimalparts / PeARS

Archive repository for the PeARS project. Please head over to https://github.com/PeARSearch/PeARS-orchard for the latest version.

Make 'findBestPears' work faster #14

Open stultus opened 9 years ago

stultus commented 9 years ago

I have a suggestion.
Right now the profile.txt of each pear contains a 'name', 'message' and 'pear id'. I propose adding a new field called 'version' and increasing its value whenever the pear is updated (i.e. whenever changes are made to name/message/pear_id/doc.dists/urls/wordcloud). While searching, we cache the entire data in the local db (app.db). When the next search is made, we check the version of the remote pear: if a cached copy with the same version is already in our db, we use the data from the db; if the version is different, we fetch the new data from the remote pear and update it in the local db.

What do you think?
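A rough sketch of what I mean (the table and field names here are just illustrative; I'm assuming the cache lives in a small table inside app.db):

```python
# Rough sketch (hypothetical names): reuse the local copy of a pear's data
# unless its remote 'version' has moved on.
import json
import sqlite3

def get_pear_data(profile, fetch_remote_data, db_path="app.db"):
    """profile: the parsed remote profile.txt (with the new 'version' field);
    fetch_remote_data: callback that downloads the pear's full data."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS pear_cache
                    (pear_id TEXT PRIMARY KEY, version INTEGER, data TEXT)""")

    pear_id, remote_version = profile["pear_id"], profile["version"]
    row = conn.execute("SELECT version, data FROM pear_cache WHERE pear_id = ?",
                       (pear_id,)).fetchone()
    if row is not None and row[0] == remote_version:
        conn.close()
        return json.loads(row[1])        # versions match: use the cached copy

    data = fetch_remote_data(pear_id)    # stale or missing: fetch from the pear
    conn.execute("INSERT OR REPLACE INTO pear_cache VALUES (?, ?, ?)",
                 (pear_id, remote_version, json.dumps(data)))
    conn.commit()
    conn.close()
    return data
```

The point is that the full pear data only crosses the network when the version number has changed; the cheap profile.txt request is all we need for the check.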

minimalparts commented 9 years ago

Yes! That sounds like a great idea!!

nandajavarma commented 9 years ago

If we cache all of the data in the local db each time, then after, say, 1000 searches there is a real possibility of the local db becoming huge, right? Is it good practice to keep a cached store of such data, especially for a search engine?

minimalparts commented 9 years ago

I can't comment on caching problems, but I think there is an empirical issue related to user behaviour: how wide is a person's 'search space'? (i.e. are they obsessed with one or two topics, and can get on with mostly querying the same pears, or do they need many different nodes to query?) I don't have a good intuition for this at the minute. Obviously, it will also depend on what/how much people put on their pears. A related issue, I think, is the 'ideal' configuration of the overall network. How many nodes and how many pages per node.

stultus commented 9 years ago

I'm not asking to cache all the search queries and their results. I'm asking to cache the details from the pears, like profile.txt, wordcloud, urls, etc. Right now we send network requests to fetch these details whenever the user issues a query, so caching them won't make the db huge even after 1000 searches, because most of those searches will hit the same pears.

But yeah, if the db size does grow too much, we can always limit the cache size to 100 or so, i.e. store the 100 most recently used pears (100 is an arbitrary number off the top of my head; we can find the optimum after some observation). When we fetch data from a 101st pear, we remove the least recently used one (or something like that).
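Something along these lines, assuming the cached table from the earlier sketch gains a 'last_used' column (the names and the 100 limit are just placeholders):

```python
# Rough sketch: cap the cache at MAX_CACHED_PEARS entries by dropping the
# pears that have gone longest without being used.
import sqlite3
import time

MAX_CACHED_PEARS = 100   # placeholder; tune after some observation

def touch_and_trim(pear_id, db_path="app.db"):
    conn = sqlite3.connect(db_path)
    # remember when this pear was last used
    conn.execute("UPDATE pear_cache SET last_used = ? WHERE pear_id = ?",
                 (time.time(), pear_id))
    (count,) = conn.execute("SELECT COUNT(*) FROM pear_cache").fetchone()
    if count > MAX_CACHED_PEARS:
        # evict the least recently used pears first
        conn.execute("""DELETE FROM pear_cache WHERE pear_id IN
                          (SELECT pear_id FROM pear_cache
                           ORDER BY last_used ASC LIMIT ?)""",
                     (count - MAX_CACHED_PEARS,))
    conn.commit()
    conn.close()
```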

nandajavarma commented 9 years ago

Cool! That could work. We just need an efficient algorithm to clean up the cache. :+1:

minimalparts commented 9 years ago

How about we only cache a pear on the second visit? Or re-rank the list by number of visits, so that we're sure the most visited ones are always there?
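Something like this, purely as a sketch (all names are made up): count visits per pear, only cache a pear from its second visit onwards, and evict the least-visited one when the cache is full.

```python
# Rough sketch: cache a pear's data only once it has been visited twice,
# and keep the most-visited pears when the cache is full.
from collections import Counter

MAX_CACHED_PEARS = 100     # placeholder limit, as discussed above
visit_counts = Counter()   # pear_id -> number of times it was queried
pear_cache = {}            # pear_id -> cached pear data

def maybe_cache(pear_id, data):
    visit_counts[pear_id] += 1
    if visit_counts[pear_id] < 2:
        return                 # first visit: don't cache yet
    pear_cache[pear_id] = data
    if len(pear_cache) > MAX_CACHED_PEARS:
        # evict the cached pear with the fewest visits
        least_visited = min(pear_cache, key=lambda p: visit_counts[p])
        del pear_cache[least_visited]
```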

stultus commented 9 years ago

Yeah, re-ranking sounds good. I'll try to implement it that way.