indolering opened this issue 5 years ago
FWIW, it would be easier to just scrape Google search results of various static sites instead of manually ranking them. I strongly suspect that this will still just become a zombie ticket :P
In the past I have looked into creating something like this. I initially looked at using the Cranfield dataset, as it is probably closer to the index size that something like Lunr is commonly used with. I think I ran into problems translating the description of each query into something that Lunr would understand.
The dataset you linked to seems interesting, though perhaps larger than the typical index size for Lunr. The other thing to keep in mind is that search relevancy isn't exact, and these datasets would only give an indication of results for one kind of data set / use case. If I've learnt anything over the years of developing and maintaining Lunr, it's that it is used in many varied ways!
I'm more than happy to help, though, if you are interested in taking something like this on. It would also be interesting to benchmark different search libraries against each other.
Spent more time on this: there are datasets from Yahoo and Bing for Learning to Rank competitions. We can use a random subset to train the SVM and go from there. I've decided to turn this into my lunchtime distraction.
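For anyone following along, the usual way to train an SVM on this kind of data is the pairwise (RankSVM) transform: within each query, every "doc i judged more relevant than doc j" pair becomes a classification example on the feature difference. This is just a toy sketch with made-up features and a crude sub-gradient loop standing in for a real SVM solver, not the actual Yahoo/Bing data format:

```python
import numpy as np

def pairwise_transform(X, y, qid):
    """Within each query, emit (x_i - x_j, +1) when doc i is judged more
    relevant than doc j, plus the mirrored pair labelled -1."""
    Xp, yp = [], []
    for q in np.unique(qid):
        idx = np.where(qid == q)[0]
        for i in idx:
            for j in idx:
                if y[i] > y[j]:
                    Xp.append(X[i] - X[j]); yp.append(1)
                    Xp.append(X[j] - X[i]); yp.append(-1)
    return np.array(Xp), np.array(yp)

# Toy features: [text score, title match] for 4 docs across 2 queries.
X = np.array([[0.9, 1.0], [0.2, 0.0], [0.7, 0.0], [0.1, 1.0]])
y = np.array([2, 0, 1, 0])          # graded relevance judgments
qid = np.array([1, 1, 2, 2])

Xp, yp = pairwise_transform(X, y, qid)

# A few epochs of hinge-loss sub-gradient descent; a real setup would
# use a proper SVM solver on the transformed pairs.
w = np.zeros(X.shape[1])
for _ in range(100):
    for xi, yi in zip(Xp, yp):
        if yi * w.dot(xi) < 1:
            w += 0.1 * yi * xi

# Within each query, the more relevant doc should now score higher.
scores = X.dot(w)
print(scores[0] > scores[1], scores[2] > scores[3])
```

The learned weight vector then scores unseen documents with a plain dot product, which is cheap enough to evaluate at query time.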
Spent a lot more time on this and found some datasets that are more suitable to the task. However, it looks like the scoring is not normalized, and the min/max values depend on the exact mix of fields being used. Is this correct?
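If that's the case, one workaround when comparing runs is to min-max normalize the raw scores per query so they land in [0, 1] regardless of which fields contributed. A minimal sketch (plain Python, not tied to any particular dataset format):

```python
def normalize(scores):
    """Min-max scale one query's raw scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All docs scored identically; treat them as equally good.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

print(normalize([4.0, 1.0, 0.0]))  # → [1.0, 0.25, 0.0]
```

Note this only makes scores comparable within a query; it doesn't make absolute scores comparable across queries or across different field configurations.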
It would be ideal if there was some sort of integration test that would help gauge changes to the search engine algorithm (such as PageRank boosting, text in `<h1>` tags, etc.). The only non-trivial, freely available judgement scores I could find are for a medical database (linked to from this article).
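A common way to turn judgement scores like these into an integration test is to compute NDCG over the ranked results and assert it doesn't regress after an algorithm change. A rough, dataset-agnostic sketch:

```python
import math

def dcg(rels):
    """Discounted cumulative gain for a ranked list of graded judgments."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(ranked_rels, k=None):
    """NDCG: DCG of the actual ranking divided by the ideal ranking's DCG."""
    rels = ranked_rels[:k] if k else ranked_rels
    ideal = sorted(ranked_rels, reverse=True)[:len(rels)]
    best = dcg(ideal)
    return dcg(rels) / best if best > 0 else 0.0

print(ndcg([3, 2, 1, 0]))  # perfect ordering → 1.0
print(ndcg([3, 2, 0, 1]))  # slightly wrong tail → just under 1.0
```

The integration test would then look like: run the benchmark queries against the index, map each result list to its judgment grades, and fail the build if mean NDCG drops below the previous baseline.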