versioneye / versioneye

VersionEye.com
https://www.versioneye.com

Improve Search #339

Closed rmetzler closed 10 years ago

rmetzler commented 10 years ago

I tried to search for some CocoaPod specs I know exist.

My problem is, search seems to work like this:

Product.find( { 
    :language => params[:language],
    :prod_key => params[:query] } )

So if I want to find ABTest and try the keywords ab, test, or ab-test, I don't find anything. The query actually has to be abtest.

The same goes for 500px-ios-api.

reiz commented 10 years ago

Yes, I realised that already. The search currently runs on ElasticSearch, and there is room for improvement. But it's not the most urgent task right now. CocoaPods, Bower and BitBucket are more urgent.

rmetzler commented 10 years ago

The problem I see is that we are telling everyone we support CocoaPods, and then they can't actually find a CocoaPods package. That's why I brought it up.

reiz commented 10 years ago

I see the problem. I did some research and I now understand why we get the search results we currently have. But I still don't know how to change it. The problem is that ElasticSearch is too smart. The ElasticSearch analyzers are looking for words. They even have dictionaries for different languages to recognise words and word families. That is very useful for full-text search. If you search a full text for "blueberry", it will match "blueberry" and "blueberries" because they are semantically the same. But it will not match "blue" because that has a different meaning. For our package name search this is kind of over-engineered. I guess we need a simpler matching.

rmetzler commented 10 years ago

thanks for explaining.

I'll put that in the back of my head and maybe I'll come up with a solution.

Is there another use case where we use ElasticSearch?

reiz commented 10 years ago

We use ElasticSearch mainly for the package search. In addition, the username autocomplete in a project's "collaborators" tab works via ES. I am currently digging deeper into ES. I will write some tests and play around with ngrams in ES. Maybe I can fix the search today.

rmetzler commented 10 years ago

:+1:

timgluz commented 10 years ago

Well, for Richard's issue there's a quick fix: just add an asterisk to the search query, as autocompletion does: ab -> ab* (example: http://www.versioneye.com/search?lang=%2C&q=500*&g=)

But the correct solution is to use multiple queries and add weights to them: one exact match, another partial match, and another with the greedy * selector. Tuning tokenizers is only a partial solution: it fixes one problem and adds a new one.
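A minimal sketch of what such a weighted, combined query could look like in the ElasticSearch query DSL, built as a Ruby hash. The field name `name`, the helper `combined_name_query`, and the boost values are assumptions for illustration and are not taken from the VersionEye code base. The `term` clause only behaves as an exact match if the field is indexed lowercased or not analyzed.

```ruby
require 'json'

# Build a single bool query whose `should` clauses are scored together.
# Field name and boost values are hypothetical, chosen only for illustration.
def combined_name_query(term, field: 'name')
  q = term.to_s.strip.downcase
  {
    query: {
      bool: {
        should: [
          # Exact match on the whole name gets the highest weight.
          { term:     { field => { value: q, boost: 3.0 } } },
          # Analyzed (partial) match gets a medium weight.
          { match:    { field => { query: q, boost: 2.0 } } },
          # Greedy "*" prefix match gets the lowest weight.
          { wildcard: { field => { value: "#{q}*", boost: 1.0 } } }
        ],
        minimum_should_match: 1
      }
    }
  }
end

puts JSON.pretty_generate(combined_name_query('ab'))
```

Because all three clauses are evaluated in one request, a document that matches exactly collects the boosts of every clause it satisfies, while a document that only matches the wildcard still shows up lower in the list.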

rmetzler commented 10 years ago

If there is nothing found then we should add the greedy operator to all keywords and search again.
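The fallback idea can be sketched as follows. The `NAMES` list and the `search` helper are a hypothetical in-memory stand-in for the real search backend; only the control flow in `search_with_fallback` is the point here.

```ruby
# Hypothetical in-memory stand-in for the search backend.
NAMES = %w[abtest 500px-ios-api hibernate spring-hibernate].freeze

# Match a query against the names; a trailing "*" means "starts with".
def search(query)
  q = query.downcase
  if q.end_with?('*')
    prefix = q.chomp('*')
    NAMES.select { |n| n.start_with?(prefix) }
  else
    NAMES.select { |n| n == q }
  end
end

# Try the plain keywords first. Only if nothing is found,
# append the greedy operator to every keyword and search again.
def search_with_fallback(query)
  keywords = query.split
  exact = keywords.flat_map { |k| search(k) }.uniq
  return exact unless exact.empty?

  keywords.flat_map { |k| search("#{k}*") }.uniq
end

p search_with_fallback('hibernate') # => ["hibernate"]
p search_with_fallback('ab')        # => ["abtest"]
```

With this shape, "hibernate" never goes through the wildcard path, so an exact hit is not pushed down by greedy matches such as "spring-hibernate".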


Also, I noticed that the feedback dialog only shows up if nothing is found. There might be cases where 1 or 2 libraries match the keywords but are not what you are looking for.

reiz commented 10 years ago

This "*" doesn't help. If I add it as the default to all queries, then the search results become even worse, because then somehow "Hibernate" is no longer the perfect match for "Hibernate". Then the first result is "spring-hibernate".

reiz commented 10 years ago

Somehow ElasticSearch should be easier to understand and to configure.

timgluz commented 10 years ago

That's why combined queries are built into ElasticSearch: you don't need a fallback function that checks the size of the results.

reiz commented 10 years ago

I will check out combined queries. But with ngrams I don't need to check the size of the results either. It's just a configuration of how a single word should be separated into smaller character groups.

rmetzler commented 10 years ago

> This "*" doesn't help. If I add it as the default to all queries, then the search results become even worse, because then somehow "Hibernate" is no longer the perfect match for "Hibernate". Then the first result is "spring-hibernate".

That's why I suggested using it as a fallback instead of as the default.

reiz commented 10 years ago

It now works as expected!

I configured 2 new analyzers. The trick is that we are using a different analyzer for indexing than for searching. At index time the "ngram_name" analyzer is doing the job. It creates many ngrams for each product name. For "Hibernate" it creates for example:

hib
hibe
hiber
hibern

and so on. If we search now for "Hiberna" it will have a perfect match and we get the prod_key from "Hibernate" back.
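To illustrate the index-time versus search-time split, here is a small pure-Ruby simulation. It is a sketch, not the actual analyzer configuration: it assumes the ngrams are prefixes of at least 3 characters (as in the list above) and that the search side only lowercases the query. The helper names `edge_ngrams`, `build_index`, and `lookup` are hypothetical.

```ruby
# Generate lowercased prefix ngrams of a name.
# min_gram of 3 is an assumption based on the example list.
def edge_ngrams(name, min_gram: 3)
  word = name.downcase
  return [] if word.length < min_gram

  (min_gram..word.length).map { |len| word[0, len] }
end

# Index time: map every ngram to the product names that produced it.
def build_index(names)
  index = Hash.new { |h, k| h[k] = [] }
  names.each do |name|
    edge_ngrams(name).each { |gram| index[gram] << name }
  end
  index
end

# Search time: the query is only lowercased, not split into ngrams,
# so a partial input is looked up as one exact token.
def lookup(index, query)
  index.fetch(query.downcase, [])
end

index = build_index(%w[Hibernate spring-hibernate])
p edge_ngrams('Hibernate').first(4) # => ["hib", "hibe", "hiber", "hibern"]
p lookup(index, 'Hiberna')          # => ["Hibernate"]
```

The partial query matches exactly because the work of splitting was already done when the name was indexed, which is why no wildcard or result-size check is needed at query time.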

Now I need to recreate the whole index on production.

reiz commented 10 years ago

The last commit boosts on follower numbers. If you now search for "spring" you will get "spring-core" as the first result because it has many followers. The 2nd result will be "spring" because it's a perfect match but doesn't have any followers.
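A rough sketch of how a follower boost can reorder results. The formula, the text scores, and the follower counts below are all made up for illustration; the actual scoring in the commit is not shown in this thread.

```ruby
# Hypothetical scoring: multiply the text score by a factor
# that grows slowly (logarithmically) with the follower count.
def boosted_score(text_score, followers)
  text_score * (1.0 + Math.log10(1 + followers))
end

Candidate = Struct.new(:name, :text_score, :followers)

# Sort candidates by boosted score, highest first, and return their names.
def rank(candidates)
  candidates.sort_by { |c| -boosted_score(c.text_score, c.followers) }.map(&:name)
end

results = rank([
  Candidate.new('spring', 1.0, 0),       # perfect match, no followers
  Candidate.new('spring-core', 0.6, 50)  # partial match, many followers
])
p results # => ["spring-core", "spring"]
```

The same shape would work with any other popularity signal in place of followers; how strongly it should outweigh an exact name match is a tuning decision.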

rmetzler commented 10 years ago

is it possible to order by used_count or similar?

reiz commented 10 years ago

The idea behind a search engine is that you don't use order by. The score is based on similarity. We can think about how to use used_count to improve the search results. But not this week. That will not bring us more users. We have to focus on growth hacking!

rmetzler commented 10 years ago

I just wanted to say: if you search for spring and there is a project that 5 artifacts reference (currently number 1) and a project that more than 5.000 artifacts reference (currently number 4 in the search results), then the one with 5.000 should be first.

http://www.versioneye.com/search?q=spring

User growth = improving user experience ;-)

reiz commented 10 years ago

Yes, I know what you mean. But even if we have a better search than Google, if nobody knows about it we will have to shut down in 4.5 months. The product is good enough! We have to be more active on Twitter, HN & Reddit. We need more users!