pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0
3.6k stars 963 forks

Investigate replacing search with MeiliSearch #8002

Open di opened 4 years ago

di commented 4 years ago

We made a working demo of this with the open source MeiliSearch:

https://pypi.meilisearch.com/ https://github.com/meilisearch/meilisearch-python

This could also be extended to support the more advanced features discussed in #727. If the warehouse devs are interested in using MeiliSearch we'd be happy to discuss it here or by email: erlend@meilisearch.com

Originally posted by @erlend-sh in https://github.com/pypa/warehouse/issues/3486#issuecomment-634756428

di commented 4 years ago

Hi @erlend-sh, thanks for sharing, that looks really slick and the results look really good! A few questions for you after a quick glance:

erlend-sh commented 4 years ago

Would we need to host MeiliSearch ourselves or is there a hosted option?

As of right now the only option is self-hosting, but we are indeed working on a hosted option. If you can email me we can discuss that further. One of my colleagues will come by soon to answer your other questions.

eskombro commented 4 years ago

Hi! I am the author of the MeiliSearch PyPI demo (and a member of the MeiliSearch dev team) and I am here to discuss any details, because this seems like a great and exciting idea 🎉

Regarding your questions:

How could MeiliSearch handle pagination & caching? One existing issue we have with search is #4006, where a given result may appear on multiple pages.

MeiliSearch provides both offset and limit parameters on a search request, which are consistent and would naturally solve #4006.

This is not used in the demo, but it can totally be used for pagination.
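A minimal sketch of how page numbers could map onto the offset and limit parameters mentioned above. The helper name `page_to_search_params` and the default page size of 20 are assumptions for illustration, not part of the demo or of Warehouse.

```python
def page_to_search_params(page, per_page=20):
    """Translate a 1-based page number into offset/limit search parameters.

    Hypothetical helper: the name and the default page size are assumptions.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    if per_page < 1:
        raise ValueError("per_page must be positive")
    return {"offset": (page - 1) * per_page, "limit": per_page}


# Page 3 with 20 results per page skips the first 40 hits.
params = page_to_search_params(3)
```

The resulting dictionary has the shape of the options a client would send alongside the query string.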

How would MeiliSearch handle our existing search filters?

Meilisearch provides filtering and will release faceted search really soon. We would need to re-index all the content from PyPI packages to take into account the filters you want to apply, but the functionality would be provided 'by default' by MeiliSearch: there is no need to implement anything new, only to make sure that the corresponding fields are present in the MeiliSearch index.

Is there a way to weight values of certain fields or otherwise ensure that exact matches on some fields are the top result? E.g. if I wanted to guarantee that the foo package would always be the top result for https://pypi.meilisearch.com/?q=foo, since the project name is an exact match for the query.

MeiliSearch uses Ranking rules which are totally customizable. I have played with these ranking rules to try to provide the most relevant results, but I didn't actually have in mind to do an exact match on the name of the package (it is present in the ranking rules, but with a very low relevance), because we wanted to provide keyword and concept search instead. For example, when you look up "machine learning" you get a package called "machine learning", which is far from being the ideal package to use; I wanted to see, for example, "tensorflow" in the first results. Same for "web framework": I want to see Django, Flask, etc.

Anyway, this is fully customizable and we can discuss which are the priorities and adapt the ranking rules pretty easily

MeiliSearch is also typo-tolerant, which needs to be taken into account for this point

How is MeiliSearch using download counts here? I.e. is it part of the search weighting, or is it just being displayed on top of the results after the fact?

For the PyPI demo we are getting the last month downloads from BigQuery (once a month, and we cache it), and we use it as a ranking rule. The thing is that MeiliSearch uses a bucket sort algorithm to apply the ranking rules, which caused a few problems when categorizing packages, because download numbers are very specific to each package. So instead of basing the results simply on download numbers, we created a "fame" field that assigns a score to each package based on downloads. For example, the 100 most downloaded packages from last month receive the highest fame score. This means that the bucket sort algorithm will consider those 100 packages as a single bucket and apply the rest of the rules inside it. As a result, a package with fewer downloads can still rank as more pertinent within its bucket, but a fame score of 9 will always beat a fame score of 8.

This is also customizable.
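To make the bucketing idea concrete, here is a rough sketch of how a "fame" score could be derived from download counts. Only the "top 100 share the highest score" rule and the 9-versus-8 example come from the description above; the equal-size buckets, the top score of 9, the floor of 0, and the function name are all assumptions.

```python
from typing import Dict


def fame_scores(downloads: Dict[str, int], bucket_size: int = 100,
                top_score: int = 9) -> Dict[str, int]:
    """Assign a coarse score by download rank (hypothetical scheme).

    The first `bucket_size` packages get `top_score`, the next
    `bucket_size` get `top_score - 1`, and so on, never going below 0.
    """
    # Sort by downloads descending; break ties by name so output is stable.
    ranked = sorted(downloads, key=lambda name: (-downloads[name], name))
    return {
        name: max(top_score - position // bucket_size, 0)
        for position, name in enumerate(ranked)
    }


# Tiny bucket size so the effect is visible with four packages.
example = fame_scores({"a": 500, "b": 300, "c": 300, "d": 10}, bucket_size=2)
```

Packages that share a score land in the same bucket, so the remaining ranking rules decide their relative order.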

How well does MeiliSearch degrade when JavaScript is not present?

I am not sure I understand your question :(

I hope this more or less answers your questions, but please do not hesitate to ask for any clarification or any new information you would like to have!

di commented 4 years ago

Sorry for the delay! Replies inline:

MeiliSearch provides both offset and limit parameters on a search request, which are consistent and would naturally solve #4006.

I'm not seeing how just the offset/limit parameters alone would solve #4006. The issue is that URLs with the same query but different offsets expire from the cache at different times, which may cause a project to appear more than once (or not at all).

Ideally we'd need a way to know all the projects that would be in the result for a given query, so we could purge all queries that contain a given project if that project changes.
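One way to picture the bookkeeping described here is a reverse index from project name to the cached result pages that contained it. This is only a sketch: the class, its method names, and the cache-key strings are hypothetical, and it is not how Warehouse's cache works today.

```python
from collections import defaultdict


class SearchCacheIndex:
    """Track which cached search pages contain which projects (hypothetical)."""

    def __init__(self):
        # project name -> set of cache keys whose results included it
        self._keys_by_project = defaultdict(set)

    def record(self, cache_key, project_names):
        """Remember that the page stored under `cache_key` listed these projects."""
        for name in project_names:
            self._keys_by_project[name].add(cache_key)

    def keys_to_purge(self, project_name):
        """Return, and forget, every cache key that listed the project."""
        return self._keys_by_project.pop(project_name, set())


index = SearchCacheIndex()
index.record("q=foo&page=1", ["foo", "foobar"])
index.record("q=foo&page=2", ["foo-utils", "foobar"])
stale = index.keys_to_purge("foobar")  # both pages of the "foo" query
```

Note the limitation: this only finds pages that already listed the project. A change that makes a project newly match a query would not be caught.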

That said, I don't think this is something that MeiliSearch needs to solve for us to adopt it, as it sounds like as-is this has feature parity with our current search provider. Just wanted to see if there was any potential solution for this problem.

Meilisearch provides filtering and will release faceted search really soon.

Sounds like faceted search is what we'd need here for classifiers.

MeiliSearch uses Ranking rules which are totally customizable. ... MeiliSearch is also typo-tolerant, which needs to be taken into account for this point

Right, I think the difference here is that the query "machine-learning" is different than a query for "machine learning". We could potentially just highlight an exact match ourselves if it exists (without using the search provider) but we'd then need a way to remove it from the search results (as it would probably also appear in the first page of results).
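A sketch of the "highlight it ourselves, then drop it from the results" idea. Everything here is an assumption for illustration: `lookup_project` stands in for whatever lookup returns a project's name (or `None`), and the normalization follows the PEP 503 style of collapsing `-`, `_` and `.`. Spaces are not collapsed, so "machine learning" stays distinct from "machine-learning".

```python
import re


def normalize(name):
    """Collapse runs of '-', '_' and '.' and lowercase, PEP 503 style."""
    return re.sub(r"[-_.]+", "-", name).lower()


def promote_exact_match(query, results, lookup_project):
    """Return (exact_match_or_None, results_without_the_exact_match).

    `lookup_project` is a hypothetical callable mapping a normalized
    name to a project name, or None when no such project exists.
    """
    exact = lookup_project(normalize(query))
    if exact is None:
        return None, list(results)
    wanted = normalize(exact)
    remaining = [name for name in results if normalize(name) != wanted]
    return exact, remaining


projects = {"foo": "foo", "machine-learning": "machine-learning"}
top, rest = promote_exact_match("Foo", ["foobar", "foo", "foo-utils"], projects.get)
```

The caller would render `top` separately above the list and paginate over `rest`, so the same project is not shown twice on the first page.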

For the PyPI demo we are getting the last month downloads from BigQuery (once a month, and we cache it), and we use it as a ranking rule.

Interesting, thanks for sharing. FYI, right now we compute the "zscore" to determine which packages are "trending", which we can order results by (as well as the ability to sort by date last updated).
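For readers unfamiliar with the term, a z-score in general measures how far a value sits from the mean of a series, in units of standard deviation. The sketch below shows that textbook definition only; the exact inputs and windowing PyPI uses for "trending" are not described in this thread, so the function and its arguments are assumptions.

```python
from statistics import mean, pstdev


def zscore(history, latest):
    """Textbook z-score of `latest` against a series of earlier counts.

    Returns 0.0 when there is too little history or no variation,
    which is an arbitrary choice made for this sketch.
    """
    if len(history) < 2:
        return 0.0
    spread = pstdev(history)
    if spread == 0:
        return 0.0
    return (latest - mean(history)) / spread


# A jump well above a steady baseline gives a large positive score.
score = zscore([8, 12, 8, 12], 20)
```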

I assume the download counts can be updated the same way as updating any other search metadata?

How well does MeiliSearch degrade when JavaScript is not present?

I am not sure I understand your question :(

Sorry! What I mean is: right now PyPI works pretty well if you disable JavaScript in your browser. We probably want this to continue to be true. I see that the demo uses JavaScript extensively to provide auto-updating searching, but it seems like if I disable JavaScript entirely, using the input as a regular search field (typing a query and submitting with 'enter') doesn't work. I feel like this might be just the nature of the demo though, so my question is whether JavaScript is a requirement or not.

di commented 4 years ago

Hi @erlend-sh, @eskombro, any updates here?

erlend-sh commented 4 years ago

I'm quite sure MeiliSearch can work fine without JS, but @eskombro will have to confirm.

Once that's cleared up, what are the next steps here? Could a fork of pypi.org be set up as a staging site? Like meilisearch.pypi.org. We can provide a DigitalOcean droplet for it.

eskombro commented 4 years ago

Sorry! What I mean is: right now PyPI works pretty well if you disable JavaScript in your browser. We probably want this to continue to be true. I see that the demo uses JavaScript extensively to provide auto-updating searching, but it seems like if I disable JavaScript entirely, using the input as a regular search field (typing a query and submitting with 'enter') doesn't work. I feel like this might be just the nature of the demo though, so my question is whether JavaScript is a requirement or not.

Indeed, the demo was done to show the idea of search-as-you-type experience, and this requires javascript enabled on the client. But as you point out, that is just the nature of the demo. It doesn't have a "results" page you can navigate to. Disabling JS shouldn't have any impact on the way MeiliSearch behaves, and this would be just a front-end matter we can solve easily with a different implementation.

I assume the download counts can be updated the same way as updating any other search metadata?

Exactly!

di commented 4 years ago

I think we were waiting on a few things to move forward:

Once that's cleared up, what are the next steps here? Could a fork of pypi.org be set up as a staging site?

I think next steps would be determining where we need to host Meilisearch, and then starting to develop this in a feature branch.

We could potentially spin up a staging site -- what would be the goal for that? We also have https://test.pypi.org/ which we could use, as long as everything Meilisearch-related is behind a feature flag.

erlend-sh commented 4 years ago

I think next steps would be determining where we need to host Meilisearch, and then starting to develop this in a feature branch.

Okay, great. We will set up a hosted instance of MeiliSearch shortly. Can you get that feature branch started and link it here?

di commented 4 years ago

Feature branch is here: https://github.com/pypa/warehouse/tree/meilisearch

eskombro commented 4 years ago

Thanks @di

I have already forked the project and set up the environment, and I'm exploring it a bit. I think I can start working on this next week. How would you like to proceed with this feature development?

di commented 4 years ago

Right now the codebase definitely assumes that we are using Elasticsearch, so first steps would be to create a generic ISearchService interface, and wrap everything related to Elasticsearch with an ElasticSearchService service. This would be similar to what we do for file storage:

https://github.com/pypa/warehouse/blob/249a4f9ec4ac8118ac17e206840600a08242a9af/warehouse/packaging/interfaces.py#L16-L34

https://github.com/pypa/warehouse/blob/249a4f9ec4ac8118ac17e206840600a08242a9af/warehouse/packaging/services.py#L109-L137

https://github.com/pypa/warehouse/blob/249a4f9ec4ac8118ac17e206840600a08242a9af/warehouse/packaging/services.py#L173-L205

This should allow us to configure which service we're using by changing a single environment variable, SEARCH_BACKEND, similar to this:

https://github.com/pypa/warehouse/blob/0c9ffd5ccb2171dd6141e5cf69409df3249ad805/warehouse/config.py#L203

https://github.com/pypa/warehouse/blob/557ca0ece02f8570be7da9de7c5c1cc713d96ff4/dev/environment#L25

Then, we would implement a MeiliSearchService service, which also implements the ISearchService interface, which we could then enable via an environment var.
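The shape of that plan, sketched with the standard library only. The names `ISearchService`, `ElasticSearchService`, `MeiliSearchService` and `SEARCH_BACKEND` come from the comment above; the `search` signature, the short backend strings, the factory function, and the use of `abc` to declare the interface are assumptions made to keep the example self-contained, and Warehouse's real interfaces and configuration may look different.

```python
import os
from abc import ABC, abstractmethod


class ISearchService(ABC):
    """Generic search interface (method signature is hypothetical)."""

    @abstractmethod
    def search(self, query, offset=0, limit=20):
        """Return a list of matching project names."""


class ElasticSearchService(ISearchService):
    def search(self, query, offset=0, limit=20):
        raise NotImplementedError("would wrap the existing Elasticsearch code")


class MeiliSearchService(ISearchService):
    def search(self, query, offset=0, limit=20):
        raise NotImplementedError("would query a MeiliSearch index")


# Hypothetical backend names; the real setting may use another format.
_BACKENDS = {
    "elasticsearch": ElasticSearchService,
    "meilisearch": MeiliSearchService,
}


def search_service_from_env(environ=None):
    """Pick the search backend from the SEARCH_BACKEND variable."""
    environ = os.environ if environ is None else environ
    name = environ.get("SEARCH_BACKEND", "elasticsearch")
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown SEARCH_BACKEND: {name!r}") from None
```

With this layout, view code depends only on `ISearchService`, and switching providers is a matter of changing one environment variable.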

We'd also need to update the development environment to be able to use MeiliSearch locally as well:

https://github.com/pypa/warehouse/blob/9019b5eb6901bde9d0a6bfcfc3ef25c3f926bc0b/docker-compose.yml#L26-L33

I'm assuming that there's a docker image we'd be able to use for this?

amirouche commented 3 years ago

any updates on this?

di commented 2 years ago

An update here: MeiliSearch set us up with a demo of their hosted service and have offered to provide it to us as an in-kind donation, so I think it makes sense to move forward with steps in https://github.com/pypa/warehouse/issues/8002#issuecomment-667325333 so we can try it out for PyPI.