pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Search does not prioritize exact match when package name contains numbers #2877

Open di opened 6 years ago

di commented 6 years ago

For example, a search for python-miio returns python2-miio as the first result, likely because the 2 is being tokenized out of the query.

gsb-eng commented 6 years ago

@di Any progress on this? If not, can I pick this up?

brainwane commented 6 years ago

@gsb-eng Yes, please go ahead and pick it up! At least within Warehouse, our rule is: If there is not a corresponding pull request for this issue (visibly linked by issue number/PR number), and it's not marked as assigned to anyone, it is up for grabs. So in cases like this, you don't even need to ask. :)

As always, if you have questions along the way as you work on this, please feel free to ask them here, in #pypa-dev on Freenode, or the pypa-dev mailing list. Thank you!

waseem18 commented 6 years ago

@di Could you please give me more information about this issue?

When I searched for python2-miio, the results in my local environment were completely different from, and less relevant than, those on pypi.org.

di commented 6 years ago

@waseem18 Unfortunately I don't know much more than you here. We got a report that packages with a number in them could be prioritized higher in search than a package with the exact name as the search query.

However, looking at https://pypi.org/search/?q=python-miio now, it seems like python-miio has the highest rank. We will likely need to do some debugging of the Elasticsearch query and its results to see why this was (at some point) possible.

waseem18 commented 6 years ago

Yeah - I was thinking the same, because the search queries now seem to work as expected.

Will do some debugging and check.

di commented 6 years ago

Here's another current example of this happening:

https://pypi.org/search/?q=psycopg2

waseem18 commented 6 years ago

@di I presume the issue is caused by the lowercase tokenizer used for normalized_name, which is splitting psycopg2 at the digit 2.

As per the Elasticsearch documentation,

lowercase tokenizer breaks text into terms whenever it encounters a character which is not a letter
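
To make the quoted behaviour concrete, here is a minimal pure-Python sketch. It only approximates a letters-only, lowercasing tokenizer with a regular expression; it is not Elasticsearch's implementation, and the function name lowercase_tokenize is made up for illustration. Under this approximation the digit is dropped rather than kept as a token, so python-miio and python2-miio end up with identical tokens.

```python
import re


def lowercase_tokenize(text):
    """Approximate a tokenizer that lowercases and splits on non-letters."""
    # [^\W\d_]+ matches runs of letters only: digits, "-", "_" and "."
    # all act as separators and are discarded.
    return re.findall(r"[^\W\d_]+", text.lower())


for name in ("psycopg2", "python-miio", "python2-miio"):
    print(name, "->", lowercase_tokenize(name))
```

If the real tokenizer behaves like this sketch, a query for either miio package cannot tell the two names apart on this field alone, which would explain the ranking reported above.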

di commented 6 years ago

Sounds like the same thing would happen with package names containing other valid non-alphanumeric characters like -, _, and .. Perhaps we need to tweak that tokenizer a bit.

waseem18 commented 6 years ago

Yeah - We should decide on which tokenizer to use. What about whitespace tokenizer? I'll check search results accuracy with whitespace tokenizer vs lowercase tokenizer.

di commented 6 years ago

@honzakral is able to help with this issue

honzakral commented 6 years ago

I would vote for using multiple tokenizers instead and then querying across all of them in a query. Trying to find "the perfect analyzer" is a tricky business and often leads down many rabbit holes.

Instead, supplement the analyzer set by adding one that doesn't ignore numbers. Running a query against both analyzers will then prioritize the proper package (with 2 in the name), since it will match on both and not just the existing one. In addition, I would also recommend we have a keyword field for the name to be able to prioritize exact matches.

To add multiple analyzers just use Text(analyzer="analyzer1", fields={'analyzer2': Text(analyzer="analyzer2", ...)})
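
As a rough illustration of why querying across several analyzers helps, here is a toy pure-Python sketch. It does not use Elasticsearch at all: the two analyzer functions, the scoring weights, and the names score and rank are all assumptions made up for this example, and the exact-name bonus merely stands in for a keyword field.

```python
import re


def letters_only(text):
    # Approximates the existing analyzer: lowercase, split on non-letters.
    return re.findall(r"[^\W\d_]+", text.lower())


def letters_and_digits(text):
    # Hypothetical second analyzer: lowercase, but keep digits in tokens.
    return re.findall(r"[^\W_]+", text.lower())


def score(query, name):
    """Toy relevance: one point per analyzer whose tokens agree, plus a
    bonus when the raw name equals the query (keyword-field stand-in)."""
    points = 0
    for analyze in (letters_only, letters_and_digits):
        if analyze(query) == analyze(name):
            points += 1
    if query.lower() == name.lower():
        points += 2
    return points


def rank(query, names):
    # Highest score first; sorted() is stable, so ties keep input order.
    return sorted(names, key=lambda n: score(query, n), reverse=True)


print(rank("psycopg2", ["psycopg", "psycopg2", "psycopg2-binary"]))
```

In this toy model psycopg2 matches on both analyzers and on the exact name, while psycopg matches only on the letters-only analyzer, so the exact package sorts first. Real Elasticsearch scoring is more involved, so treat this only as an intuition for the approach.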

I would be happy to help or answer any additional questions too.

selotape commented 6 years ago

Hey, is this STILL up for grabs? I'm a warehouse newbie with ES know-how.

waseem18 commented 6 years ago

Yes, it is @selotape

You can start looking into this issue. Feel free to ping here if you need help with anything.

selotape commented 6 years ago

Another example - https://pypi.org/search/?q=hachoir3

selotape commented 6 years ago

Hey,

I created a PR with a small fix and a description of my (incomplete) status. @waseem18, can we take the conversation there?

https://github.com/pypa/warehouse/pull/4519

yeraydiazdiaz commented 5 years ago

Seems this issue is not happening anymore: searching for python-miio or psycopg2 yields the correct package as the first hit.

di commented 5 years ago

Seems like this is still an issue:

I look up 'rfc6266' because I was told about the module and I want to know more about it. Said module appears sixth even though its name exactly matches my query. I look at the sorting criteria and see it’s supposed to be by relevance. I end up wondering how that is not the most relevant result.

mgorny commented 2 years ago

The same problem applies to searching for pep517: https://pypi.org/search/?q=pep517. The correct result doesn't appear until the second half of the first page.

aphedges commented 2 years ago

I encountered this myself recently, and I found multiple instances where it is broken:

However, search is not always wrong. Both https://pypi.org/search/?q=beautifulsoup4 and https://pypi.org/search/?q=flake8 work as expected.

takluyver commented 6 months ago

More examples:

di commented 6 months ago

Hi folks, I'm locking discussion on this issue -- while I appreciate the intention behind the replies, we don't need more examples of this happening as there are likely thousands, and there's nothing we can do to resolve specific cases.