Open di opened 6 years ago
@di Any progress on this? If not, can I pick this up?
@gsb-eng Yes, please go ahead and pick it up! At least within Warehouse, our rule is: If there is not a corresponding pull request for this issue (visibly linked by issue number/PR number), and it's not marked as assigned to anyone, it is up for grabs. So in cases like this, you don't even need to ask. :)
As always, if you have questions along the way as you work on this, please feel free to ask them here, in #pypa-dev on Freenode, or the pypa-dev mailing list. Thank you!
@di Could you please give me more information about this issue?
When I searched for `python2-miio`, the results on my local environment were completely different from, and less relevant than, those on pypi.org.
@waseem18 Unfortunately I don't know much more than you here. We got a report that packages with a number in them could be prioritized higher in search than a package with the exact name as the search query.
However, looking at https://pypi.org/search/?q=python-miio, it seems like `python-miio` has the highest rank now. We will likely need to do some debugging of the Elasticsearch query and the results to see why it (at some point) was possible.
Yeah - I was feeling the same, because the search queries now seem to work as expected.
Will do some debugging and check.
Here's another current example of this happening:
@di I presume the issue is caused by the `lowercase` filter used in `normalized_name`, which tokenizes `psycopg2` into `psycopg` and `2`.
As per the Elasticsearch documentation, the `lowercase` tokenizer breaks text into terms whenever it encounters a character which is not a letter.
Sounds like the same thing would happen with package names containing other valid non-alphanumeric characters like `-`, `_`, and `.`. Perhaps we need to tweak that tokenizer a bit.
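To make the behavior discussed above concrete, here is a minimal sketch in plain Python that approximates a tokenizer which lowercases text and splits on every non-letter. It is not Elasticsearch itself, and the function name `lowercase_tokenize` and the sample names are illustrative assumptions. Under this approximation digits and separators are simply dropped, which is why two different names can collapse into the same tokens.

```python
import re


def lowercase_tokenize(text):
    """Approximate a tokenizer that lowercases and splits on non-letters.

    Assumption: this regex stands in for the real tokenizer; it keeps runs
    of letters only, so digits, '-', '_' and '.' act as separators.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


# Illustrative package names; any name with digits or separators behaves alike.
for name in ["psycopg2", "python2-miio", "python-miio", "zope.interface"]:
    print(name, "->", lowercase_tokenize(name))

# Both names reduce to the same tokens, so a token match cannot tell them apart.
print(lowercase_tokenize("python2-miio") == lowercase_tokenize("python-miio"))
```

With token lists this similar, the ranking between `python-miio` and `python2-miio` would have to come from something other than the name tokens.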
Yeah - we should decide which tokenizer to use. What about the `whitespace` tokenizer? I'll compare search result accuracy with the `whitespace` tokenizer vs. the `lowercase` tokenizer.
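As a rough way to compare the two options, the sketch below simulates both tokenizers in plain Python and checks whether a query shares any token with a package name. The helper names (`whitespace_tokenize`, `matches`) are hypothetical, and the whitespace version assumes no lowercasing step is added on top of it.

```python
import re


def lowercase_tokenize(text):
    # Approximation: lowercase, then keep runs of letters only.
    return re.findall(r"[^\W\d_]+", text.lower())


def whitespace_tokenize(text):
    # Approximation: split on whitespace only, with no lowercasing.
    return text.split()


def matches(query, name, tokenize):
    """Return True if the query and the name share at least one token."""
    return bool(set(tokenize(query)) & set(tokenize(name)))


# The letter-based tokenizer matches the wrong package for this query...
print(matches("python-miio", "python2-miio", lowercase_tokenize))
# ...while whitespace splitting keeps the two names apart.
print(matches("python-miio", "python2-miio", whitespace_tokenize))
# The trade-off: partial-name and differently-cased queries stop matching.
print(matches("miio", "python-miio", whitespace_tokenize))
print(matches("Jinja2", "jinja2", whitespace_tokenize))
```

The whitespace approach fixes the digit problem in this sketch but gives up partial-name matching, so neither tokenizer is clearly better on its own.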
@honzakral is able to help with this issue
I would vote for using multiple tokenizers instead and then querying across all of them in a query. Trying to find "the perfect analyzer" is a tricky business and often leads down many rabbit holes.
Instead, supplement the analyzer set by adding one that doesn't ignore the numbers. Running a query against both analyzers will then prioritize the proper package (with 2 in the name), as that one will match on both and not just the existing one. In addition, I would also recommend we have a `keyword` field for the name to be able to prioritize exact matches.
To add multiple analyzers just use `Text(analyzer="analyzer1", fields={'analyzer2': Text(analyzer="analyzer2", ...)})`
I would be happy to help or answer any additional questions too.
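The idea of querying several analyzers at once, plus an exact-match boost, can be sketched in plain Python as follows. This is a toy scoring model, not Elasticsearch's relevance scoring: the analyzer functions, the `EXACT_MATCH_BONUS` value, and the candidate lists are all assumptions chosen for illustration.

```python
import re


def letters_only(text):
    # Stand-in for the existing analyzer: digits and separators are dropped.
    return set(re.findall(r"[^\W\d_]+", text.lower()))


def letters_and_digits(text):
    # Stand-in for an added analyzer that keeps digits inside tokens.
    return set(re.findall(r"[^\W_]+", text.lower()))


ANALYZERS = (letters_only, letters_and_digits)
EXACT_MATCH_BONUS = 10  # arbitrary value, plays the role of a keyword-field boost


def score(query, name):
    """Sum token overlaps across all analyzers, plus a bonus for an exact name."""
    total = sum(len(analyze(query) & analyze(name)) for analyze in ANALYZERS)
    if query.lower() == name.lower():
        total += EXACT_MATCH_BONUS
    return total


def rank(query, names):
    """Order candidate names from highest to lowest score."""
    return sorted(names, key=lambda name: score(query, name), reverse=True)


print(rank("python-miio", ["python2-miio", "python-miio"]))
print(rank("psycopg2", ["psycopg", "psycopg2-binary", "psycopg2"]))
```

In this sketch a name that matches under both analyzers outscores one that matches under only the letters-only analyzer, and the exact-match bonus puts the identically named package first.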
Hey, is this STILL up for grabs? I'm a warehouse newbie with ES know-how.
Yes, it is @selotape
You can start looking into this issue. Feel free to ping here if you need help with anything.
Another example - https://pypi.org/search/?q=hachoir3
Hey,
I created a PR with a small fix and a description of my (incomplete) status. @waseem18, can we take the conversation there?
Seems this issue is not happening anymore: searching for `python-miio` or `psycopg2` yields the correct packages as first hits on the search.
Seems like this is still an issue:
I look up 'rfc6266' because I was told about the module and I want to know more about it. Said module appears sixth even though its name matches exactly my query. I look at the sorting criteria and see it’s supposed to be by relevance. I end up wondering how that is not the most relevant result.
The same problem applies to searching for `pep517`: https://pypi.org/search/?q=pep517. The correct result doesn't appear until the second half of the first page.
I encountered this myself recently, and I found multiple instances where it is broken:
- `Jinja` first and `Jinja2` as second
- `chardet2` first and `chardet` second
- `mypy1989` first and `mypy` second
- `pandas2` first and `pandas` second

However, search is not always wrong. Both https://pypi.org/search/?q=beautifulsoup4 and https://pypi.org/search/?q=flake8 work as expected.
Hi folks, I'm locking discussion on this issue -- while I appreciate the intention behind the replies, we don't need more examples of this happening as there are likely thousands, and there's nothing we can do to resolve specific cases.
For example, a search for `python-miio` has the `python2-miio` as the first result, likely because the `2` is being tokenized out of the query.