monperrus / crawler-user-agents

Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders. pull-request welcome :star:
MIT License
1.19k stars 254 forks

Improve Python usage harness #368

Closed jribbens closed 2 months ago

jribbens commented 2 months ago

The current Python usage harness:

This patch fixes all these issues. In some simple tests, the updated is_crawler() function is well over a hundred times faster than the current version.
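For readers wondering where a speedup like this can come from: the thread does not show the patch itself, so the following is only a minimal sketch of one common technique, joining all patterns into a single regex that is compiled once at import time. The names SAMPLE_PATTERNS, is_crawler_slow and _COMBINED are hypothetical, and the three patterns are an illustrative sample rather than the project's data file.

```python
import re

# Hypothetical sample. The real project ships a much longer pattern list;
# these three entries are only for illustration.
SAMPLE_PATTERNS = [r"Googlebot\/", r"bingbot", r"Slurp"]


def is_crawler_slow(user_agent: str) -> bool:
    """Naive approach: try every pattern one at a time."""
    for pattern in SAMPLE_PATTERNS:
        if re.search(pattern, user_agent):
            return True
    return False


# Assumed optimisation: wrap each pattern in a non-capturing group, join them
# with "|", and compile the result exactly once.
_COMBINED = re.compile("|".join(f"(?:{p})" for p in SAMPLE_PATTERNS))


def is_crawler(user_agent: str) -> bool:
    """Single search against the precompiled combined pattern."""
    return _COMBINED.search(user_agent) is not None
```

Both functions should agree on every input; the combined version simply makes one regex search per call instead of one per pattern.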

monperrus commented 2 months ago

Thanks a lot for the contribution. Would you also be able to contribute a test case that runs in CI? That would be awesome!

jribbens commented 2 months ago

OK, I've added some simple tests and updated pyproject.toml a bit (dev dependencies now live in it rather than in a separate requirements.txt, and it defines a project homepage). I also changed the CI workflow to use pip install -e .[dev], so that the test file can import crawleruseragents in order to test it.
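The tests themselves are not quoted in this thread, so here is a rough sketch of what a simple pytest-style test file for is_crawler() might look like. The test names and user-agent strings are made up for illustration, and the fallback definition exists only so the sketch runs without the package installed; it is not the project's implementation.

```python
import re

try:
    # In the repository, after `pip install -e .[dev]`, this import should work.
    from crawleruseragents import is_crawler
except ImportError:
    # Stand-in so the sketch is runnable on its own (an assumption, not the
    # project's code): flags two well-known crawler tokens.
    def is_crawler(user_agent: str) -> bool:
        return re.search(r"Googlebot\/|bingbot", user_agent) is not None


def test_known_crawler_is_detected() -> None:
    # A user-agent string containing a well-known crawler token.
    ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    assert is_crawler(ua)


def test_ordinary_browser_is_not_detected() -> None:
    # A typical desktop browser user-agent string.
    ua = "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"
    assert not is_crawler(ua)
```

With pytest installed as a dev dependency, running pytest from the repository root would collect and run functions named test_* like these.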

jribbens commented 2 months ago

If you're interested, I also reworked the Python harness to be completely type-annotated, along with other fixes, so that it can be fully type-checked and linted with 'mypy --strict', 'ruff check' and 'ruff format': https://github.com/jribbens/crawler-user-agents/compare/master...jribbens:crawler-user-agents:full-typing

This involves considerably more changes to the code though, of course, and it means it's only compatible with Python 3.8 or later (not sure what version you were targeting previously).

monperrus commented 2 months ago

Thanks a lot, @jribbens!