Closed rgaudin closed 1 year ago
This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.
I have eventually spent some time figuring this out.
TL;DR: https://github.com/openzim/overview/wiki/Python-Project-Bootstrap and https://github.com/rgaudin/bootstrap for a repo version ; if we agree on this workflow, we can move it to openzim/_python-bootstrap
The issue mentions repeatable setups and list tools. Of course, the issue is not a tooling issue but the result of years of lack of directive or openZIM convention.
Our main issue is that we explicitly, intentionally specified dependencies requirements as ranges, hoping that every developers down the tree would properly respect semantic versioning. Occasionally, this created an incident because some sub-sub-sub dependency got updated and something stopped working. Of course, this would happen when pushing some worry-less patch to one of our service and the CI would rebuild and we'd deploy and encounter a runtime error that needs to be fixed ASAP.
This example highlights various issues:
But this kind of problem would have not raised should we not allow updates to happen outside our control. And this is a direct developer responsibility.
So, to be very clear, we can continue to use a good old requirements.txt
and be safe of those. We just need to write our dependencies properly.
But going over all our project's dependencies and fixing them is a tedious task that. We should thus use this opportunity to update our Python setup to modern standards (convention is more appropriate).
Answer to this ticket is thus a policy and a working example of what an openZIM python project should start from. Every project lives its life so this is just about the common expect-able blocks.
My criterias were:
pyproject.toml
as this is the mordern and now widespread way to configure most python tools.I excluded pipenv because it can not work solely off pyproject.toml
. It would thus need maintaining both the Pipfile
and the pyproject.toml
.
I like its lock file but we can live without it. Remaining risks are:
I chose hatch
:
pyproject.yaml
. Single file configuration!version
tox
)Rest is pretty much self explanatory. @benoit74 @fledgeXu, we'd need your feedback on this. If you're OK with the general approach, we can discuss the details. If not, speak it up!
Ruff configuration (which rules we want to enforce) is one of those details. Can vary from project to project but we need a base.
As you are both starting new projects, those would use the new approach right away. Will help adjust it. I'll probably switch scraperlib as well and we'll use every dependency update opportunity to switch other projects.
I'd like to mention mypy (hint checking tool) which might be interesting down the road but I have no experience with it and I want us to be able to switch old repos to the new method easily.
@benoit74 @mgautierfr We are going to require this approach to all our Pythin project. If you have remarks, question or critics... this is the moment to speak out.
I agree that we should exclude the pipenv
. Here are some reasons:
pipenv
doesn't support the pyproject.toml
natively. In nautilus-webui
case, it makes updating dependencies is annoying, since I have to delete Pipfile.lock
every time.pipenv
's installation failed error output is unreadable and that makes debugging is terrible.As for the package management, I don't have the any prefer and I think we should exclude the poetry
too. Although poetry
also uses the pyproject.toml
, it has its own format AFAIK.
I have't tried Ruff
before, so I have no comment about it. But I have experience with mypy
, I personally used it as my auto-completion tool. It has better support for the pydantic
than the pyright
. In our case (FastAPI
stack), it's a good choice.
I have only two comments regarding the general approach :
mypy
or pyright
in the "openZIM standard" ; it is ok for me to decide to not use it for good reasons, but it must be in the "openZIM standard" so that we have to justify why we do not use it (typically on new tooling project targeting 3.11 and starting from scratch, there must be no missing type hints and it helps to discover issues rather fast ; my recent experience with mypy
was worse than with pyright
(see https://github.com/sqlalchemy/sqlalchemy/issues/9919) but my experience is still very limited and this is a detailopenzim/zimfarm
and is now a difficulty in offspot/metrics
(e.g. how to not re-test / re-build everything - front + back - on every pull request).Congrats for everything else ! (I have some details that I do not understand, but this is a different topic) 👏
Oh, one last thing : I believe that we must decide how we will decide to update this 🚀
This is not a joke, because:
This does not have to be a complex thing, I would propose to simply open an issue in this openzim/overview
project with details about the problem encountered, why we believe this is an issue, if possible what we propose to solve this, and then discuss it together.
I would like to mention pip-tools which provide a "simple" way to manage dependencies and especially allow us to have a intentionally specified dependencies requirements as ranges
and, in the same time, to lock the all dependencies to a specific version to ensure our deployment doesn't have unexpected dependency update.
It is not as full as hatch (which I don't know) but it is not its purpose neither.
How hatch behaves in regards of cython ? Poetry doesn't support it and you have, at best, to use a custom python script to run cython (https://yannlg.tech/blog/2022-07-23-cython-with-poetry).
I consider that we must have a type checking tool like mypyor pyright in the "openZIM standard"
I share this POV. We can be flexible with older code bases but should enforce type-hints on new projects.
I excluded it because I am not comfortable recommending anything at the moment. I have not used checking tools and while we've been using hints for a while now, it was only to our own human readability benefits. Maybe we should try it on a new project and add it to the convention once we're comfortable with something.
I would recommend to decide to really go multi-repo
Not discussing this here 😅 That's a different topic with different stakeholders and constraints. Please open another ticket with your arguments.
we must decide how we will decide to update this
I agree. If we go the repo way (openzim/_python-bootstrap
for instance), then we can have tickets/PRs there and simply update the code in the repo. main
would thus represent the latest recommendation. The least overhead the better IMO.
I would like to mention pip-tools
I had pip-tools in mind when writing that We could work around the lack of lock file. I know it can generate hash-based requirements and it works off pyproject.toml
so it would be relatively painless to include when needed.
I'm not sure I see a use case for intentional ranges beside our libraries (which are by definition not deployed) other than CI tests though.
How hatch behaves in regards of cython ?
From what I've seen, a standardization of C-extension building is in progress and once/if this PEP gets accepted, other builder would support building extensions directly.
Regarding pylibzim (our only C-extension with no plan to expand), I suggest we harmonize the linling and and QA but keep the setuptools and setup.py because we have some complex stuff going on there and cython is still tied to setuptools.
The goal of all this is to simplify our work with no-brainer answers for common stuff. Doesn't prevent us from doing differently where we see fit.
- It is not totally linked, but I would recommend to decide to really go multi-repo,
@benoit74 I would recommend to make your point in a dedicated ticket.
Closing this meta ticket after creating https://github.com/openzim/_python-bootstrap and some Wiki entries
We are now sufficiently affected by sub-dependencies issues for it to be necessary.
Scrapers and other Python projects should all be switched from good-old requirements.txt to repeatable, frozen environments. Harmonizing how we build/publish (from python perspective, not publish workflows!) should probably be looked at as well.
I'd suggest using
pipenv
/Pipfile
/Pipfile.lock
but looking at what's the current recommended way is mandatory.Note: this should be tackled with #13 and #14