Switch to repeatable Python setups

rgaudin commented 1 year ago

We are now sufficiently affected by sub-dependencies issues for it to be necessary.

Scrapers and other Python projects should all be switched from good-old requirements.txt to repeatable, frozen environments. Harmonizing how we build/publish (from python perspective, not publish workflows!) should probably be looked at as well.

I'd suggest using pipenv/Pipfile/Pipfile.lock but looking at what's the current recommended way is mandatory.

Note: this should be tackled with #13 and #14

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

rgaudin commented 1 year ago

I have eventually spent some time figuring this out.

TL;DR: https://github.com/openzim/overview/wiki/Python-Project-Bootstrap and https://github.com/rgaudin/bootstrap for a repo version ; if we agree on this workflow, we can move it to openzim/_python-bootstrap

The issue mentions repeatable setups and list tools. Of course, the issue is not a tooling issue but the result of years of lack of directive or openZIM convention.

Our main issue is that we explicitly, intentionally specified dependencies requirements as ranges, hoping that every developers down the tree would properly respect semantic versioning. Occasionally, this created an incident because some sub-sub-sub dependency got updated and something stopped working. Of course, this would happen when pushing some worry-less patch to one of our service and the CI would rebuild and we'd deploy and encounter a runtime error that needs to be fixed ASAP.

This example highlights various issues:

we have no release management for our web services (because it would most likely be a burden)
we don't have tests or not enough for our web services (historical debt mostly)

But this kind of problem would have not raised should we not allow updates to happen outside our control. And this is a direct developer responsibility.

So, to be very clear, we can continue to use a good old requirements.txt and be safe of those. We just need to write our dependencies properly.

But going over all our project's dependencies and fixing them is a tedious task that. We should thus use this opportunity to update our Python setup to modern standards (convention is more appropriate).

Answer to this ticket is thus a policy and a working example of what an openZIM python project should start from. Every project lives its life so this is just about the common expect-able blocks.

My criterias were:

Using pyproject.toml as this is the mordern and now widespread way to configure most python tools.
Single dependency definition. Obviously, specifying them multiple times is error prone and a maintenance hell to come.
Supported/popular tooling
No vendor lock-in.
Not intrusive to each developer's workflow

I excluded pipenv because it can not work solely off pyproject.toml. It would thus need maintaining both the Pipfile and the pyproject.toml. I like its lock file but we can live without it. Remaining risks are:

PyPi index is compromised and returning compromised packages. We could work around that should need be.
Pinned version's wheel is deleted. Pip would fallback to sdist which could behave differently. pipenv would suffer from this identically.

I chose hatch:

Developed, maintained and supported by pypa (PyPI)
Natively supports pyproject.yaml. Single file configuration!
Is a build-meta provider. With recent pip, you can build your packages with arbitrary builder tool, transparently. Very convenient as dev doesn't need to adapt to various tool but still benefits from its features.
Supports specifying dynamic version
handles environments transparently (creates and syncs them as needed)
supports environment matrix making it easy to test on multiple python version (ala tox)

Rest is pretty much self explanatory. @benoit74 @fledgeXu, we'd need your feedback on this. If you're OK with the general approach, we can discuss the details. If not, speak it up!

Ruff configuration (which rules we want to enforce) is one of those details. Can vary from project to project but we need a base.

As you are both starting new projects, those would use the new approach right away. Will help adjust it. I'll probably switch scraperlib as well and we'll use every dependency update opportunity to switch other projects.

I'd like to mention mypy (hint checking tool) which might be interesting down the road but I have no experience with it and I want us to be able to switch old repos to the new method easily.

kelson42 commented 1 year ago

@benoit74 @mgautierfr We are going to require this approach to all our Pythin project. If you have remarks, question or critics... this is the moment to speak out.

FledgeXu commented 1 year ago

I agree that we should exclude the pipenv. Here are some reasons:

pipenv doesn't support the pyproject.toml natively. In nautilus-webui case, it makes updating dependencies is annoying, since I have to delete Pipfile.lock every time.
pipenv's installation failed error output is unreadable and that makes debugging is terrible.

As for the package management, I don't have the any prefer and I think we should exclude the poetry too. Although poetry also uses the pyproject.toml, it has its own format AFAIK.

I have't tried Ruff before, so I have no comment about it. But I have experience with mypy, I personally used it as my auto-completion tool. It has better support for the pydantic than the pyright. In our case (FastAPI stack), it's a good choice.

benoit74 commented 1 year ago

I have only two comments regarding the general approach :

I consider that we must have a type checking tool like mypyor pyright in the "openZIM standard" ; it is ok for me to decide to not use it for good reasons, but it must be in the "openZIM standard" so that we have to justify why we do not use it (typically on new tooling project targeting 3.11 and starting from scratch, there must be no missing type hints and it helps to discover issues rather fast ; my recent experience with mypy was worse than with pyright (see https://github.com/sqlalchemy/sqlalchemy/issues/9919) but my experience is still very limited and this is a detail
It is not totally linked, but I would recommend to decide to really go multi-repo, i.e. if a tool has a front in Vue.JS and a backend in Python, they are two distinct Github projects ; the in-between solution of today (many projects so definitely not monorepo but still many technology in one project) is the worse from my point of view. We are not mono-repo so we have to regularly setup new projects but it is complex because we have many different stuff to put at the same place instead of benefiting from more straightforward community tooling. For instance setting up a bot like dependabot would probably be rather complex. At least this "in-between" situation was a difficulty for me in openzim/zimfarm and is now a difficulty in offspot/metrics (e.g. how to not re-test / re-build everything - front + back - on every pull request).

Congrats for everything else ! (I have some details that I do not understand, but this is a different topic) 👏

benoit74 commented 1 year ago

Oh, one last thing : I believe that we must decide how we will decide to update this 🚀

This is not a joke, because:

I consider that I have not enough maturity / experience to have a sound opinion on all details
I consider that what has been done is a good starting point and it is better to try to put this in place and have a clear process to update this rather that discussing about details during hours ; should it become clear that there is a real pain point / wrong decision that was made, we will just have to update this
I'm sure that in the long run, stuff will change (new tools, new PEP, new way of working) which will have to be reflected in this

This does not have to be a complex thing, I would propose to simply open an issue in this openzim/overview project with details about the problem encountered, why we believe this is an issue, if possible what we propose to solve this, and then discuss it together.

mgautierfr commented 1 year ago

I would like to mention pip-tools which provide a "simple" way to manage dependencies and especially allow us to have a intentionally specified dependencies requirements as ranges and, in the same time, to lock the all dependencies to a specific version to ensure our deployment doesn't have unexpected dependency update.

It is not as full as hatch (which I don't know) but it is not its purpose neither.

How hatch behaves in regards of cython ? Poetry doesn't support it and you have, at best, to use a custom python script to run cython (https://yannlg.tech/blog/2022-07-23-cython-with-poetry).

rgaudin commented 1 year ago

I consider that we must have a type checking tool like mypyor pyright in the "openZIM standard"

I share this POV. We can be flexible with older code bases but should enforce type-hints on new projects.

I excluded it because I am not comfortable recommending anything at the moment. I have not used checking tools and while we've been using hints for a while now, it was only to our own human readability benefits. Maybe we should try it on a new project and add it to the convention once we're comfortable with something.

I would recommend to decide to really go multi-repo

Not discussing this here 😅 That's a different topic with different stakeholders and constraints. Please open another ticket with your arguments.

we must decide how we will decide to update this

I agree. If we go the repo way (openzim/_python-bootstrap for instance), then we can have tickets/PRs there and simply update the code in the repo. main would thus represent the latest recommendation. The least overhead the better IMO.

I would like to mention pip-tools

I had pip-tools in mind when writing that We could work around the lack of lock file. I know it can generate hash-based requirements and it works off pyproject.toml so it would be relatively painless to include when needed.

I'm not sure I see a use case for intentional ranges beside our libraries (which are by definition not deployed) other than CI tests though.

How hatch behaves in regards of cython ?

From what I've seen, a standardization of C-extension building is in progress and once/if this PEP gets accepted, other builder would support building extensions directly.

Regarding pylibzim (our only C-extension with no plan to expand), I suggest we harmonize the linling and and QA but keep the setuptools and setup.py because we have some complex stuff going on there and cython is still tied to setuptools.

The goal of all this is to simplify our work with no-brainer answers for common stuff. Doesn't prevent us from doing differently where we see fit.

kelson42 commented 1 year ago

It is not totally linked, but I would recommend to decide to really go multi-repo,

@benoit74 I would recommend to make your point in a dedicated ticket.

rgaudin commented 1 year ago

Closing this meta ticket after creating https://github.com/openzim/_python-bootstrap and some Wiki entries

openzim / overview

Switch to repeatable Python setups #16