scientific-python / summit-2023

Work summit 2023

Reduce cross-project effort when shared dependencies change/break #20

Open drammock opened 1 year ago

drammock commented 1 year ago

EDIT: issue title changed to reflect the nature of the problem; the initial description below is just one idea towards solving it

We could maintain forks of various key dependencies at the ecosystem level, and ecosystem projects could then depend on / install from there. Hopefully this results in fewer cases of "shoot, dependencyX is breaking our CIs, I don't have time to investigate right now so I'll just pin it" happening across many ecosystem packages. Examples of possible dependencies where this might make sense:

Some of these might be cases of "fork-and-freeze", with a predetermined schedule of when upstream changes are pulled in. Prior to such updates, a couple of ecosystem packages could temporarily switch to installing the upstream in a PR to see if any CIs break, and if they do, the update could be delayed until the breakage was resolved.

Others might need to be occasionally maintained at the ecosystem-fork level (here I'm thinking about the recent docutils release where node.findall replaced node.traverse and it had to be fixed in numpydoc, sphinxcontrib-bibtex, and a couple of MNE-Python's other dependencies). Having ecosystem-level forks would mean that as soon as any of us noticed a problem, a patch could be made at the ecosystem fork level and the fix would be available to the ecosystem immediately (i.e., without waiting for the patch to be accepted upstream).
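
To make the docutils example concrete: a downstream package can usually paper over that kind of rename with a small compatibility helper. The following is only a minimal sketch, not the patch that numpydoc or sphinxcontrib-bibtex actually shipped. `iter_nodes` is a hypothetical name, and the only assumption is that newer docutils nodes expose `findall` while older ones expose only `traverse`.

```python
def iter_nodes(node, *args, **kwargs):
    """Iterate over matching descendants of a docutils node.

    Prefers ``node.findall`` when the installed docutils provides it and
    falls back to the older ``node.traverse`` otherwise.
    """
    finder = getattr(node, "findall", None)
    if finder is None:
        # Older docutils: only ``traverse`` exists.
        finder = node.traverse
    # Wrap in iter() so callers get an iterator whichever method was used.
    return iter(finder(*args, **kwargs))
```

An extension would then call `iter_nodes(doctree, nodes.section)` instead of calling either method directly, so the same code runs against both old and new docutils.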

Questions:

tupui commented 1 year ago

I think that this is a very risky thing to do. What if all packages of the ecosystem rely on such forks and then our fix is not accepted upstream? This does not seem very healthy for the whole community IMHO.

bsipocz commented 1 year ago

I wholeheartedly agree with @tupui. It's difficult enough to find the bandwidth to maintain the plugins and extensions and stay on top of the infrastructure. Taking on the maintenance of forks would likely require an even larger effort and would create an unhealthy ecosystem. (And yes, patches we pushed upstream were often not accepted by these more generic tools that aren't focused on scientific usage, so we had to come up with workarounds in the extensions downstream. I would not recommend doing it in a fork.)

drammock commented 1 year ago

@tupui @bsipocz I agree that this is risky.

do you have other ideas about how to reduce cross-project effort when breaking changes happen in shared dependencies?

tupui commented 1 year ago

I do 😃 To me it all starts with communication between projects. i.e. like we are going to do with the summit 🎉

tupui commented 1 year ago

To echo your proposal though, I do think having projects under the same umbrella can help, which we sort of have thanks to initiatives like Scientific Python, NumFOCUS, PyData.

drammock commented 1 year ago

NumFOCUS is a fiscal sponsorship agreement, perhaps not all projects can/want to be part of it, so we shouldn't rely on that. PyData doesn't include domain-specific packages AFAICT, so also doesn't include all stakeholders.

This Scientific Python ecosystem seems like the right place to centralize communication, but how exactly? I'll change the issue title to reflect better the problem (rather than my risky solution idea).

bsipocz commented 1 year ago

A couple of years ago I used to maintain https://github.com/astropy/ci-helpers for this exact same purpose (cross-project pinning of dependencies to ease individual CI adjustments), but things became reasonably more compatible.

I agree that communication is key, but also dev testing helps enormously.

What we could do, in addition to communication, is to maintain a 1) pin file, 2) some shared (e.g. tox) config. But neither of these looks super feasible, as at some point packages need to understand these breakages and update themselves anyway, or use a very similar CI setup.

For the communication part, I feel the scientific python discord already does some heavy lifting. Recent examples include the codecov removal from PyPI and issues with the circleci-artifact-redirector action.

drammock commented 1 year ago

I used to maintain https://github.com/astropy/ci-helpers

belated thanks for that work! We (MNE-Python) benefitted from that.

the scientific python discord already does some heavy lifting

didn't know about that, just joined, thanks

we could [...] maintain a 1) pin file, 2) some shared (e.g. tox) config. But neither of these looks super feasible, as at some point packages need to understand these breakages and update themselves anyway, or use a very similar CI setup.

Maybe in the end it's not practical to centralize anything more than information. It would be nice if it were somewhere a bit more public than Discord though (i.e. findable with a search engine). If there were a centralized place to find out if/how other packages have fixed such problems I think that could save a lot of maintainers some time.

Here's another motivating example, the deprecation of distutils.version.LooseVersion: we fixed it in MNE-Python, discovered that StatsModels had done it differently, and then passed both solutions to Nilearn. A way to easily find/share such things more widely would be great.
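
For readers who haven't hit this one, here is a rough, stdlib-only sketch of the kind of helper that ends up replacing `LooseVersion` comparisons. It is not the code that MNE-Python, StatsModels, or Nilearn actually use. `release_tuple` and `at_least` are hypothetical names, and the simplification is loud: only the leading numeric release segment is compared, so pre-release and dev suffixes are ignored. A package that can afford the dependency would more likely use `packaging.version.Version`, which handles the full PEP 440 ordering.

```python
from __future__ import annotations

import re


def release_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '1.10.0rc1' -> (1, 10, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"no numeric release segment in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def at_least(installed: str, minimum: str) -> bool:
    """True if ``installed`` is at or above ``minimum`` (release segment only)."""
    have, want = release_tuple(installed), release_tuple(minimum)
    # Pad the shorter tuple with zeros so that '1.9' compares equal to '1.9.0'.
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have >= want
```

The point of comparing integer tuples rather than strings is that `"1.10"` sorts above `"1.9"`, which is what a plain string comparison gets wrong.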

bsipocz commented 1 year ago

indeed, that's a good example too. Just to add to the mixture, astropy complicated it a bit more, and added astropy.utils.minversion (but under the hood, it's very much the StatsModels approach). But honestly, most things in that utils module could be used more generally.

tupui commented 1 year ago

And we also have our own version in SciPy 😅 but this is only because we don't want another runtime dependency.

I agree with @bsipocz that since we are using Discord and also our Slack channels, we are more than ever working together to find common solutions. There are even bigger things, with grants, that we do together.

ksunden commented 1 year ago

This conversation became quite relevant today to matplotlib, as sphinx has released a version that breaks one of our (internal use only) extensions.

I wonder if it would be better to instead of "fork and freeze", etc, to distribute the workload of maintaining these shared upstreams, offering at the very least some amount of PR review (and perhaps enforcing that changes go through PRs in some cases... the particular change that affects mpl from the most recent sphinx was pushed directly to master). To be clear, I'm not trying to rag on them too much; I see it as a symptom of not having the maintenance support they need. (See also https://xkcd.com/2347/)

Perhaps we could also get them to buy into some level of testing against our packages in CI (crossref #16) to catch problems before a release and figure out how and where to address them prior to it becoming a problem which fails our CI and causes us all to launch into firefighting mode.

drammock commented 1 year ago

I wonder if it would be better to instead of "fork and freeze", etc, to distribute the workload of maintaining these shared upstreams

That seems ideal to me, and to some extent it's what we (mne-python devs) are already doing. I'm sure this is true of other devs in other projects too. So I wonder how realistic it is to tack on Sphinx maintainer duties... Maybe in the long term the ecosystem could propose funding that includes dev time just for key dependencies?

Perhaps we could also get them to buy into some level of testing against our packages in CI

Also a good idea. This is something that might be achievable in a fork with a bot? Which means we wouldn't even need buy-in.

bsipocz commented 1 year ago

I wonder if it would be better to instead of "fork and freeze", etc, to distribute the workload of maintaining these shared upstream

An iteration/slightly different flavor of this is https://github.com/scientific-python/summit-2023/issues/10, which proposes that we should at least bring the extensions/plugins under the more generic umbrella of scientific python, to pool our very limited resources here.

As for maintaining the upstream infrastructure libraries, call me pessimistic, but I feel adding an infrastructure dev version job to the CI to each and every core scientific python libraries (at least the ones that experienced issues with sphinx/pytest releases) is a more realistic solution. Or at least the minimum we should do. Maybe it even deserves its own SPEC (I'm more than happy to champion it during the summit), or be it a part of https://scientific-python.org/specs/spec-0005/ .

tupui commented 1 year ago

I wonder if it would be better to instead of "fork and freeze", etc, to distribute the workload of maintaining these shared upstream

An iteration/slightly different flavor of this is #10, which proposes that we should at least bring the extensions/plugins under the more generic umbrella of scientific python, to pool our very limited resources here.

Agreed, we can only know what we know. Moving projects under a common umbrella would help that as maintainers would be more exposed to the projects which are part of the same org. I am having in mind the Sphinx theme here for instance (I know I said that already 😅.) There might be other ways, I am thinking about the stacks that we want to have. That could replace or complement that.

bsipocz commented 1 year ago

Oh, yes, the template totally falls under that category. Realistically, in every project, there is a small fraction of people dealing with these types of issues, and pulling us together into one place to communicate and share probably provides a solution for the vast majority of cases, without the burden of spreading the same personnel too thin by trying to take on maintenance of upstream projects where the priorities are not necessarily aligned with the need of the scientific libraries.

bsipocz commented 1 year ago

priorities are not necessarily aligned with the need of the scientific libraries

I'm saying this from the experience of trying to upstream some of the features from the extensions and plugins we have. Even when there was a welcoming attitude from upstream maintainers, it's not that easy to actually move things upstream and ensure they are kept maintained, and I definitely had a fair share of closed upstream PRs, too, where our needs were deemed not within the scope.

drammock commented 1 year ago

call me pessimistic, but I feel adding an infrastructure dev version job to the CI to each and every core scientific python libraries (at least the ones that experienced issues with sphinx/pytest releases) is a more realistic solution.

MNE-Python is doing that already. Getting repos that aren't doing that to do it is a good idea, but doesn't address the question of "when things break on my pip-pre CI, where do I look to see if a solution is already out there?"

Moving projects under a common umbrella would help that as maintainers would be more exposed to the projects which are part of the same org. I am having in mind the Sphinx theme here

@tupui you are suggesting that pydata-sphinx-theme should exist at scientific-python/sphinx-theme instead of at pydata/pydata-sphinx-theme? Maybe I'm misunderstanding but it is not clear to me how this will help. Speaking as one of the few active maintainers of that theme, changing the URL won't magically make it so that I have time to follow the development chatter of NumPy / Scipy / AstroPy / matplotlib / etc or become familiar with each of their build processes and website designs (which IMO is what it would take for theme development to be more responsive to the needs of all the packages in the ecosystem).

To me it seems both more feasible and more appropriate for someone from each package/community (at least the larger ones) to be the "point-person" on the sphinx theme, and for that person to serve as a maintainer of the theme. I don't think this approach scales well to all dependencies; as @bsipocz notes there are challenges with upstreaming fixes to dependencies that are outside the ecosystem, and we can't count on dev teams for those projects to simply grant maintainer status / commit rights to whoever comes asking. But the pydata sphinx theme is not like that, it is already inside the ecosystem, and we are generous with permissions.

tupui commented 1 year ago

@tupui you are suggesting that pydata-sphinx-theme should exist at scientific-python/sphinx-theme instead of at pydata/pydata-sphinx-theme? Maybe I'm misunderstanding but it is not clear to me how this will help. Speaking as one of the few active maintainers of that theme, changing the URL won't magically make it so that I have time to follow the development chatter of NumPy / Scipy / AstroPy / matplotlib / etc or become familiar with each of their build processes and website designs (which IMO is what it would take for theme development to be more responsive to the needs of all the packages in the ecosystem).

Sure this will not be a magical thing and solve the lack of maintenance time.

This is not just about a URL. There are a few things that it would imply at least to me. For one, this would send a message of unity and coherence to the community.

(I am putting aside maintainers, I am talking about users. As you said earlier, we have (now) NumFOCUS, PyData, Scientific Python, some bits of SciPy with the conf, various groupings of packages, and in all this mix you even have "friendly" companies like Conda or Quansight. Oh, and of course, there is the PSF above. Even for maintainers or folks aware of all the mechanics, it's easy to lose track of what is what and who is doing what, what sorts of plans are in place, etc. So I think it would be great to move towards clarifying all of that. Yes, things like moving the theme and also renaming the SciPy conf.)

If we have a clear structure, it's easier to give clear responsibilities and prerogatives to all parties. It also shows that we are able to organize ourselves efficiently, and we can have a stronger voice not only in the general Python community, but also in the software industry.

I am also confident that this would help make stronger proposals for grants; and also my dream, if we make such groups we could maybe draft some real concrete plans for everything so that a company can come and say: "ok I have 1M, what can this do tomorrow?" Today, I cannot answer this question. If someone can, great, but until this info is public it's worthless, because investors don't see it.

To me it seems both more feasible and more appropriate for someone from each package/community (at least the larger ones) to be the "point-person" on the sphinx theme, and for that person to serve as a maintainer of the theme. I don't think this approach scales well to all dependencies; as @bsipocz notes there are challenges with upstreaming fixes to dependencies that are outside the ecosystem, and we can't count on dev teams for those projects to simply grant maintainer status / commit rights to whoever comes asking. But the pydata sphinx theme is not like that, it is already inside the ecosystem, and we are generous with permissions.

I agree with that part, and to me this is not one or the other but a complement. I have been filling this role for SciPy and partially for NumPy (I also advocated for this theme and moved a lot of other open-source projects to it.) I have to say that, for now, I have not seen my having this role make a difference.

From my perspective, the problem that I have is that we simply don't talk.

On another note, Discord has been helping a bit and I feel I have way more interactions with other projects now. But we are still missing focus groups.

I know that the current tone in the general scientific Python ecosystem is that we don't "assign" or give "work" to people. I disagree with that in the sense that 1. we at least have a moral responsibility as maintainers of such influential libraries and 2. I would argue that it is this current state of mind which led to our difficulty recruiting maintainers. e.g. I know tons of folks in the industry which tell me that they would never contribute to OSS just because we appear to be unorganized and have no plans. I am always using this as an example: IRL, if you have an association (any domain), you have people responsible for everything. But not only that, they also have obligations and duties to maintain their status/rank/privileges. As a maintainer, I should have some duties to maintain this status and it should not just be a "free ride."

Just in case, since this is an online-only convo: I am enthusiastic about this conversation and it's a good thing that we are having such talks 😃 I am very much looking forward to the summit to discuss all of that further.

drammock commented 1 year ago

my dream, if we make such groups we could maybe draft some real concrete plans for everything so that a company can come and say: "ok I have 1M, what can this do tomorrow?"

I like the big-picture thinking and I admire your ambition!

I have been filling this role for SciPy and partially for NumPy

I wasn't aware of this, it looks like most of your PRs happened last spring when I was on parental leave. Thanks for that! (there are plenty of open issues if you want to put on the maintainer hat again 😉)

I know that the current tone in the general scientific Python ecosystem is that we don't "assign" or give "work" to people.

True that I rarely if ever tell someone to do something, except maybe a contributor in a PR review (e.g. "add a test" or "use f-strings not %" or whatever). But I am constantly asking people to do more work (see above 🙂). Given the power dynamics and incentive structures of the community (i.e., almost everyone is a volunteer and the rewards are minimal) I see this as appropriate (acknowledging that they don't owe me the work).

I know tons of folks in the industry which tell me that they would never contribute to OSS just because we appear to be unorganized and have no plans

That is unfortunate. Partly true, partly a PR problem. Definitely worth thinking about ways to address it.

As a maintainer, I should have some duties to maintain this status and it should not just be a "free ride."

Huh? Almost everyone who works on MNE-Python is unpaid for that work. Most are doing it on their nights and weekends. This includes most members of our steering committee. We have a process for "de-commissioning" steering committee members who are inactive. It's hard for me to think of any of those people as "free riders", especially when "I'm a maintainer of an open-source software package" still doesn't pull much weight with most hiring or promotion committees. In other words, from where I'm sitting it's not "free" and the "ride" doesn't go very far 😆

tupui commented 1 year ago

I like the big-picture thinking and I admire your ambition!

Hopefully one day it serves something 😄

Elaborating a bit on the "free ride" part and clubs 😅 Note that what I am saying would not necessarily apply to small projects, here I am talking about the big ones like the theme, SciPy, NumPy, Astropy, etc.

I agree that a lot of folks are pure volunteers and doing this on their free time. Even now that I am at Quansight, I still spend a looooot of my personal time around.

The "free ride" is manifold. There are folks who don't do much anymore and are still listed as maintainers of prominent projects (the level of activity is hard to quantify without metrics, and metrics are hard to come by as usual.) "Free" also means maintainers without the jobs, tasks, or responsibilities one would be appointed to.

As a maintainer of a big project, yes I am supposed to share the "burden" of the backlog, etc. But I am also getting the bonus of saying that I am a maintainer. Personally this got me my last 3 jobs. So to me, besides what I hope my contributions would achieve for a project or the wider scientific community, I also see this work as an investment in myself. And given the benefits I am getting out of this "position", I would not be "surprised" to get some extra duties in exchange for the title of maintainer. Otherwise, I could just stay an active contributor. This is the important distinction to me: maintainer vs contributor.

To go back to a club example: if you want to get into some hiking club, you would probably need to pay a fee to join the club. Then when doing some hikes, some folks would be responsible for some things like transport, safety, routes. And these same folks are also mostly volunteers, pay the same fees, etc.