pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

RFC: polars 1.0 #6616

Closed ritchie46 closed 1 year ago

ritchie46 commented 1 year ago

I want to go to a polars 1.0, and I want to treat it as just another version. That means we won't see 1.0 as something sacred. I think keeping 1.0 for too long is not something we should aim for, as mistakes need to be corrected and are often small; e.g. a small search and replace is often enough to fix them.

My rationale is reading the semantic versioning snippet:

How do I know when to release 1.0.0?

If your software is being used in production, it should probably already be 1.0.0. If you have a stable API on which users have come to depend, you should be 1.0.0. If you’re worrying a lot about backwards compatibility, you should probably already be 1.0.0.

A lot of companies have a policy (which I don't agree with), to not use 0.x software. This is a shame as the core of polars should definitely be considered production ready.

I hope we can release a 1.0 soon, indicating that we are quite certain about our production fit and the core of our API. However, I do think we should not be afraid to go to 2.0 and so forth. Even if we are super stable, I'd like to have a yearly release where we break with our mistakes. These breaking changes may be small, but not fixing them means we don't aim for the best technical solution, and I just don't want to make that concession.

stinodego commented 1 year ago

Our breaking releases are becoming much less frequent already, so I think this is a good move!

And I think it doesn't hurt to be more critical of breaking changes going forward. Sometimes I feel we go ahead with breaking changes without having thought them through fully. Being on 1.0.0 will hopefully help in that regard.

Also, we have functionality for hiding things behind an experimental flag, so this should help keep pushing out new features while keeping breaking changes to a minimum.

alexander-beedie commented 1 year ago

How about implementing CalVer versioning?

Rationale:

I think it would meet most of the goals above, and it's a format that has really grown on me over time. It can remove the stigma of < 1.0 releases and has a nice regular cadence to it (as does polars itself), while also having some additional intuitive meaning, as year/month information is available at a glance.

As a standard it seems to have grown significantly, and there are plenty of good examples to pattern it after. It is not tied to any specific language or programming domain; you can find examples in software as diverse as OS (Ubuntu), IDE (PyCharm), packages (fsspec), etc.

Example:

# package is the 123rd release in Feb '23 (busy month :)
polars==2023.02.123

Counter-argument:

The argument against would probably be a strict adherence to the "major version = breaking change" aspect of SemVer, but I'm not convinced this is compelling - in the real world it is poorly observed (as an example, the number of breaking changes made to pandas since 1.0 is enormous but it is still on 1.x - SemVer can often just reflect arbitrary developer preference, and/or the feeling that "maybe we should do a major version release to show we're still alive & kicking" :)

More information:

ritchie46 commented 1 year ago

I really like this idea @alexander-beedie. I hate semver because it packs this promise into a single number, while we will always add logic on the periphery of our API.

I know that postgres and dask use your proposed versioning schema and I have looked into it already for polars, but I believe this was not possible with maturin. @messense I will tune you in here. Do you think non-semantic versioning is possible?

messense commented 1 year ago

I believe this was not possible with maturin.

maturin simply uses the version from pyproject.toml or Cargo.toml so if any of them supports CalVer, it should be fine. And I think at least pyproject.toml supports CalVer since it's PEP 440.

alexander-beedie commented 1 year ago

Yup, it's definitely PEP440-compliant - there is an explicit section relating to using dates as version numbers, since you can directly map major.minor.micro to yyyy.mm.micro, and increment micro as much as you like.

Date based release segments are also permitted. An example of a date based release scheme using the year and month of the release: ...

(In essence CalVer maps straight to a SemVer with an unusually large major version :)
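The mapping described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not a PEP 440 parser: `parse_release` is a hypothetical helper that handles purely numeric dotted versions, and real tooling would more likely rely on the third-party `packaging` library.

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Split a purely numeric dotted version into a tuple of ints.

    Only the release segment is handled (no pre/post/dev markers).
    Leading zeros vanish in the int conversion, so "02" compares as 2.
    """
    return tuple(int(part) for part in version.split("."))


# yyyy.mm.micro occupies the same three slots as major.minor.micro
calver = parse_release("2023.02.123")
semver_style = parse_release("0.16.2")

# Tuples compare element by element, which gives the expected ordering:
# a later month sorts after an earlier one, whatever the micro number is.
later_month = parse_release("2023.03.1")
```

Seen this way, a CalVer release is simply a version whose "major" number happens to be the year, which is why it sorts after every earlier 0.x release.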

stinodego commented 1 year ago

I think CalVer is amazing for the 'future' flow you describe, where you do breaking changes once a year regardless of stability. black does this, where they have a new 'stable' style at the start of each calendar year (23.1.0 was just released), and throughout the year their new options are available through the --experimental flag.

The great thing is that by doing it this way, they are also compatible with SemVer.

I don't think we're there yet with Polars. We will probably want to break things more than once a year. If we're on CalVer, that means we're no longer compatible with SemVer.

And the nice part about SemVer is that users will know when we're doing breaking changes. The fact that other packages do not take the semantics of SemVer seriously, doesn't mean it's not a good format for managing user expectations about new versions.

Why not start with SemVer, and see how this develops, and then transition to CalVer later when a breaking release once a year is enough?

kylebarron commented 1 year ago

Possibly useful in this discussion: one of the Dask maintainers wrote up his thoughts after Dask has been on CalVer for a couple years

https://jacobtomlinson.dev/posts/2023/sometimes-i-regret-using-calver/

stinodego commented 1 year ago

https://jacobtomlinson.dev/posts/2023/sometimes-i-regret-using-calver/

Very good write-up. This is also why I am hesitant to go forward with CalVer. I think this part nails it:

In my opinion, CalVer signals to your community that anything could happen at any time and that you have no interest in the effect that has on your users. The project has been tested to be working on a given date, but it is an exercise for the user to figure out how much effort it would take for them to upgrade. I don’t feel this is very respectful of users time and effort.

gam-phon commented 1 year ago

I agree with @stinodego

For me:

I like the Django release process, especially upgrading from LTS to LTS. Before 2.0, upgrading from one LTS to the next took a huge effort: you had to upgrade through each intermediate version until you reached the next LTS. With their new (now it is old ^_^) release process, I can stay on an older LTS, skip all the versions in between, and upgrade straight to the next LTS. Just by fixing the warnings in the old LTS, I can upgrade to the next LTS without even reading the release notes.

Django is a mature project, so I am not expecting the same here but maybe we could learn from their process

So maybe each year there could be 3 releases: 1.0, 1.1, 1.2 (with 1 year of support or longer, to give time for fixing the warnings), then next year 2.0, 2.1, 2.2, and so on. That would let you make breaking changes every year and make it easy to upgrade from 1.2 to 2.0, 2.1 or even 2.2.

s-banach commented 1 year ago

I love the package, I admire you all as programmers. As a non-contributing, non-programmer, I understand my opinion is probably worth less than two cents. Feel free to ignore me.

It seems like every week there are bugfixes for relatively simple broken queries. For example, #6518, #6519, #6527, #6560, and #6577 are from the past week. As issues come in, a suite of test cases is gradually being developed, which is great. I wonder whether polars should have a more comprehensive suite of logic tests before being considered mature.

mcrumiller commented 1 year ago

I prefer CalVer, but that is mainly because the concept of "breaking change" being either True or False doesn't sit right with me. Technically, if you issue a new release where an obscure function now uses positional arguments instead of keyword arguments, you have a breaking change, but that doesn't justify a major version bump. Breaking changes should be accompanied by some sort of quantifier, and that quantifier doesn't have to be embedded in the version number itself. Some sort of "breaking change matrix" would capture the idea well, which says how hard it would be to go from version X to version Y.

slonik-az commented 1 year ago

I would prefer CalVer as well. SemVer is great in theory but in practice I observed breaking changes introduced by minor versions.

concept of "breaking change" being either True or False doesn't sit right with me.

Fully agree. Breaking changes are way more nuanced than simply Yes/No.

alexander-beedie commented 1 year ago

I wonder whether polars should have a more comprehensive suite of logic tests before being considered mature.

@s-banach: No, you're not wrong; we could be more pro-active here. The test suite, as developed, has definitely prioritised wide/shallow coverage in order to fit well with a fast pace of development; as we enter a new phase of development, it might not be a bad idea to rethink testing. I added a slower (but more comprehensive) set of testing primitives for polars last year (based on the hypothesis library), and I think we should start to make much more use of these, to more actively find edge-cases and improve coverage.

@ritchie46 / @stinodego: what would you think about starting to significantly expand the parametric tests, having them run automatically as part of a release? Wouldn't slow the development/patch velocity, and you could run them manually at any time, but should start to harden actual releases. Pretty much every time I've added a parametric test we've found edge-cases, so just the process of adding such coverage is likely to uncover plenty of small things.

(I've also been meaning to investigate the state-machine capabilities of hypothesis, in order to try automatically generating sequences of operations with different optimisation levels and confirming they converge on the same result, but there are only so many hours in the day...😅)
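A toy version of that last idea, generating random sequences of operations and checking that two execution strategies converge on the same result, might look like the sketch below. Everything in it is an assumption made for illustration: the three-operation "query language", `run_naive` and `run_fused` are hypothetical stand-ins for an unoptimised and an optimised plan, and the standard-library `random` module is used in place of hypothesis and its state-machine support.

```python
import random

# Hypothetical mini "query language": each op is a (name, constant) pair.
OPS = ("add", "mul", "filter_gt")


def run_naive(data, query):
    """Unoptimised plan: apply each operation as a separate pass over the data."""
    for name, k in query:
        if name == "add":
            data = [x + k for x in data]
        elif name == "mul":
            data = [x * k for x in data]
        else:  # filter_gt keeps values strictly greater than k
            data = [x for x in data if x > k]
    return data


def run_fused(data, query):
    """'Optimised' plan: push each element through the whole query in one pass."""
    out = []
    for x in data:
        keep = True
        for name, k in query:
            if name == "add":
                x += k
            elif name == "mul":
                x *= k
            elif x <= k:  # filter_gt rejects this element
                keep = False
                break
        if keep:
            out.append(x)
    return out


def random_query(rng, max_len=6):
    """Build a random, always-valid sequence of operations."""
    length = rng.randint(0, max_len)
    return [(rng.choice(OPS), rng.randint(-5, 5)) for _ in range(length)]


def check_plans_agree(trials=500, seed=0):
    """Assert that both plans give identical results on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(-20, 20) for _ in range(rng.randint(0, 10))]
        query = random_query(rng)
        assert run_naive(data, query) == run_fused(data, query), (data, query)
    return trials
```

The hard part in a real engine is the query generator: here every random sequence is valid by construction, which is exactly what is difficult to guarantee for real queries.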

ritchie46 commented 1 year ago

I don't think parametric tests will help find these kinds of bugs. Most of the edge cases are found not by the input data, but by the queries. And it is really hard to randomly generate valid queries. I also don't agree with the sentiment that it should stop us from being 1.0.

I'd rather have us spend time on fixing bugs and adding tests for those found.

Any library that keeps developing will introduce bugs. They should not stay on 0.x because of it. I think the edge case surface is infinite and if semantic versioning is seen as an inverse bug tracker I really think we should do calver.

But I don't see semantic versioning as such, so I think semver is fine if we commit to going to 2.0...n.0 at a regular interval.

alexander-beedie commented 1 year ago

Ahh, I'm certainly not suggesting it holds-up a 1.0, but I do think we would benefit from more automated/generative testing, and an expansion of the parametric tests would definitely help provide that. (The state-machine approach is indeed non-trivial, but at some point I'll give it a damn good try ;)

alexander-beedie commented 1 year ago

https://jacobtomlinson.dev/posts/2023/sometimes-i-regret-using-calver/

Very good write-up.

Hmm... indeed, that is a well-argued piece 🤔

aptr322 commented 1 year ago

I'm surprised with .list() creating list of lists, somewhat counter-intuitive. I know there is more info which I haven't looked at, and should. I can say that using .list() became strange, like suddenly I needed to do arr().get(0) just to remove extra level

ritchie46 commented 1 year ago

I'm surprised with .list() creating list of lists, somewhat counter-intuitive. I know there is more info which I haven't looked at, and should. I can say that using .list() became strange, like suddenly I needed to do arr().get(0) just to remove extra level

Please don't hijack an issue with something off-topic. The list aggregation was incorrect before. See the rationale here #6487

ritchie46 commented 1 year ago

The point of no return from calver is a good one. I am inclined to stick with semver as proposed with regular breaking releases. If we don't stress on keeping these numbers low like we don't in calver or in 0.x I think it is fine.

aptr322 commented 1 year ago

Didn't mean to, so you can disregard. It was quite a big change, and different from how others (spark, pandas) treat that. I'll look more at #6487.

I do wish your project to succeed, it is great and very well done.

aptr322 commented 1 year ago

I'm used to using path or git in cargo terms :) and don't mind 1.0, please do. That'd be the step :)

mkleinbort-ic commented 1 year ago

Two comments:

  1. I strongly prefer SemVer to CalVer for all the aforementioned reasons, and I look at the changelogs to know if I'll encounter any breaking changes. At the end of the day I get way more information from SemVer version numbers - at least at a quick glance. However, I don't much care if library maintainers break things in minor versions, just that they update major versions to tell me of major changes.

  2. Regarding 1.0 - Polars still feels experimental to me. I regularly write code and get "not what was intended" results (that get patched very quickly <3). I'd personally take more of a Scikit-Learn approach and remain on 0.xx.xx for a long time, if only to indicate to users that the API is still in flux.

zundertj commented 1 year ago

I am also in favor of SemVer over CalVer. Just because a minor breakage may occasionally slip into a release that is not meant to be breaking doesn't mean we should not try; the value add for users is large even if we get it right for just 95% of the changes we make. With CalVer, you give up on that altogether. SemVer is also really useful for deprecations: you can communicate a future version number in which something will be removed (you can't in CalVer, unless you promise to put out a release on that day/month), and when upgrading code bases you know which intermediate versions to migrate to first so as to make use of the deprecation warnings.

On 1.0: I am indifferent on this.

radugrosu commented 1 year ago

@alexander-beedie "as an example, the number of breaking changes made to pandas since 1.0 is enormous but it is still on 1.x". Pandas has many faults but breaking changes are not amongst them.

Tomlinson's post is a very convincing argument against CalVer in Polars' case. Regarding bumping to 1.0, from my user point of view, it feels a bit too soon - I just upgraded from 0.15 to 0.16 and it wasn't smooth. However, it's mostly the devs that know how close the api is to maturity. If you feel that the expected rate of breaking changes is about to slow down from now on, go for it.

shyamd commented 1 year ago

A little late to the party, but I'll add my intuition. One reason to like CalVer is that it liberates you from the "fear of making a breaking change" or at least indicating that there is a breaking change. It does this by removing your agency in incrementing already big numbers. With SemVer, there can be some level of decision paralysis around breaking changes and whether you want to up the major version or not.

My recommendation is to combine both: use SemVer, and don't worry at all about adding breaking changes. Do it. Increment the major version and move on. If Polars is at v24.3.2 in a year, that's just an indication of a lot of important changes.

The last thing I'll add here is that I moved that decision logic to CI/CD in a few projects and just marked my PRs with the type of change. Once I stopped caring about what version I was on, I was able to ship breaking changes a lot easier.
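As a rough sketch of what moving that decision into CI/CD can look like: the function below derives the next version from the change types attached to the merged PRs. The label names (`breaking`, `feature`, and anything else treated as a fix) and the `next_version` helper are assumptions for illustration, not the setup used in any particular project.

```python
def next_version(current: str, change_types: list[str]) -> str:
    """Bump a MAJOR.MINOR.PATCH version by the most severe change type.

    `change_types` holds the (hypothetical) labels of all PRs in the release.
    """
    major, minor, patch = (int(part) for part in current.split("."))
    if "breaking" in change_types:
        # Any breaking PR forces a major bump, however large the number gets.
        return f"{major + 1}.0.0"
    if "feature" in change_types:
        return f"{major}.{minor + 1}.0"
    # Only fixes (or nothing notable): patch release.
    return f"{major}.{minor}.{patch + 1}"
```

Because the labels decide the number, nobody has to argue about whether a change "deserves" a major version.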

MarcoGorelli commented 1 year ago

+1 to 1.0 and then yearly releases 🚀 It'd be a pity if someone didn't benefit from using polars in production just because of a "no-0.x" company policy

Alternatively, you might want to consider WhateVer (only joking here😆)

ritchie46 commented 1 year ago

Thanks for all the input people. Let's go for SemVer and 1.0 after 0.17 and promise to go to 2.0 within 6 months. I just want to see the semver as numbers like @shyamd is seeing them. That way we respect users' expectations and SemVer itself.

I also saw some remarks that mentioned 1.0 promising a stable API. This is something I never want to promise for all API surface. So that's why I see only two options:

We will add new features in the far future and we will make mistakes in doing so. By promising regular breaking changes we can fix those mistakes and we as maintainers don't have to stress as much about those numbers. After all they are just that... :)

stinodego commented 1 year ago

Sounds great! I will go through our API with a fine-tooth comb to pick out anything that is not ~pythonic~ in accordance with the Zen of Polars, and deprecate accordingly.

Then 0.17.0 will be as close to the desired API as possible, and 1.0.0 won't have to break much.

I'm also thinking that 0.17.0 could be the release where we stop supporting Python 3.7. Support ends in 4 months anyway, and it would be nice not to have to break that with our 1.0.0 release.

zundertj commented 1 year ago

Not sure if the intention is also to coordinate breaking changes we would want to do pre 1.0 in this thread? I guess that goes counter to the "sacred version" intention Ritchie has, which I fully get, but I guess it does not hurt to go through our existing issues with breaking ideas, and see if there are some that we can make without a huge effort and would save users coming in at 1.0 from a lot of churn when updating.

On that note, I would want to see #5429 go in for 0.17, and then we can remove the deprecation warnings for 1.0. Depending on timing of 0.17, I can do this myself, but happy for others to step in.

stinodego commented 1 year ago

Let's make an issue for that and pin it so that everyone can contribute!

stinodego commented 1 year ago

I ended up creating two GitHub milestones:

Feel free to assign issues / PRs to either milestone if you feel like they should be part of these releases. Or ping me if you don't have the rights to do so.

I think this will be a nice way to keep track of our intended breaking changes, and make sure we don't miss things when doing a breaking release.

@ritchie46 Feel free to shut this down if this is not the way you want to organize this!

vmgustavo commented 1 year ago

is it possible to watch/follow a milestone to get updates about the changes?

corneliusroemer commented 1 year ago

How's progress here? Some feedback: Polars is great, I love it, so much more intuitive than pandas and fast. But the still fairly frequent breaking changes to the API are off-putting to at least a good chunk of people, and that's in academia, not industry.

So some sort of stability would be great. I prefer Semver over Calver as long as Semver is respected. If it isn't, then yes, use Calver rather than making a breaking change in a patch release.

It'd be great if the renames of things like list->array were completed at some point and v1 was released. I get that you want to experiment with the API and realize some things should be different at some point. But it would be great if these breaking changes came not every month but maybe once every year or so with a new major version.

MarcoGorelli commented 1 year ago

Is there anything left blocking 1.0? The major issues I was aware of are resolved, and the issues tagged for 1.0 don't look like blockers

Might be worth setting a date, and anything that doesn't make it just goes in to 2.0 (6-12 months later)?

stinodego commented 1 year ago

There's nothing specifically blocking the 1.0 version. We're just a little nervous about the expectations that will come with it.

We had planned to go for 1.0 after the 0.18 version, however, we decided not to let the release of 1.0 coincide with the formation of the Polars company. We are planning for the next breaking release to be the 1.0 version.

With the release of 1.0, we will also clearly communicate that we are planning to continue to introduce breaking changes regularly (probably every 3 to 6 months). In practice, not much will change besides the breaking versions incrementing the major version (x.0.0) rather than the minor version (0.x.0).

s-banach commented 1 year ago

Maybe nobody else considers this a 1.0 blocker, but I'm really hoping for full pandas interop. pl.from_pandas can fail on pandas categorical columns, because polars can only handle string-dictionary types.

I'm trying to build tools that use polars on the backend to process dataframes, but I still don't feel I can share these with my coworkers until I know that from_pandas will never fail.

mkleinbort-ic commented 1 year ago

My understanding of SemVer is:

0.0.x increments: all previously correct code will continue to run with no changes. Some bugs might have been fixed.

0.x.0 increments: there is new functionality with new APIs / upcoming API changes - but all correct code continues to work.

x.0.0 increments: all bets are off - read the changelog.

If Polars' pace of change continues (e.g. renaming groupby to group_by, pl.any to pl.any_horizontal), I imagine we will be at v5.0.0 by the end of 2024 - but I'm ok with that.

Then again, I'm also ok with langchain being on version 0.0.301, so maybe this is all a bit arbitrary.

MarcoGorelli commented 1 year ago

groupby still continues to work - it would only be removed in 1.0

similarly, 1.x might introduce new deprecation warnings, but they would only be enforced in 2.0 (which Stijn has said would be at least 3-6 months after 1.0)
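The deprecate-first, remove-later cycle described here is commonly implemented with a warning that names the release in which the old spelling disappears. The sketch below is illustrative only: `deprecated_alias` is a hypothetical helper, and `group_by` / `groupby` are toy functions over plain Python lists, not the Polars methods.

```python
import functools
import warnings


def deprecated_alias(new_name: str, removed_in: str):
    """Hypothetical helper: mark a function as a deprecated alias of `new_name`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"`{func.__name__}` is deprecated and will be removed in version "
                f"{removed_in}; use `{new_name}` instead.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


def group_by(rows, key):
    """Toy stand-in for the renamed API: bucket rows by a key function."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    return groups


@deprecated_alias("group_by", removed_in="1.0.0")
def groupby(rows, key):
    """Old spelling: still works, but warns until it is removed."""
    return group_by(rows, key)
```

Calling `groupby(...)` keeps returning the same result as `group_by(...)` while emitting a `DeprecationWarning`, so users can fix their code on the last pre-breaking release and then upgrade.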

mkleinbort-ic commented 1 year ago

@MarcoGorelli

groupby still continues to work - it would only be removed in 1.0

I know, but historically Polars does break these things in 0.x.0 updates - for obvious reasons.

similarly, 1.x might introduce new deprecation warnings, but they would only be enforced in 2.0 (which Stijn has said would be at least 3-6 months after 1.0)

Sure. I like the rapid pace of polars development - a lot of awesome features being rapidly iterated on. I have no objection to frequent major releases.

corneliusroemer commented 1 year ago

@mkleinbort-ic I think you may have a slight misunderstanding of SemVer. In the "0.x.x" phase, which is considered "initial development," breaking changes can occur even in minor releases (like "0.x.0"), technically even in patches. It's only from "1.x.x" onwards that breaking changes are reserved strictly for major version increments.

From the specs:

  1. Version 1.0.0 defines the public API. The way in which the version number is incremented after this release is dependent on this public API and how it changes.
  2. Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable.

mkleinbort-ic commented 1 year ago

Thank you for the correction. Makes sense.
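The distinction drawn from the spec above can be written down as a small check. This is a sketch under stated assumptions: `may_break` is a hypothetical helper, it only understands plain MAJOR.MINOR.PATCH strings, and it assumes `new` is a later release than `old`.

```python
def may_break(old: str, new: str) -> bool:
    """Return True if SemVer permits breaking changes between `old` and `new`."""
    old_major = int(old.split(".")[0])
    new_major = int(new.split(".")[0])
    if old_major == 0 or new_major == 0:
        # 0.y.z is initial development: anything may change at any time,
        # so every upgrade that involves a 0.x version is potentially breaking.
        return old != new
    # From 1.0.0 onwards only a major version increment may break the API.
    return new_major > old_major
```

Under this reading, an upgrade from 0.16.0 to 0.16.1 is allowed to break, while 1.2.0 to 1.3.0 is not.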

stinodego commented 1 year ago

I'll close this, as the comments have been heard and a decision has been made. Expect 1.0.0rc1 to be released around year's end.

SebDeclercq commented 9 months ago

Is this 1.0.0rc1 still on the right track?

corneliusroemer commented 9 months ago

@SebDeclercq I haven't spotted a 1.0.0rc1 yet, @stinodego an update would be greatly appreciated 😃

stinodego commented 9 months ago

The next breaking release is still planned to be 1.0.0. We have been a bit delayed due to the great effort that went into changing the string type and ironing out some issues that came with that change.