jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.07k stars 223 forks source link

Add deprecation notice (#1429) #1464

Closed tustvold closed 1 year ago

tustvold commented 1 year ago

Closes #1429

Following on from #1446 I think all that remains is to provide some form of communication to users regarding the relationship of the projects going forward. I believe what I've written covers the consensus from the various issues and discussions on the topic, but my intent here is purely to kickstart the conversation, not to dictate the message :sweat_smile:

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage has no change and project coverage change: -0.02 :warning:

Comparison is base (33f6ba1) 83.93% compared to head (bf21ee6) 83.92%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1464 +/- ## ========================================== - Coverage 83.93% 83.92% -0.02% ========================================== Files 387 387 Lines 41596 41596 ========================================== - Hits 34913 34908 -5 - Misses 6683 6688 +5 ``` [see 5 files with indirect coverage changes](https://codecov.io/gh/jorgecarleitao/arrow2/pull/1464/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao) Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

ritchie46 commented 1 year ago

I am not entirely sure about this. I totally agree on working together towards a similar memory/crate interface.

But as it stands this deprecates the whole arrow2 API, which is used by polars and is deeply entrenched in the project. I am fine with pointing users tot he official arrow-rs project, but I want to keep maintaining the arrow2 API.

sundy-li commented 1 year ago

We tried to migrate to arrow-rs last week. But we found the untyped ArrayData and NullBuffer of arrow-rs are harder to use than arrow2's Buffer<T> and Bitmap.

tustvold commented 1 year ago

are harder to use than arrow2's Buffer and Bitmap.

I'm actively working on porting the remainder of this across in https://github.com/apache/arrow-rs/issues/3879 perhaps you might be able to lend a hand identifying any missing functionality?

alamb commented 1 year ago

But as it stands this deprecates the whole arrow2 API, which is used by polars and is deeply entrenched in the project. I am fine with pointing users tot he official arrow-rs project, but I want to keep maintaining the arrow2 API.

@ritchie46 can you help with the alternate wording? I didn't realize you still planned to keep maintaining the arrow2 crate. Maybe we have different ideas of what "maintaining" means 🤔

I think the best thing for users of arrow2 would be to be as explicit as possible about the state of the crate. For example:

  1. Will PRs be reviewed and accepted? (if so, by who?)
  2. Will new releases be made? (if so, by who? On what schedule?)

If you can provide some guidance on these issues I can take a shot at writing a section to summarize

ritchie46 commented 1 year ago

@alamb polars is heavily depended on arrow2's typed arrays/offsets/buffer interfaces.

If we are able have this API on top of arrow-rs memory, I can progressively refactor to arrow-rs, but this should be a slow progress as this is not top priority for polars. My ideal situation is a way to improve the memory interop first (this is what @tustvold has been adding, so that is a great start).

As I am aware, there is currently not away to go forward whilst using arrow2's array API. Before this should be deprecated, there should be such an interface as a zero-cost abstraction on top of arrow-rs's buffers IMO.

If that is not something that is wanted by arrow-rs, that might also be fine. Maybe polars should fork arrow2 and continue maintaining it less publicly.

I am just scared of the core dependency being pulled from underneath of polars. Releases are still needed as I want bugs to be fixed and be able to publish on crates.io.

tustvold commented 1 year ago

should be such an interface as a zero-cost abstraction on top of arrow-rs's buffers IMO

How zero-cost does this need to be? I guess my thinking was that #1446 was sufficiently low overhead as it doesn't copy the underlying buffers, and so would be sufficient to bridge between migrated and unmigrated parts of a codebase? What does a couple of additional reference counts matter when you have a kernel that is processing an array of thousands of rows? Perhaps I am missing something here?

The largest overhead at the moment is likely schema mapping, switching arrow2 to using the same schema types is definitely tractable should this prove a bottleneck.

there is currently not away to go forward whilst using arrow2's array API

My intention with this PR was not to say people should stop using these APIs today, I'm not suggesting we archive this repo :sweat_smile:, but rather to communicate a direction of travel to the community. I'm not overly concerned about timescales, if it takes 6 months it takes 6 months, I just would like to be able to set both projects on a path towards unification and clearly communicate this, along with any implications.

alamb commented 1 year ago

I certainly don't mean in any way to stop or prevent maintenance of arrow2 for people who want to do so.

As he says, @tustvold and I and the rest of the Arrow community plan to keep pushing arrow-rs forward and are trying to assist users of arrow2 who want to join us.

I am just scared of the core dependency being pulled from underneath of polars. Releases are still needed as I want bugs to be fixed and be able to publish on crates.io.

The purpose of this particular PR to change the documentation is as @tustvold says, to communicate that there seems to be no maintainers or plan for maintenance of this crate. I certainly don't plan to "pull" anything from arrow2 or polars.

If that is not something that is wanted by arrow-rs, that might also be fine. Maybe polars should fork arrow2 and continue maintaining it less publicly.

I would personally love to get the zero copy approach in arrow-rs.

In terms of maintaining a polars specific fork of arrow2 that certainly is a reasonable approach (or since you are a maintainer of this repo, you could maybe just use this one). Also perhaps you or other interested parties could work out an alternate maintenance plan for this crate with its other users.

I think it is up to you where you want to invest your time and effort.

alamb commented 1 year ago

Here is a PR that clarifies that the maintenance bandwidth on this crate is low: https://github.com/jorgecarleitao/arrow2/pull/1476

It side steps the issues of what "deprecated" might mean / not mean for the future and instead attempts to describe the current state of affairs.

I also hope it might help any current users of arrow2 to offer to help maintain the crate if that is valuable to them.

ritchie46 commented 1 year ago

See #1476