uwdata / arquero

Query processing and transformation of array-backed data tables.
https://idl.uw.edu/arquero
BSD 3-Clause "New" or "Revised" License
1.22k stars 64 forks source link

Apache Arrow 15 support #345

Closed rajsite closed 2 weeks ago

rajsite commented 4 months ago

Could dependencies be updated to support apache-arrow 15?

dioptre commented 3 months ago

This breaks compatibilty with arrow 15 (using table / toIPC breaks)

domoritz commented 3 months ago

Which methods specifically? Arrow 14 and 15 use the same IPC format and the APIs are very similar.

domoritz commented 3 months ago

I see what you mean. The binary format somehow changes: https://github.com/uwdata/arquero/pull/347.

dioptre commented 3 months ago

Haha nice to see you over here. Small world.

domoritz commented 3 months ago

I sent a pull request to change the default nullable for typed arrays back to not nullable (and also fixed a bug in null count): https://github.com/apache/arrow/pull/40852. I think we could either update to arrow 15 now with my fix to the tests and then later revert, or wait for arrow 16. Either way, arquero should work either version.

rajsite commented 3 months ago

I could make a separate issue for it, but it would also be helpful if arrow was a peer dependency instead of a direct dependency (helps make sure an app only ends up with a single copy of arrow shared across dependencies).

  "peerDependencies": {
    "apache-arrow": "^15.0.0"
  },

If arrow is good at backwards compatibility (although maybe the issues you ran into are reason enough to not assume that and need manual update across major versions) could consider a broader range specifier.

  "peerDependencies": {
    "apache-arrow": ">=15.0.0"
  },

That would be similar to the pattern of libraries like @geoarrow/geoarrow-js.

domoritz commented 3 months ago

The arrow ipc is backwards compatible. The arrow library is best effort backwards compatible. But yes, that's a separate issue for arquero. Why do we need a peer dependency? Couldn't we specify a broader range for the dependency so that npm can dedupe?

rajsite commented 3 months ago

It's handy for libraries that share constructed objects between each other. Example:

LibA {"dependencies": "arrow@^14"}
LibB {"dependencies": "arrow@^15"}
App depends on LibA and LibB
npm install: no warnings etc
result in:
node_modules/
   LibA
      node_modules/arrow@14.0
   LibB
      node_modules/arrow@15.0

Becomes an issue when in the app:

import {createArrow} from 'LibA';
import {useArrow} from 'LibB';

const myArrow = createArrow(); // constructed with v14 internally
useArrow(myArrow); // the object will be operated on by a library leveraging v15 internally

Versus with peerDependencies:

LibA {"peerDependencies": "arrow@^14"}
LibB {"peerDependencies": "arrow@^15"}
App depends on LibA and LibB
npm install: gives an error saying that peer dependencies cannot be satisfied among my used libraries

The app must either:

  1. Ask nicely that the libs update their support arrow range (which hopefully comes with testing from the libraries that the new major version doesn't break existing behavior)
  2. Manually define overrides in the App package.json to assume the responsibility of forcing the libs to use a shared version of arrow 14 or 15 or manually specify that they each use a separate version.
rajsite commented 3 months ago

Couldn't we specify a broader range for the dependency so that npm can dedupe?

Critical difference is, is it okay if npm sometimes doesn't dedupe or is it potentially an issue if multiple copies of the library exist in the app at different versions. For arrow as a dependency I think it's safer for end users to indicate that an app should really only resolve to a single shared copy of arrow and peerDependencies let you specify that.

Particularly because arquero does have APIs for sharing constructed instances of the shared arrow library (if it was completely internal then would make sense as just a dependency).