pola-rs / nodejs-polars

nodejs front-end of polars
https://pola-rs.github.io/nodejs-polars/
MIT License

[NodeJS]: readIPC from buffer fails with 'Arrow file does not contain correct header', while it works in ArrowJS #109

Open 0xgeert opened 2 years ago

0xgeert commented 2 years ago

Using Node.JS

What version of polars are you using?

"nodejs-polars": "^0.2.0"

What operating system are you using polars on?

MacOS Big Sur 11.1

Describe your bug.

Reading a buffer from an .ipc (ArrowStream) file with readIPC fails with Error: Arrow file does not contain correct header. The file itself is not corrupt, since it loads fine with apache-arrow's Table.from method.

What are the steps to reproduce the behavior?

See the code example below. I'll post both the .arrow file (works) and the .ipc file (doesn't work) as attachments.

const pl = require('nodejs-polars');
const { Table } = require('apache-arrow');
const { readFileSync } = require('fs');

const fromArrow = readFileSync('hits.arrow');
const fromIPC = readFileSync('hits.ipc');

// Read Arrow file with Arrow.js -> works
const df = Table.from([fromArrow]);
console.log("df", df.count()); // 10

// Read Arrow file with polars -> works
const dfPolars = pl.readIPC(fromArrow);
console.log("dfPolars", dfPolars); // prints nice table with 10 entries

// Read IPC (ArrowStream) file with Arrow.js -> works
const dfIpc = Table.from([fromIPC]);
console.log("dfIpc", dfIpc.count()); // 10

// Read IPC (ArrowStream) file with polars -> fails
const dfIpcPolars = pl.readIPC(fromIPC); // throws Error: Arrow file does not contain correct header
console.log("dfIpcPolars", dfIpcPolars); // never reached
ritchie46 commented 2 years ago

The IPC readers are implemented upstream. Could you open this issue there instead? https://github.com/jorgecarleitao/arrow2

jorgecarleitao commented 2 years ago

I am a bit surprised about pl.readIPC(fromArrow) and pl.readIPC(fromIPC): shouldn't these be two different signatures? Reading a stream (.ipc) is one thing; reading a file (.arrow) is another. I think we are just missing a readIPCStream in Polars' API that can read Arrow streams (as opposed to Arrow files).
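
To make the file/stream distinction concrete, here is a minimal sketch that guesses which of the two formats a buffer holds from its leading bytes. It relies on two facts from the Arrow IPC format: a file starts with the magic string "ARROW1", and a stream written by current Arrow versions starts with the continuation marker 0xFFFFFFFF. The helper name detectArrowIpcFormat is hypothetical and is not part of nodejs-polars or apache-arrow. Streams written by old Arrow versions lack the continuation marker and would be reported as 'unknown' here.

```javascript
// The Arrow IPC *file* format starts (and ends) with the 6-byte magic "ARROW1".
const ARROW_MAGIC = Buffer.from('ARROW1', 'ascii');

// Hypothetical helper for illustration; not an API of nodejs-polars or apache-arrow.
function detectArrowIpcFormat(buf) {
  // File format: leading magic bytes.
  if (
    buf.length >= ARROW_MAGIC.length &&
    buf.subarray(0, ARROW_MAGIC.length).equals(ARROW_MAGIC)
  ) {
    return 'file';
  }
  // Stream format (current writers): each message is prefixed with the
  // continuation marker 0xFFFFFFFF, followed by the metadata length.
  if (buf.length >= 4 && buf.readUInt32LE(0) === 0xffffffff) {
    return 'stream';
  }
  return 'unknown';
}
```

Calling detectArrowIpcFormat(readFileSync('hits.ipc')) before pl.readIPC would be one way to check whether a buffer is in the file format that readIPC expects.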

ritchie46 commented 2 years ago

Ah, no, Polars doesn't make that distinction. So IPC is the stream, and the .arrow file is the Feather file: the IPC data plus additional headers?

Then we must add this.

joshuataylor commented 2 years ago

Hi!

I'm keen to get this into Polars: Snowflake uses this as its response format, and it would be awesome to read data straight from SF into Polars.

Here is a quick primer on streaming files from Arrow: https://arrow.apache.org/docs/python/ipc.html. And here is the arrow2 guide on reading the stream: https://jorgecarleitao.github.io/arrow2/io/ipc_stream_read.html

IMHO, supporting files initially is fine; other streaming support can come later.

I've started looking into this, and the major blocker I can see is projections.

In arrow2, projections are not supported here: https://github.com/jorgecarleitao/arrow2/blob/main/src/io/ipc/read/stream.rs#L185

So we will need to build the projection from the chunks.
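
As a rough illustration of what building the projection from the chunks could look like, the sketch below selects columns by index from each chunk after it has been read. It is written in JavaScript for consistency with the rest of this thread, whereas arrow2 itself is Rust. The chunk shape ({ columns: [...] }) and the names projectChunk and projectStream are assumptions made for this example, not arrow2 or Polars APIs.

```javascript
// Hypothetical chunk shape for illustration only: { columns: [col0, col1, ...] }.
// Real arrow2 chunks are Rust values; this only mirrors the idea.

// Keep only the columns whose indices appear in `projection`, in that order.
function projectChunk(chunk, projection) {
  return {
    columns: projection.map((i) => {
      if (i < 0 || i >= chunk.columns.length) {
        throw new RangeError(`column index ${i} out of bounds`);
      }
      return chunk.columns[i];
    }),
  };
}

// Apply the projection lazily to every chunk coming out of a stream reader.
function* projectStream(chunks, projection) {
  for (const chunk of chunks) {
    yield projectChunk(chunk, projection);
  }
}
```

The trade-off of this approach is that every column is still decoded before the unwanted ones are dropped, so it saves memory downstream but not read time.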

Thoughts?

stinodego commented 1 year ago

Transferring this to the NodeJS repo, as I have no way to reproduce it using Python/Rust. Not sure if this is still relevant.

Bidek56 commented 1 month ago

@0xgeert Please try: pl.read_ipc_stream using py-polars as described here. It works fine for me. Thx