jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format

Writing chunked dictionary arrays to IPC currently impossible due to difference in key maps? #1554

Closed: aldanor closed this issue 10 months ago

aldanor commented 10 months ago

If you populate a MutableDictionaryArray, flush it once in a while, and then create a new one for the next chunk, then writing the second chunk fails with:

InvalidArgumentError("Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.")

The problem is:

Am I missing something, or is there a way to do it? (I'll be glad to open a PR if there are any suggestions on the proper way to fix it.)

Might be somewhat related: https://github.com/jorgecarleitao/arrow2/issues/1485

aldanor commented 10 months ago

Actually, it looks like this is partially my misunderstanding: it seems you have to scan through the entire data first to build the dictionary array, and then do a second pass to actually write it.
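To make the two-pass idea concrete, here is a minimal sketch in plain Rust using only the standard library. It does not use arrow2 at all: the function names (encode_independently, build_shared_dictionary, encode_chunk) are hypothetical and exist only for illustration, and the string chunks stand in for real arrays. The first function shows how building a dictionary per chunk gives the same value different keys in different chunks. The other two show the scan-first, write-second approach, where every chunk is encoded against one shared dictionary.

```rust
use std::collections::HashMap;

/// Encode one chunk with its own dictionary (hypothetical helper).
/// Keys are assigned in order of first appearance within this chunk only,
/// so two chunks can map the same value to different keys.
fn encode_independently(chunk: &[&str]) -> (Vec<String>, Vec<u32>) {
    let mut values: Vec<String> = Vec::new();
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut keys = Vec::with_capacity(chunk.len());
    for &v in chunk {
        let next = values.len() as u32;
        let key = *index.entry(v).or_insert_with(|| {
            values.push(v.to_string());
            next
        });
        keys.push(key);
    }
    (values, keys)
}

/// Pass 1 (hypothetical helper): scan every chunk and give each distinct
/// value one stable key, in order of first appearance across all chunks.
fn build_shared_dictionary(chunks: &[Vec<&str>]) -> (Vec<String>, HashMap<String, u32>) {
    let mut values = Vec::new();
    let mut index = HashMap::new();
    for chunk in chunks {
        for &v in chunk {
            if !index.contains_key(v) {
                index.insert(v.to_string(), values.len() as u32);
                values.push(v.to_string());
            }
        }
    }
    (values, index)
}

/// Pass 2 (hypothetical helper): encode one chunk as keys into the shared
/// dictionary. Panics if the chunk holds a value that pass 1 never saw.
fn encode_chunk(chunk: &[&str], index: &HashMap<String, u32>) -> Vec<u32> {
    chunk.iter().map(|v| index[*v]).collect()
}
```

For the chunks ["a", "b"] and ["b", "c", "a"], independent encoding gives "a" key 0 in the first chunk and key 2 in the second, which is the kind of dictionary mismatch the error message complains about. With the shared dictionary, "a" is key 0 in both chunks, at the cost of reading all the data once before writing anything.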

But then, again, the same question remains: