Great job sleuthing a minimal repro 🎉
The issue originates from a bug in arrow-json v33.0.0, which we're currently on.
Namely, when iterating over the record batches in arrow_json::writer::record_batches_to_json_rows, an auxiliary vector doesn't get sliced properly, leading to an out-of-bounds access attempt. This has been fixed in newer arrow-json versions (see https://github.com/apache/arrow-rs/pull/3924 and https://github.com/apache/arrow-rs/pull/3934), so we'll pick the fix up eventually (it should be in v36.0.0).
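For reference, a minimal standalone sketch of the kind of input described above: two single-row timestamp batches (standing in for the UNION ALL output) written through arrow-json's line-delimited writer. This is untested and not taken from the issue itself, and the `write_batches` signature shown matches the arrow 33.x API in use here:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::error::ArrowError;
use arrow::json::LineDelimitedWriter;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    // One nullable timestamp column, mirroring the query result described in the report.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "t",
        DataType::Timestamp(TimeUnit::Nanosecond, None),
        true,
    )]));

    // Two single-row batches, similar to what a UNION ALL of two SELECTs can produce.
    let make_batch = |v: i64| -> Result<RecordBatch, ArrowError> {
        let col: ArrayRef = Arc::new(TimestampNanosecondArray::from(vec![v]));
        RecordBatch::try_new(schema.clone(), vec![col])
    };
    let batches = vec![make_batch(1)?, make_batch(2)?];

    let mut buf = Vec::new();
    {
        // On arrow-json 33.0.0, writing several batches like this goes through the
        // code path described above; newer versions handle it correctly.
        let mut writer = LineDelimitedWriter::new(&mut buf);
        writer.write_batches(&batches)?;
        writer.finish()?;
    }
    println!("{}", String::from_utf8_lossy(&buf));
    Ok(())
}
```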
In the meantime I can add something along the following lines as a mitigation (it could pose a problem for very large outputs, as it roughly doubles the memory taken up by the record batch rows/columns):
```diff
@@ -106,11 +107,12 @@ async fn physical_plan_to_json(
     context: Arc<DefaultSeafowlContext>,
     physical: Arc<dyn ExecutionPlan>,
 ) -> Result<Vec<u8>, DataFusionError> {
+    let schema_ref = physical.schema();
     let batches = context.collect(physical).await?;
     let mut buf = Vec::new();
     let mut writer = LineDelimitedWriter::new(&mut buf);
     writer
-        .write_batches(&batches)
+        .write_batches(&[concat_batches(&schema_ref, batches.iter())?])
         .map_err(DataFusionError::ArrowError)?;
     writer.finish().map_err(DataFusionError::ArrowError)?;
     Ok(buf)
```
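Here is the same workaround as a self-contained sketch outside of Seafowl's physical_plan_to_json (the helper name batches_to_json and its signature are illustrative, not Seafowl's API): all collected batches are merged into a single RecordBatch with arrow::compute::concat_batches, so the JSON writer only ever sees one batch.

```rust
use std::sync::Arc;

use arrow::compute::concat_batches;
use arrow::datatypes::Schema;
use arrow::error::ArrowError;
use arrow::json::LineDelimitedWriter;
use arrow::record_batch::RecordBatch;

fn batches_to_json(schema: Arc<Schema>, batches: &[RecordBatch]) -> Result<Vec<u8>, ArrowError> {
    // A single concatenated batch side-steps the per-batch slicing bug in arrow-json 33.0.0.
    let merged = concat_batches(&schema, batches.iter())?;

    let mut buf = Vec::new();
    {
        let mut writer = LineDelimitedWriter::new(&mut buf);
        writer.write_batches(&[merged])?;
        writer.finish()?;
    }
    Ok(buf)
}
```

The concatenated copy coexists with the original batches for the duration of the call, which is where the doubled memory usage mentioned above comes from.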
Running this query:
results in a `Trying to access an element at index 1 from a PrimitiveArray of length 1` panic. If I remove the `UNION ALL` or cast the timestamp to text before returning it, the error doesn't happen: