vega / vegafusion

Serverside scaling for Vega and Altair visualizations
https://vegafusion.io

Crash when coloring by string column #153

Closed jonmmease closed 2 years ago

jonmmease commented 2 years ago

The following Vega spec results in a PanicException when passed as input to pre_transform_spec:

import vegafusion as vf
spec = r"""
{
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "background": "white",
  "padding": 5,
  "height": 200,
  "style": "cell",
  "data": [
    {
      "name": "df2",
      "values": [
        {
          "x": 1,
          "y": 1,
          "color": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
        },
        {"x": 1, "y": 1, "color": "BBBBBBBBBBBBBBBBBBBBBB"},
        {"x": 1, "y": 1, "color": "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"},
        {
          "x": 1,
          "y": 1,
          "color": "DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD"
        }
      ]
    },
    {
      "name": "data_0",
      "source": "df2",
      "transform": [
        {
          "type": "stack",
          "groupby": ["x"],
          "field": "y",
          "sort": {"field": ["color"], "order": ["descending"]},
          "as": ["y_start", "y_end"],
          "offset": "zero"
        },
        {
          "type": "filter",
          "expr": "isValid(datum[\"y\"]) && isFinite(+datum[\"y\"])"
        }
      ]
    }
  ],
  "signals": [
    {"name": "x_step", "value": 20},
    {
      "name": "width",
      "update": "bandspace(domain('x').length, 0.1, 0.05) * x_step"
    }
  ],
  "marks": [
    {
      "name": "layer_0_marks",
      "type": "rect",
      "style": ["bar"],
      "from": {"data": "data_0"},
      "encode": {
        "update": {
          "fill": {"scale": "color", "field": "color"},
          "ariaRoleDescription": {"value": "bar"},
          "description": {
            "signal": "\"x: \" + (isValid(datum[\"x\"]) ? datum[\"x\"] : \"\"+datum[\"x\"]) + \"; y: \" + (format(datum[\"y\"], \"\")) + \"; color: \" + (isValid(datum[\"color\"]) ? datum[\"color\"] : \"\"+datum[\"color\"])"
          },
          "x": {"scale": "x", "field": "x"},
          "width": {"scale": "x", "band": 1},
          "y": {"scale": "y", "field": "y_end"},
          "y2": {"scale": "y", "field": "y_start"}
        }
      }
    }
  ],
  "scales": [
    {
      "name": "x",
      "type": "band",
      "domain": {"data": "data_0", "field": "x", "sort": true},
      "range": {"step": {"signal": "x_step"}},
      "paddingInner": 0.1,
      "paddingOuter": 0.05
    },
    {
      "name": "y",
      "type": "linear",
      "domain": {"data": "data_0", "fields": ["y_start", "y_end"]},
      "range": [{"signal": "height"}, 0],
      "nice": true,
      "zero": true
    },
    {
      "name": "color",
      "type": "ordinal",
      "domain": {"data": "data_0", "field": "color", "sort": true},
      "range": "category"
    }
  ],
  "axes": [
    {
      "scale": "y",
      "orient": "left",
      "gridScale": "x",
      "grid": true,
      "tickCount": {"signal": "ceil(height/40)"},
      "domain": false,
      "labels": false,
      "aria": false,
      "maxExtent": 0,
      "minExtent": 0,
      "ticks": false,
      "zindex": 0
    },
    {
      "scale": "x",
      "orient": "bottom",
      "grid": false,
      "title": "x",
      "labelAlign": "right",
      "labelAngle": 270,
      "labelBaseline": "middle",
      "zindex": 0
    },
    {
      "scale": "y",
      "orient": "left",
      "grid": false,
      "title": "y",
      "labelOverlap": true,
      "tickCount": {"signal": "ceil(height/40)"},
      "zindex": 0
    }
  ],
  "legends": [{"fill": "color", "symbolType": "square", "title": "color"}]
}
"""
vf.runtime.pre_transform_spec(
    spec, "UTC"
)
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-74-baf77c09bc66> in <cell line: 1>()
----> 1 vf.runtime.pre_transform_spec(
      2     spec, "UTC"
      3 )

~/.cache/pypoetry/virtualenvs/python-kernel-OtKFaj5M-py3.9/lib/python3.9/site-packages/vegafusion/runtime.py in pre_transform_spec(self, spec, local_tz, default_input_tz, row_limit, inline_datasets)
    111         else:
    112             inline_dataset_bytes = self._serialize_inline_datasets(inline_datasets)
--> 113             new_spec, warnings = self.embedded_runtime.pre_transform_spec(
    114                 spec,
    115                 local_tz=local_tz,

PanicException: Failed to get node value: ExternalError("task 191 panicked", ErrorContext { contexts: ["tokio error"] })

Seemingly small changes to the number of characters in the strings in the color column change whether the exception occurs.
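One way to probe that length sensitivity is to regenerate the inline rows with different string lengths and re-run the transform. The sketch below is only an illustration: `with_color_lengths` and `panics` are hypothetical helper names, and it assumes the spec layout from the report (the first `data` entry holds the inline `values`). The `except BaseException` clause is there because PyO3's PanicException derives from BaseException rather than Exception.

```python
import copy
import json
import string


def with_color_lengths(spec: dict, lengths: list[int]) -> dict:
    """Return a copy of `spec` whose inline rows get `color` strings of the given lengths.

    Assumes the layout from the report: spec["data"][0]["values"] is a list of
    row dicts. Row i gets one repeated letter (A, B, C, ...) of length lengths[i];
    all other fields are copied from the first original row.
    """
    new_spec = copy.deepcopy(spec)
    template = new_spec["data"][0]["values"][0]
    new_spec["data"][0]["values"] = [
        {**template, "color": string.ascii_uppercase[i % 26] * n}
        for i, n in enumerate(lengths)
    ]
    return new_spec


def panics(spec: dict) -> bool:
    """Hypothetical helper: True if pre_transform_spec raises for this spec."""
    import vegafusion as vf  # imported lazily so the builder above works without it

    try:
        vf.runtime.pre_transform_spec(json.dumps(spec), "UTC")
    except BaseException as exc:  # PanicException is not an Exception subclass
        if isinstance(exc, (KeyboardInterrupt, SystemExit)):
            raise
        return True
    return False
```

With the original spec parsed via `json.loads(spec)`, calling `panics(with_color_lengths(parsed, lengths))` for a handful of length combinations would show which ones trigger the crash on a given VegaFusion build.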

jonmmease commented 2 years ago

The thread-level backtrace when running the spec in Rust has more info:

thread 'tokio-runtime-worker' panicked at 'range end index 75 out of range for slice of length 72', library/core/src/slice/index.rs:73:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/7665c3543079ebc3710b676d0fd6951bedfd4b29/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/7665c3543079ebc3710b676d0fd6951bedfd4b29/library/core/src/panicking.rs:142:14
   2: core::slice::index::slice_end_index_len_fail_rt
             at /rustc/7665c3543079ebc3710b676d0fd6951bedfd4b29/library/core/src/slice/index.rs:73:5
   3: core::ops::function::FnOnce::call_once
             at /rustc/7665c3543079ebc3710b676d0fd6951bedfd4b29/library/core/src/ops/function.rs:248:5
   4: core::intrinsics::const_eval_select
             at /rustc/7665c3543079ebc3710b676d0fd6951bedfd4b29/library/core/src/intrinsics.rs:2695:5
   5: core::slice::index::slice_end_index_len_fail
             at /rustc/7665c3543079ebc3710b676d0fd6951bedfd4b29/library/core/src/slice/index.rs:67:9
   6: <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index_mut
             at /rustc/7665c3543079ebc3710b676d0fd6951bedfd4b29/library/core/src/slice/index.rs:315:13
   7: core::slice::index::<impl core::ops::index::IndexMut<I> for [T]>::index_mut
             at /rustc/7665c3543079ebc3710b676d0fd6951bedfd4b29/library/alloc/src/vec/mod.rs:2639:9
   8: <alloc::vec::Vec<T,A> as core::ops::index::IndexMut<I>>::index_mut
             at /rustc/7665c3543079ebc3710b676d0fd6951bedfd4b29/library/alloc/src/vec/mod.rs:2639:9
   9: datafusion_row::writer::RowWriter::set_utf8
             at /home/jmmease/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/57f47ab/datafusion/row/src/writer.rs:236:9
  10: datafusion_row::writer::write_field_utf8
             at /home/jmmease/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/57f47ab/datafusion/row/src/writer.rs:356:5
  11: datafusion_row::writer::write_field
             at /home/jmmease/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/57f47ab/datafusion/row/src/writer.rs:397:17
  12: datafusion_row::writer::write_row
             at /home/jmmease/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/57f47ab/datafusion/row/src/writer.rs:286:17
  13: datafusion::physical_plan::aggregates::row_hash::create_group_rows
             at /home/jmmease/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/57f47ab/datafusion/core/src/physical_plan/aggregates/row_hash.rs:414:9
  14: datafusion::physical_plan::aggregates::row_hash::group_aggregate_batch
             at /home/jmmease/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/57f47ab/datafusion/core/src/physical_plan/aggregates/row_hash.rs:228:40
  15: <datafusion::physical_plan::aggregates::row_hash::GroupedHashAggregateStreamV2 as futures_core::stream::Stream>::poll_next
             at /home/jmmease/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/57f47ab/datafusion/core/src/physical_plan/aggregates/row_hash.rs:161:34
...

In particular, these lines:

9: datafusion_row::writer::RowWriter::set_utf8
             at /home/jmmease/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/57f47ab/datafusion/row/src/writer.rs:236:9

So it looks like something is causing an out-of-bounds slice access in DataFusion's row-writing logic (RowWriter::set_utf8).
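To make the failure mode concrete, here is a toy model of writing a string into a fixed-size row buffer. It is not DataFusion's implementation: `write_utf8` is a hypothetical function, and the offset 40 and string length 35 are invented so that the numbers match the panic message above (end index 75, buffer length 72).

```python
def write_utf8(buf: bytearray, offset: int, text: str) -> int:
    """Copy `text` into `buf` starting at `offset`; return the end offset.

    Mimics Rust's checked slice indexing by raising when the write would run
    past the end of the buffer. The explicit check is needed because Python's
    bytearray slice assignment would otherwise silently grow the buffer.
    """
    data = text.encode("utf-8")
    end = offset + len(data)
    if end > len(buf):
        raise IndexError(
            f"range end index {end} out of range for slice of length {len(buf)}"
        )
    buf[offset:end] = data
    return end


row = bytearray(72)  # buffer sized too small for the value about to be written
try:
    write_utf8(row, 40, "D" * 35)  # 40 + 35 = 75, which exceeds 72
except IndexError as err:
    message = str(err)
```

In this model the error appears only when the string's byte length pushes the end offset past the buffer size, which is consistent with the crash depending on how many characters the strings contain.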

jonmmease commented 2 years ago

Oh, good news. This is a bug that was recently fixed in DataFusion by https://github.com/apache/arrow-datafusion/pull/2968. After updating the DataFusion dependency, this spec works as expected.