vega / vega-transforms

Data processing transforms for Vega dataflows.
BSD 3-Clause "New" or "Revised" License

"fold" transform breaks for me after 3.0.10 #6

Closed martinvirtel closed 6 years ago

martinvirtel commented 6 years ago

I prepared a Gist:

View: https://bl.ocks.org/martinvirtel/62c6e98bd5e94cb01aabd598a3e8831e

Code: https://gist.github.com/martinvirtel/62c6e98bd5e94cb01aabd598a3e8831e

Same spec, but the "fold" transform cuts the data short. I tried it with 3.1, 3.2, and 3.2.1, with the same result.

I like Vega a lot, thanks for the work!

jheer commented 6 years ago

Thanks for the report! The issue arises from a subtle bug in our transform pipeline analysis. We will fix it for the next release. In the meantime, you can get a working spec by adding a collect transform to your pipeline, between the aggregate and fold transforms.
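The workaround can also be applied mechanically. The following is a minimal sketch, assuming the spec has been loaded into a Python dict (for example with `json.load`). The helper name `insert_collect_before_fold` is hypothetical and not part of Vega; it only rewrites the JSON structure and adds a bare collect wherever a fold directly follows an aggregate.

```python
import copy


def insert_collect_before_fold(spec):
    """Return a copy of a Vega spec dict with {"type": "collect"} inserted
    between any aggregate transform and a fold that directly follows it.

    Hypothetical helper for illustration; it does not call Vega itself.
    """
    patched = copy.deepcopy(spec)  # leave the caller's spec untouched
    for source in patched.get("data", []):
        if "transform" not in source:
            continue
        fixed = []
        for transform in source["transform"]:
            previous_type = fixed[-1].get("type") if fixed else None
            if transform.get("type") == "fold" and previous_type == "aggregate":
                fixed.append({"type": "collect"})
            fixed.append(transform)
        source["transform"] = fixed
    return patched


# Usage: a data source shaped like the one discussed in this thread.
example_spec = {
    "data": [
        {
            "name": "timebins_test",
            "transform": [
                {"type": "aggregate", "groupby": ["guid", "timebin"]},
                {"type": "fold", "fields": ["guid", "package"]},
            ],
        }
    ]
}
patched_spec = insert_collect_before_fold(example_spec)
patched_types = [t["type"] for t in patched_spec["data"][0]["transform"]]
original_types = [t["type"] for t in example_spec["data"][0]["transform"]]
```

Here `patched_types` contains a collect step between the aggregate and the fold, while `example_spec` itself is left unchanged.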

For reference, here is the corrected spec I'm using. It takes your original spec, adds an absolute URL to your data set (hosted in a gist), and inserts a collect transform within your timebins_test data source.

{
  "$schema": "https://vega.github.io/schema/vega/v3.0.json",
  "padding": 5,
  "width": 875,
  "height": 1400,
  "autosize": "pad",
  "signals": [
    {
      "name": "format",
      "value": {
        "headline": { "fontSize": 38, "linelength": 35, "yoffset": -25 },
        "rightcolumn": { "width": 130, "opacity": 0.6 },
        "topbar": { "height": 130 },
        "bottombar": { "height": 130 }
      }
    },
    {
      "name": "ressortfilter",
      "value": "sp"
    }
  ],
  "scales": [
    {
      "name": "ressortscale",
      "type": "ordinal",
      "domain": ["sp", "ku", "vm", "wi", "pl"],
      "scheme": "category10"
    },
    {
      "name": "rowscale",
      "type": "band",
      "range": [
        { "signal": "format.topbar.height+((height-format.bottombar.height)/20)" },
        { "signal": "height - format.bottombar.height + ((height-format.bottombar.height)/20)" }
      ],
      "domain": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    },
    {
      "name": "tablexscale",
      "type": "time",
      "range": [0, { "signal": "width- format.rightcolumn.width" }],
      "domain": { "data": "timebins", "field": "timebin" }
    },
    {
      "name": "areacharty",
      "type": "linear",
      "range": [{ "signal": "(height - (format.topbar.height + format.bottombar.height))/20" }, 0],
      "domain": { "data": "timebins", "field": "count" }
    }
  ],
  "data": [
    {
      "name": "csvdata",
      "format": { "type": "csv", "parse": { "timebin": "date" } },
      "url": "https://gist.githubusercontent.com/martinvirtel/62c6e98bd5e94cb01aabd598a3e8831e/raw/a2f6a36291c057b1c428c7b8c627a3aa885d000b/data.csv",
      "transform": [
        {
          "type": "formula",
          "expr": "(datum.p=='False' || datum.parent==datum.guid) ? null : datum.parent",
          "as": "package"
        }
      ]
    },
    {
      "name": "found",
      "source": "csvdata",
      "transform": [
        {
          "type": "filter",
          "expr": "(ressortfilter == '*') || (datum.ressort == ressortfilter)"
        }
      ]
    },
    {
      "name": "packagestats",
      "source": "found",
      "transform": [
        {
          "type": "aggregate",
          "groupby": ["package", "guid"],
          "fields": ["guid", "ressort", "title_package", "title"],
          "ops": ["count", "max", "max", "max"],
          "as": ["guid_count", "ressort", "title_package", "title"]
        },
        {
          "type": "aggregate",
          "groupby": ["package", "ressort"],
          "fields": ["*", "guid_count", "ressort", "title_package", "title"],
          "ops": ["count", "sum", "max", "max", "max"],
          "as": ["urns_in_package", "links_found", "ressort", "title_package", "title"]
        }
      ]
    },
    {
      "name": "timebins_test",
      "source": "found",
      "transform": [
        { "type": "joinaggregate", "groupby": ["guid"], "ops": ["count"], "as": ["timebin_count"] },
        {
          "type": "aggregate",
          "groupby": ["guid", "timebin"],
          "fields": ["ratio", "ratio", "ressort", "title", "title_package", "timebin_count", "package", "package"],
          "ops": ["mean", "count", "max", "max", "max", "max", "max", "max"],
          "as": ["ratio", "count", "ressort", "title", "title_package", "timebin_count", "package", "package_urn"]
        },
        { "type": "collect" },
        { "type": "fold", "fields": ["guid", "package"] }
      ]
    },
    {
      "name": "timebins",
      "source": "found",
      "transform": [
        { "type": "joinaggregate", "groupby": ["guid"], "ops": ["count"], "as": ["timebin_count"] },
        {
          "type": "aggregate",
          "groupby": ["guid", "timebin"],
          "fields": ["ratio", "ratio", "ressort", "title", "title_package", "timebin_count", "package", "package"],
          "ops": ["mean", "count", "max", "max", "max", "max", "max", "max"],
          "as": ["ratio", "count", "ressort", "title", "title_package", "timebin_count", "package", "package_urn"]
        },
        { "type": "fold", "fields": ["guid", "package"], "as": ["type", "urn"] },
        { "type": "filter", "expr": "datum.type=='guid' || (datum.type=='package' && datum.package != null)" },
        { "type": "collect", "sort": { "field": ["timebin", "urn"], "order": ["ascending", "ascending"] } },
        { "type": "extent", "field": "timebin", "signal": "timedomain" },
        { "type": "extent", "field": "timebin_count", "signal": "countdomain" }
      ]
    },
    {
      "name": "stories",
      "source": "timebins",
      "transform": [
        {
          "type": "aggregate",
          "groupby": ["urn"],
          "fields": ["ratio", "count", "ressort", "title", "title_package", "type", "package_urn", "timebin_count"],
          "ops": ["mean", "sum", "max", "max", "max", "max", "max", "max"],
          "as": ["ratio", "count", "ressort", "title", "title_package", "type", "package_urn", "timebin_count"]
        },
        {
          "type": "lookup",
          "from": "packagestats",
          "key": "package",
          "fields": ["urn"],
          "values": ["urns_in_package", "links_found"]
        },
        {
          "type": "formula",
          "expr": "datum.urns_in_package ? datum.urns_in_package : 1 ",
          "as": "urns_in_package"
        },
        {
          "type": "formula",
          "expr": "datum.type=='package' ? datum.links_found : datum.count ",
          "as": "count"
        },
        {
          "type": "filter",
          "expr": "datum.type=='guid' || (datum.type=='package' && datum.urns_in_package>2)"
        },
        {
          "type": "formula",
          "expr": "datum.type == 'package' ? '◎ ' + datum.title_package : datum.title ",
          "as": "title"
        },
        {
          "type": "formula",
          "expr": "datum.title.type == 'package' ? '◎ ' + datum.title_package : datum.title ",
          "as": "title"
        },
        {
          "type": "formula",
          "expr": "replace(datum.title,regexp('^(.{1,'+format.headline.linelength+'})( ((.*)|$))'),'$1')",
          "as": "t1"
        },
        {
          "type": "formula",
          "expr": "replace(datum.title,regexp('^(.{1,'+format.headline.linelength+'})( ((.*)|$))'),'$3')",
          "as": "t2"
        },
        { "type": "collect", "sort": { "field": ["count", "ratio"], "order": ["descending", "descending"] } },
        { "type": "window", "ops": ["rank"] }
      ]
    },
    {
      "name": "top10timebins",
      "source": "timebins",
      "transform": [
        { "type": "lookup", "from": "stories", "key": "urn", "fields": ["urn"], "as": ["story"] },
        { "type": "filter", "expr": "datum.story &&  datum.story.rank < 11" },
        { "type": "collect", "sort": { "field": ["timebin"], "order": ["ascending"] } }
      ]
    },
    {
      "name": "top10stories",
      "source": "stories",
      "transform": [{ "type": "filter", "expr": "datum.rank < 11" }]
    }
  ],
  "marks": [
    {
      "type": "group",
      "description": "table",
      "from": {
        "facet": {
          "data": "top10timebins",
          "name": "rows",
          "groupby": "urn",
          "aggregate": {
            "fields": ["story.rank", "title", "ressort"],
            "ops": ["max", "max", "max"],
            "as": ["rank", "stitle", "ressort"]
          }
        }
      },
      "encode": {
        "enter": {
          "y": { "field": "rank", "scale": "rowscale" },
          "x": { "value": 0 },
          "width": { "signal": "width" },
          "height": { "band": true, "scale": "rowscale" }
        },
        "update": {
          "y": { "field": "rank", "scale": "rowscale" },
          "x": { "value": 0 },
          "width": { "signal": "width" },
          "height": { "band": true, "scale": "rowscale" }
        }
      },
      "marks": [
        {
          "type": "area",
          "from": { "data": "rows" },
          "description": "fieberkurve",
          "encode": {
            "enter": {
              "x": { "field": "timebin", "scale": "tablexscale" },
              "y2": { "value": 0, "scale": "areacharty" },
              "y": { "field": "count", "scale": "areacharty" },
              "strokeWidth": [{ "test": "datum.story.type === 'package'", "value": 1 }, { "value": 1 }],
              "fillOpacity": { "signal": "format.rightcolumn.opacity" },
              "fill": { "field": "story.ressort", "scale": "ressortscale" },
              "stroke": { "field": "story.ressort", "scale": "ressortscale" },
              "strokeOpacity": { "value": 0.3 }
            },
            "update": {
              "x": { "field": "timebin", "scale": "tablexscale" },
              "y2": { "value": 0, "scale": "areacharty" },
              "y": { "field": "count", "scale": "areacharty" },
              "strokeWidth": [{ "test": "datum.story.type === 'package'", "value": 1 }, { "value": 1 }],
              "fillOpacity": { "signal": "format.rightcolumn.opacity" },
              "fill": { "field": "story.ressort", "scale": "ressortscale" },
              "stroke": { "field": "story.ressort", "scale": "ressortscale" },
              "strokeOpacity": { "value": 0.3 },
              "interpolate": { "value": "linear" }
            }
          }
        }
      ]
    },
    {
      "type": "group",
      "description": "table text",
      "from": { "data": "top10stories" },
      "encode": {
        "enter": {
          "y": { "field": "rank", "scale": "rowscale" },
          "x": { "value": 0 },
          "width": { "signal": "width" },
          "height": { "band": true, "scale": "rowscale" }
        },
        "update": {
          "y": { "field": "rank", "scale": "rowscale" },
          "x": { "value": 0 },
          "width": { "signal": "width" },
          "height": { "band": true, "scale": "rowscale" }
        }
      },
      "marks": [
        {
          "type": "text",
          "description": "Titel Zeile 1",
          "encode": {
            "enter": {
              "x": { "value": 0 },
              "y": { "signal": "0+format.headline.yoffset" },
              "text": { "signal": "parent.t1" },
              "strokeWidth": { "value": 0 },
              "fillOpacity": { "value": 0.8 },
              "fill": { "value": "#000000" },
              "fontWeight": { "value": "bold" },
              "fontSize": { "signal": "format.headline.fontSize" },
              "strokeOpacity": { "value": 1 }
            },
            "update": {
              "x": { "value": 0 },
              "y": { "signal": "0+format.headline.yoffset" },
              "text": { "signal": "parent.t1" },
              "strokeWidth": { "value": 0 },
              "fillOpacity": { "value": 0.8 },
              "fill": { "value": "#000000" },
              "fontWeight": { "value": "bold" },
              "fontSize": { "signal": "format.headline.fontSize" },
              "strokeOpacity": { "value": 1 }
            }
          }
        },
        {
          "type": "text",
          "description": "Titel Zeile 2",
          "encode": {
            "enter": {
              "x": { "value": 0 },
              "y": { "signal": "format.headline.fontSize*1.2+format.headline.yoffset" },
              "text": { "signal": "parent.t2" },
              "strokeWidth": { "value": 0 },
              "fillOpacity": { "value": 0.8 },
              "fill": { "value": "#000000" },
              "fontSize": { "signal": "format.headline.fontSize" },
              "limit": { "signal": "width*0.8" },
              "fontWeight": { "value": "bold" },
              "strokeOpacity": { "value": 1 }
            },
            "update": {
              "x": { "value": 0 },
              "y": { "signal": "format.headline.fontSize*1.2+format.headline.yoffset" },
              "text": { "signal": "parent.t2" },
              "strokeWidth": { "value": 0 },
              "fillOpacity": { "value": 0.8 },
              "fill": { "value": "#000000" },
              "fontSize": { "signal": "format.headline.fontSize" },
              "limit": { "signal": "width*0.8" },
              "fontWeight": { "value": "bold" },
              "strokeOpacity": { "value": 1 }
            }
          }
        },
        {
          "type": "rect",
          "description": "right column",
          "encode": {
            "enter": {
              "x": { "signal": "width-format.rightcolumn.width" },
              "width": { "signal": "format.rightcolumn.width" },
              "y2": { "value": 0, "scale": "areacharty" },
              "height": { "band": true, "scale": "rowscale" },
              "strokeWidth": { "value": 3 },
              "fillOpacity": { "signal": "format.rightcolumn.opacity" },
              "fill": { "signal": "parent.ressort", "scale": "ressortscale" },
              "stroke": { "value": "#ffffff" },
              "strokeOpacity": { "value": 1 }
            },
            "update": {
              "x": { "signal": "width-format.rightcolumn.width" },
              "width": { "signal": "format.rightcolumn.width" },
              "y2": { "value": 0, "scale": "areacharty" },
              "height": { "band": true, "scale": "rowscale" },
              "strokeWidth": { "value": 3 },
              "fillOpacity": { "signal": "format.rightcolumn.opacity" },
              "fill": { "signal": "parent.ressort", "scale": "ressortscale" },
              "stroke": { "value": "#ffffff" },
              "strokeOpacity": { "value": 1 }
            }
          }
        },
        {
          "type": "text",
          "description": "Rang",
          "encode": {
            "enter": {
              "x": { "signal": "(width-format.rightcolumn.width)+(format.rightcolumn.width/2)" },
              "y": { "value": -20 },
              "text": { "signal": "(parent.rank)+' '+parent.ressort" },
              "align": { "value": "center" },
              "fontWeight": { "value": "bold" },
              "strokeWidth": { "value": 1 },
              "fillOpacity": { "value": 1 },
              "fontSize": { "value": 45 },
              "fill": { "value": "#FFFFFF" },
              "strokeOpacity": { "value": 1 }
            },
            "update": {
              "x": { "signal": "(width-format.rightcolumn.width)+(format.rightcolumn.width/2)" },
              "y": { "value": -20 },
              "text": { "signal": "(parent.rank)+' '+parent.ressort" },
              "align": { "value": "center" },
              "fontWeight": { "value": "bold" },
              "strokeWidth": { "value": 1 },
              "fillOpacity": { "value": 1 },
              "fontSize": { "value": 45 },
              "fill": { "value": "#FFFFFF" },
              "strokeOpacity": { "value": 1 }
            }
          }
        },
        {
          "type": "text",
          "description": "Fundstellen, Prozent",
          "encode": {
            "enter": {
              "x": { "signal": "(width-format.rightcolumn.width)+(format.rightcolumn.width/2)" },
              "y": { "value": 20 },
              "text": { "signal": "parent.count +'x '+ format(parent.ratio*100,',.0f') + '%'" },
              "align": { "value": "center" },
              "fontWeight": { "value": "bold" },
              "strokeWidth": { "value": 1 },
              "fillOpacity": { "value": 1 },
              "fontSize": { "value": 25 },
              "fill": { "value": "#FFFFFF" },
              "strokeOpacity": { "value": 1 }
            },
            "update": {
              "x": { "signal": "(width-format.rightcolumn.width)+(format.rightcolumn.width/2)" },
              "y": { "value": 20 },
              "text": { "signal": "parent.count +'x '+ format(parent.ratio*100,',.0f') + '%'" },
              "align": { "value": "center" },
              "fontWeight": { "value": "bold" },
              "strokeWidth": { "value": 1 },
              "fillOpacity": { "value": 1 },
              "fontSize": { "value": 25 },
              "fill": { "value": "#FFFFFF" },
              "strokeOpacity": { "value": 1 }
            }
          }
        },
        {
          "type": "text",
          "description": "x Artikel",
          "encode": {
            "enter": {
              "x": { "signal": "(width-format.rightcolumn.width)+(format.rightcolumn.width/2)" },
              "y": { "value": 53 },
              "text": { "signal": "parent.urns_in_package + ' Artikel'" },
              "align": { "value": "center" },
              "fontWeight": { "value": "bold" },
              "strokeWidth": { "value": 1 },
              "fillOpacity": { "value": 1 },
              "fontSize": { "value": 25 },
              "fill": { "value": "#FFFFFF" },
              "strokeOpacity": { "value": 1 }
            },
            "update": {
              "x": { "signal": "(width-format.rightcolumn.width)+(format.rightcolumn.width/2)" },
              "y": { "value": 53 },
              "text": { "signal": "parent.urns_in_package + ' Artikel'" },
              "align": { "value": "center" },
              "fontWeight": { "value": "bold" },
              "strokeWidth": { "value": 1 },
              "fillOpacity": { "value": 1 },
              "fontSize": { "value": 25 },
              "fill": { "value": "#FFFFFF" },
              "strokeOpacity": { "value": 1 }
            }
          }
        }
      ]
    },
    {
      "type": "group",
      "description": "topbar",
      "encode": {
        "enter": {
          "x": { "value": 0 },
          "y": { "value": 0 },
          "width": { "signal": "width" },
          "height": { "signal": "format.topbar.height" },
          "fill": { "value": "#FF0000" },
          "opacity": { "value": 0.2 }
        }
      },
      "marks": [ ]
    },
    {
      "type": "group",
      "description": "bottombar",
      "encode": {
        "enter": {
          "x": { "value": 0 },
          "y": { "signal": "height-format.bottombar.height" },
          "width": { "signal": "width" },
          "height": { "signal": "format.bottombar.height" },
          "fill": { "value": "#00FF00" },
          "opacity": { "value": 0.2 },
          "z-Index": { "value": 100 }
        }
      },
      "marks": [ ]
    }
  ]
}

jheer commented 6 years ago

Notes for future correction:

The fold transform serves as a tuple source, providing a materialized set of tuples that can be used by downstream operators and made available as pulse.source. However, the fold transform also requires access to an upstream source to generate the output tuple set -- something that transforms such as aggregate do not supply, requiring an intermediate collect transform to materialize the current tuple set.

The issue here exposes an error in our metadata tracking and logic (parsers/data.js in vega-parser). A necessary fix may be to update transform metadata definitions with additional tracking information.
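To make the fold behaviour in these notes concrete, here is a toy Python model. It is a sketch of the transform's documented output shape only, not Vega's implementation: the function `fold` and the sample rows are illustrative assumptions. It shows why fold needs the complete upstream tuple set: every input row is expanded into one output row per folded field.

```python
def fold(rows, fields, as_=("key", "value")):
    """Toy model of a fold: for each input row and each listed field, emit a
    copy of the row with two extra properties holding the field name and the
    field value. Not Vega's actual implementation."""
    key_name, value_name = as_
    folded = []
    for row in rows:  # needs the full, materialized list of rows
        for field_name in fields:
            derived = dict(row)  # keep the original properties
            derived[key_name] = field_name
            derived[value_name] = row.get(field_name)
            folded.append(derived)
    return folded


# Illustrative rows (made up), shaped like the aggregate output in the spec.
rows = [
    {"guid": "g1", "package": "p1", "count": 3},
    {"guid": "g2", "package": None, "count": 1},
]
folded = fold(rows, ["guid", "package"], as_=("type", "urn"))
```

With two input rows and two folded fields, `folded` holds four rows. If the upstream operator exposes only part of the tuple set, the output is cut short in the same way as in the original report.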

martinvirtel commented 6 years ago

Thanks Jeffrey!

To the layman, the data "in between" the transforms always appears to have the same shape (an array of objects), so that you can pipe the output of any transform into any other as input. In that simple world, a lone "collect" is a no-op. But that seems to be an oversimplification. Can you point me to a place in the docs where I can get some insight into your thinking, architecture-wise?

jheer commented 6 years ago

One aspect to note is that data updates pass through operators (added, removed, or modified tuples) via a Pulse object, which also provides a way to access the full set of current tuples, if needed. Collectors are not a no-op as:

  1. Collect operators track all updates to materialize a "snapshot" of the current state of the data. When operators request access to the full data from a Pulse object, the backing array is often created and maintained by an upstream collect operator.
  2. Collect operators can also perform sorting on this snapshot.
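The two points above can be sketched with a toy model in Python. This is an illustration under stated assumptions, not Vega's code: the `Pulse` and `Collect` classes below are simplified stand-ins, and identifying tuples by an `id` property is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Pulse:
    """Simplified stand-in for a pulse: a changeset plus an optional snapshot."""
    add: list = field(default_factory=list)
    rem: list = field(default_factory=list)
    mod: list = field(default_factory=list)
    source: Optional[list] = None  # full current tuple set, if materialized


class Collect:
    """Toy collector: tracks changes and materializes the current tuples."""

    def __init__(self, sort_key: Optional[Callable] = None):
        self.rows = {}  # keyed by the assumed "id" property
        self.sort_key = sort_key

    def run(self, pulse: Pulse) -> Pulse:
        for row in pulse.rem:
            self.rows.pop(row["id"], None)
        for row in pulse.add + pulse.mod:
            self.rows[row["id"]] = row
        snapshot = list(self.rows.values())
        if self.sort_key is not None:
            snapshot.sort(key=self.sort_key)  # point 2: optional sorting
        pulse.source = snapshot  # point 1: snapshot for downstream operators
        return pulse


collector = Collect(sort_key=lambda row: row["count"])
first = collector.run(Pulse(add=[{"id": 1, "count": 5}, {"id": 2, "count": 2}]))
first_ids = [row["id"] for row in first.source]
second = collector.run(
    Pulse(
        add=[{"id": 3, "count": 9}],
        rem=[{"id": 2, "count": 2}],
        mod=[{"id": 1, "count": 1}],
    )
)
second_ids = [row["id"] for row in second.source]
```

The second pulse carries only three changes, yet `second.source` still exposes the whole current data set. A downstream operator that needs all tuples relies on that snapshot.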

For more, you might be interested in our research paper describing Vega's architecture. The paper was written relative to Vega 2, but the main concepts still apply: http://idl.cs.washington.edu/papers/reactive-vega-architecture

(In particular, see section 4.2: Changesets and Materialization.)

martinvirtel commented 6 years ago

Thanks for the pointer!

Are you aware of any live examples that show the update/add/delete capabilities of Vega visualizations in action, to show changes in the data (think "time-lapse")?

M


jheer commented 6 years ago

Fixed in vega-transforms v1.3.1. Will be included in the next Vega release.