observablehq / stdlib

The Observable standard library.
https://observablehq.com/@observablehq/standard-library
ISC License
957 stars 83 forks source link

Infer schemas and coerce data for table cells #346

Closed libbey-observable closed 1 year ago

libbey-observable commented 1 year ago

Resolves https://github.com/observablehq/observablehq/issues/9862 and https://github.com/observablehq/observablehq/issues/9673

In this inference approach, we take a sample (currently the first 100 rows), and for each column, count how many times we encounter each possible data type. Then, if > 90% of the sampled values conform to that type, we take the most frequently encountered type as the column's type. If not, we use type "other."

These changes include some support for upcoming column type assertions, but for the purposes of this PR, I'm testing against main.

https://user-images.githubusercontent.com/111310561/214719295-8326f325-879b-4b8a-9d9f-2397b28951de.mov

mkfreeman commented 1 year ago

In the (soon to be outdated) schema inference in computeSummaries, we iterate through all of the rows in the dataset to ensure that we get all the keys:

// We need to show *all* keys present in the array of Objects
function getAllKeys(rows) {
  const keys = new Set();
  for (const row of rows) {
    // avoid crash if row is null or undefined
    if (row) {
      // only enumerable properties
      for (const key in row) {
        // only own properties
        if (Object.prototype.hasOwnProperty.call(row, key)) {
          // unique properties, in the order they appear
          keys.add(key);
        }
      }
    }
  }
  return Array.from(keys);
}

I don't see a comparable replacement for that in this PR - are we foregoing that step moving forward (and if so, how will we handle columns that don't get typed because they aren't present in the sample...)?

libbey-observable commented 1 year ago

In the (soon to be outdated) schema inference in computeSummaries, we iterate through all of the rows in the dataset to ensure that we get all the keys: I don't see a comparable replacement for that in this PR - are we foregoing that step moving forward

Added, thanks @mkfreeman!

mkfreeman commented 1 year ago

We'll need to handle the case when a user asserts that an array of primitives is a type that, when coerced, returns all null values. For example, if the data is d3.range(10), and the user selected the type boolean.

This will currently break because arrayIsPrimitive will return false, and the worker won't convert the primitive values into objects:

// This code path will not be entered in the case described above
let value = cellValue;
if (arrayIsPrimitive(cellValue)) { // not true if an array of numbers is set to boolean
  value = Array.from(cellValue, (d) => ({value: d}));
  value.schema = cellValue.schema;
}

I suggest we handle this at the stdlib level my adding the following code:

export function arrayIsPrimitive(value) {
  return (
    isTypedArray(value) ||
    arrayContainsPrimitives(value) ||
    arrayContainsDates(value) ||
    arrayContainsNulls(value) // checks if all values are null
  );
}

// Given an array, checks that the first n elements are all null or undefined.
// This allows an array of all null or undefined values to be handled in the
// table cell, such as if a user asserts a type that coerces all array values to
// null
export function arrayContainsNulls(value) {
  const n = Math.min(nChecks, value.length);
  if (!(n > 0)) return false;
  return value.every(element => element == null)
}
libbey-observable commented 1 year ago

We'll need to handle the case when a user asserts that an array of primitives is a type that, when coerced, returns all null values. For example, if the data is d3.range(10), and the user selected the type boolean.

This will currently break because arrayIsPrimitive will return false, and the worker won't convert the primitive values into objects:

Thanks for raising this, @mkfreeman! I see what you mean about arrays of null values. If we want to treat those as arrays as primitives, we may want to stop ignoring them in the arrayIsPrimitive function, maybe something like what's shown below. But I'd be curious to hear @mbostock's thoughts on whether we do want to treat them that way?

Screen Shot 2023-02-01 at 8 54 15 AM

mbostock commented 1 year ago

I’ve been wondering if we should drop support for arrays of primitives since it seems to be adding a lot of complexity…

mkfreeman commented 1 year ago

I’ve been wondering if we should drop support for arrays of primitives since it seems to be adding a lot of complexity…

As a DTC user, I feel strongly that we should continue supporting arrays of primitives, as it allows a quick visual summary of arrays (of primitives...).

In terms of your proposed solution @libbey-observable, I think we need arrayContainsPrimitives to return true if the values are a mix of a primitive type and null/undefined. We could support that via a different function (my suggestion above), or I think we'd need to tweak the switch statement to add a new case for undefined (or the case when the type is object for null values), which feels a little tricky because we don't want to set the type to null unless they are all null.

annie commented 1 year ago

(related to primitive arrays but sorry if this is too tangential for this PR) i was wondering why we coerce primitive arrays to {value} in the worker, if we're already doing it here? i see it in both requestTableSummary and requestTableRows. is that code redundant, or does it cover some other case?

libbey-observable commented 1 year ago

i was wondering why we coerce primitive arrays to {value} in the worker, if we're already doing it here? i see it in both requestTableSummary and requestTableRows. is that code redundant, or does it cover some other case?

Thanks, @annie! I was actually just wondering the same thing. When I try removing those steps from worker/table.js, at least with the stdlib changes here (where an array of primitives gets promoted to objects on load), I'm not seeing any issues. I think we may be able to remove those, but would maybe wait until these stdlib changes are released, if we do go this route. And would test more.