observablehq / stdlib

The Observable standard library.
https://observablehq.com/@observablehq/standard-library
ISC License
957 stars 83 forks source link

Add missing filter options #331

Closed mkfreeman closed 1 year ago

mkfreeman commented 1 year ago

This is a small code change that warrants a detailed explanation. For context, the __table function creates JavaScript array filters that mimic SQL filter functions. This PR is concerned with the way we think about NULL values. The SQL code that gets constructed by the Table cell reads:

// If an "IS NULL" filter is applied to a column, add a SQL "IS NULL" statement
case "n":
        appendSql(` IS NULL`, args);
        return;

The corresponding JavaScript code currently reads:

// If an "IS NULL" filter is applied to a column, filter down to JavaScript `null` OR undefined values
case "n": {
        source = source.filter((d) => d[column] == null)

While this makes sense at face value, the concept of null means something different in JavaScript and SQL. In SQL, "A field with a NULL value is a field with no value." (source). In JavaScript, null is a primitive value which may appear in the dataset.

Currently, the way the __table function checks for null values doesn't align with the way the Observable Data Table cell checks for NULL/EMPTY. When computing summaries for the Summary Charts, the Data Table cell uses the following check:

 isMissing = (value) => {
  // Note that we use Number.isNaN because isNaN can return unexpected results
  // for non-numeric values.
  return value == null || value === "" || Number.isNaN(value);
};

Note, it checks if the value is null or undefined in the same way the standard library currently does, and also includes two additional checks: empty string (value === "") and NaN Number.isNaN(value). These checks importantly identify common JavaScript data patterns that are uncommon (or not possible?) in SQL.

A remaining question is: should we make any additional changes to the SQL code to ensure comparable behaviors?

See this notebook for additional context and interactive examples.

mbostock commented 1 year ago

We shouldn’t change the behavior of the existing n and nn operators. They have a clear meaning—is null and is not null—and the empty string is not the same as null in both SQL and JavaScript. Moreover, we shouldn’t change the definition of existing operators since doing so would retroactively affect existing notebooks (and it’s difficult to argue that this is merely fixing a bug).

I think if you want a query that matches NULL (null), the empty string, and NaN in JavaScript (and maybe also Invalid Date objects? I don’t recall how we represent those), it should probably be an in query, or perhaps an entirely new operator so that we can avoid changing the behavior of existing ones. I think it’s reasonable to map the JavaScript NaN value to be NULL in SQL, but I don’t think we should do the reverse because these are distinct types in JavaScript unlike SQL.

mkfreeman commented 1 year ago

Thanks, @mbostock - how would you feel about adding new missing (m) and notMissing (nm) queries in the __table function that perform the suggested checks, e.g.:

case "m": {
        source = source.filter(
          (d) =>
            d[column] == null || d[column] === "" || Number.isNaN(d[column])
        );
        break;
      }
case "nm": {
        source = source.filter(
          (d) =>
            d[column] != null && d[column] !== "" && !Number.isNaN(d[column])
        );

If we add those, it would surface two questions:

mkfreeman commented 1 year ago

What are the corresponding SQL queries (if any)?

Perhaps a missing m check for SQL is just NULL or '' (empty string):

case "m":
        appendSql(` IS NULL`, args);
        appendSql(` = ""`, args); // not sure if we need to use appendOperand before this
        break;
case "nm":
        appendSql(` IS NOT NULL`, args);
        appendSql(` !=""`, args);
mbostock commented 1 year ago

The SQL for the m query would presumably have to look like (x IS NULL OR x = '' OR x = float 'NaN')?

(I’m tempted to use d for “defined” and nd for “not defined” to avoid a double negative, but it doesn’t matter much since those names are only used internally.)

mkfreeman commented 1 year ago

That makes sense -- for the sake of conversation (see this notebook in progress) I've added the d and nd options to this PR (and changed the PR title).

mkfreeman commented 1 year ago

A brief explanation of Only check that the first value is a primitive type:

Because we want to support columns that put unmatched values into a "missing" bin, we no longer want to check that the first n (20) values are of the same type. For example, we want to support the array of strings ["a", "b", 1] that indicates 1 value not string. Without the above change, we would get an Invalid table data error pushed up.

mkfreeman commented 1 year ago

Important questions! I wanted to write out my thoughts before chatting. Feel free to comment here / in the notebook, or just wait until our meeting.

I'd say we're just moving the guard rails somewhere else 😅 .

mkfreeman commented 1 year ago

Corresponding monorepo PR that uses these filters.

mkfreeman commented 1 year ago

Here's a notebook with the isValid function tested against a number of values. Two observations that I found initially surprising (but I think are fine):

mkfreeman commented 1 year ago

Great point -- appreciate you took the time to write that out, will make that change soon and re-request a review!