Closed mkfreeman closed 1 year ago
We shouldn’t change the behavior of the existing n
and nn
operators. They have a clear meaning—is null and is not null—and the empty string is not the same as null in both SQL and JavaScript. Moreover, we shouldn’t change the definition of existing operators since doing so would retroactively affect existing notebooks (and it’s difficult to argue that this is merely fixing a bug).
I think if you want a query that matches NULL (null), the empty string, and NaN in JavaScript (and maybe also Invalid Date objects? I don’t recall how we represent those), it should probably be an in
query, or perhaps an entirely new operator so that we can avoid changing the behavior of existing ones. I think it’s reasonable to map the JavaScript NaN value to be NULL in SQL, but I don’t think we should do the reverse because these are distinct types in JavaScript unlike SQL.
Thanks, @mbostock - how would you feel about adding new missing
(m
) and notMissing
(nm
) queries in the __table
function that perform the suggested checks, e.g.:
case "m": {
source = source.filter(
(d) =>
d[column] == null || d[column] === "" || Number.isNaN(d[column])
);
break;
}
case "nm": {
source = source.filter(
(d) =>
d[column] != null && d[column] !== "" && !Number.isNaN(d[column])
);
If we add those, it would surface two questions:
is missing
and is not missing
options)?What are the corresponding SQL queries (if any)?
Perhaps a missing m
check for SQL is just NULL
or ''
(empty string):
case "m":
appendSql(` IS NULL`, args);
appendSql(` = ""`, args); // not sure if we need to use appendOperand before this
break;
case "nm":
appendSql(` IS NOT NULL`, args);
appendSql(` !=""`, args);
The SQL for the m
query would presumably have to look like (x IS NULL OR x = '' OR x = float 'NaN')
?
(I’m tempted to use d
for “defined” and nd
for “not defined” to avoid a double negative, but it doesn’t matter much since those names are only used internally.)
That makes sense -- for the sake of conversation (see this notebook in progress) I've added the d
and nd
options to this PR (and changed the PR title).
A brief explanation of Only check that the first value is a primitive type:
Because we want to support columns that put unmatched values into a "missing" bin, we no longer want to check that the first n
(20) values are of the same type. For example, we want to support the array of strings ["a", "b", 1]
that indicates 1 value not string. Without the above change, we would get an Invalid table data
error pushed up.
Important questions! I wanted to write out my thoughts before chatting. Feel free to comment here / in the notebook, or just wait until our meeting.
I'd say we're just moving the guard rails somewhere else 😅 .
Corresponding monorepo PR that uses these filters.
Here's a notebook with the isValid
function tested against a number of values. Two observations that I found initially surprising (but I think are fine):
object
columns: array
, new ArrayBuffer
, and date
are all valid (which makes sense, they're objects)other
columns: everything except null
and undefined
is valid (makes sense! I guess they're valid forms of "other")Great point -- appreciate you took the time to write that out, will make that change soon and re-request a review!
This is a small code change that warrants a detailed explanation. For context, the
__table
function creates JavaScript array filters that mimic SQL filter functions. This PR is concerned with the way we think about NULL values. The SQL code that gets constructed by the Table cell reads:The corresponding JavaScript code currently reads:
While this makes sense at face value, the concept of
null
means something different in JavaScript and SQL. In SQL, "A field with a NULL value is a field with no value." (source). In JavaScript,null
is a primitive value which may appear in the dataset.Currently, the way the
__table
function checks fornull
values doesn't align with the way the Observable Data Table cell checks forNULL/EMPTY
. When computing summaries for the Summary Charts, the Data Table cell uses the following check:Note, it checks if the value is
null
orundefined
in the same way the standard library currently does, and also includes two additional checks: empty string (value === ""
) and NaNNumber.isNaN(value)
. These checks importantly identify common JavaScript data patterns that are uncommon (or not possible?) in SQL.A remaining question is: should we make any additional changes to the SQL code to ensure comparable behaviors?
See this notebook for additional context and interactive examples.