trelliscope / trelliscopejs-lib

JavaScript viewer for Trelliscope displays

Fix CrossFilter to properly support all data types #777

Closed: hafen closed this issue 1 year ago

hafen commented 1 year ago

Doing this properly should fix #772 and #768.

  1. CrossFilter needs to handle missing values
  2. CrossFilter needs to handle all data types
  3. CrossFilter needs to put missing values at the end when sorting (regardless of direction)

1. CrossFilter needs to handle missing values

For (1), note that when a value is missing, the data coming into the app simply omits that column's key for the row rather than including a null placeholder.

For example, this dataset has three rows, with a numeric column a and a string column b. Only row three has non-missing data for both variables. Row one is missing b and row two is missing a.

[
  {
    "a": 1
  },
  {
    "b": "val1"
  },
  {
    "a": 3,
    "b": "val2"
  }
]

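Crossfilter does not like this: reading an absent key yields undefined, and Crossfilter expects every dimension value to be naturally ordered.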

In every instance where we create a dimension with crossfilter.dimension(...), we must pass a function that returns a valid value when the data is missing (see (3) for more on what value to return in that case).
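As a rough sketch of what such a function could look like, the accessor below falls back to a caller-supplied value whenever the key is absent. The helper name `makeValueGetter` and the placeholder values are illustrative assumptions, not existing code in the viewer.

```javascript
// Hypothetical helper: builds a value accessor of the kind that could be
// passed to crossfilter.dimension(...). Not part of trelliscopejs-lib.
function makeValueGetter(key, missingValue) {
  return function valueGetter(row) {
    const value = row[key];
    // A missing column shows up as an absent key (undefined); treat null the same way.
    return value === undefined || value === null ? missingValue : value;
  };
}

// The three-row dataset from above.
const rows = [{ a: 1 }, { b: 'val1' }, { a: 3, b: 'val2' }];

// Placeholder choices are assumptions here; see (3) for how to pick them.
const getA = makeValueGetter('a', Number.MAX_SAFE_INTEGER);
const getB = makeValueGetter('b', '~');

const aValues = rows.map(getA); // every row now yields a number
const bValues = rows.map(getB); // every row now yields a string
```

With accessors like these, every row produces a value of the dimension's type, so nothing undefined reaches the dimension.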

2. CrossFilter needs to handle all data types

Currently only string, number, and date are supported. We need to support all the other types, and should probably have a "catch-all" that coerces to string if somehow a new type ends up there.
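One possible shape for the catch-all is sketched below. The function name `normalizeValue` is hypothetical, the type names are taken from the ones mentioned in this thread, and it assumes dates arrive as ISO 8601 strings; missing values are assumed to be handled separately.

```javascript
// Hypothetical normalizer: converts a non-missing raw value into something
// naturally ordered, based on the column's declared type.
function normalizeValue(type, value) {
  switch (type) {
    case 'number':
      return Number(value);
    case 'date':
    case 'datetime':
      // Assumption: dates are ISO 8601 strings; compare them as epoch milliseconds.
      return new Date(value).getTime();
    case 'string':
    case 'href':
      return String(value);
    default:
      // Catch-all: any type not listed above is coerced to a string.
      return String(value);
  }
}
```

Other types would get their own cases as needed; the point of the default branch is only that an unrecognized type still produces an orderable value instead of failing.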

Note that for type factor, we want the data to sort in the order of its levels. For example, suppose in the gapminder example the levels are specified in this order: ['Europe', 'Asia', 'Oceania', 'Africa', 'Americas']. When the user specifies to sort ascending, the panels should appear in the order specified by these levels. If descending, then it should be the reverse.

I think the easiest way to do this would be if I were to encode these in JSON as 1 for "Europe", 2 for "Asia", etc. Then the factor sort dimension valueGetter could simply be numeric. The factor filter dimension value getter, on the other hand, would need to map the integer value to the appropriate factor level so that we are searching the factor as a string.

Note that in this case we would also need to update how factor values are shown in the panel labels (similarly, it's as simple as pulling the factor level based on the specified index).
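To make the two getters concrete, here is a small sketch using the gapminder levels above. The column name `continent`, the sample rows, and the getter names are assumptions for illustration, and missing values are left out of this example.

```javascript
// Levels in the order they should sort; codes are 1-based (1 = 'Europe').
const levels = ['Europe', 'Asia', 'Oceania', 'Africa', 'Americas'];

// Hypothetical rows with the factor already encoded as an integer code.
const rows = [{ continent: 3 }, { continent: 1 }, { continent: 5 }, { continent: 2 }];

// Sort getter: the integer code is already in level order, so use it as-is.
const factorSortGetter = (row) => row.continent;

// Filter getter: map the 1-based code back to its level so filtering
// can search the factor as a string.
const factorFilterGetter = (row) => levels[row.continent - 1];

const ascending = [...rows]
  .sort((x, y) => factorSortGetter(x) - factorSortGetter(y))
  .map(factorFilterGetter);
const descending = [...ascending].reverse();
```

Sorting on the code gives level order rather than alphabetical order, which is the behavior described above.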

3. CrossFilter needs to put missing values at the end when sorting

For any data type, we want missing values to show up at the end.

I accomplished this for numeric values in the old trelliscope library by setting missing values to the lowest or highest possible integer value (depending on the specified sort order) so that they always sort last. See here. If we are treating the underlying data for factor types as numeric, then we can use the same logic there as well.

Similarly for strings, I think our sort dimensions can assign missing values the lowest (' ') or highest ('~') printable ASCII character, depending on the sort direction, so that they always appear at the end.
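The direction-dependent placeholder idea can be sketched as follows. This uses a plain array sort rather than Crossfilter itself, and the function names and the 'asc'/'desc' direction strings are assumptions for illustration.

```javascript
// Pick a placeholder that lands last for the requested direction:
// the highest value when ascending, the lowest when descending.
function numberSortGetter(key, direction) {
  const missing = direction === 'asc' ? Number.MAX_SAFE_INTEGER : Number.MIN_SAFE_INTEGER;
  return (row) => (row[key] === undefined || row[key] === null ? missing : row[key]);
}

function stringSortGetter(key, direction) {
  const missing = direction === 'asc' ? '~' : ' ';
  return (row) => (row[key] === undefined || row[key] === null ? missing : row[key]);
}

// Stand-in for a dimension sort: order ascending by getter value,
// then reverse the result for descending.
function sortRows(rows, getter, direction) {
  const sorted = [...rows].sort((x, y) => {
    const a = getter(x);
    const b = getter(y);
    return a < b ? -1 : a > b ? 1 : 0;
  });
  return direction === 'asc' ? sorted : sorted.reverse();
}

const rows = [{ a: 1 }, { b: 'val1' }, { a: 3, b: 'val2' }];
const byAAsc = sortRows(rows, numberSortGetter('a', 'asc'), 'asc');
const byADesc = sortRows(rows, numberSortGetter('a', 'desc'), 'desc');
const byBAsc = sortRows(rows, stringSortGetter('b', 'asc'), 'asc');
const byBDesc = sortRows(rows, stringSortGetter('b', 'desc'), 'desc');
```

In all four results the row with the missing value comes last. One caveat of the string placeholders: a real value that begins with a character above '~' (any non-ASCII character, for instance) would sort after the missing rows when ascending, so this only holds for plain ASCII data.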

hafen commented 1 year ago

I just added an example here: https://github.com/hafen/trelliscope-examples3/tree/main/gapminder_crossfilter

This provides missing values for many data types (string, factor, number, date, datetime, href) for testing.

It also implements the new idea of encoding factors as integers. Note that the indexing is 1-based instead of 0-based, so you will need to subtract 1 when looking a level up in a JavaScript array.
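For panel labels, the 1-based adjustment might look like the sketch below. The helper name `factorLabel` and the '[missing]' placeholder text are assumptions, not something the example data or the viewer defines.

```javascript
const levels = ['Europe', 'Asia', 'Oceania', 'Africa', 'Americas'];

// Hypothetical label helper: converts a 1-based factor code to its level,
// falling back to a placeholder when the code is missing or out of range.
function factorLabel(levelList, code) {
  if (code === undefined || code === null) return '[missing]';
  const level = levelList[code - 1]; // 1-based code -> 0-based array index
  return level === undefined ? '[missing]' : level;
}
```

For example, `factorLabel(levels, 1)` gives 'Europe', while a row with no code gets the placeholder.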