Infer schema for relevant data sources

observablehq / stdlib

The Observable standard library.

https://observablehq.com/@observablehq/standard-library

ISC License

Infer schema for relevant data sources #344

Closed: libbey-observable closed this pull request 1 year ago

libbey-observable commented 1 year ago

In this approach, we take a sample (currently the first 100 rows) and, for each column, count how many times we encounter each possible data type. The most frequently encountered type then becomes the column's type. We could also add some random sampling.
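For readers following along, the majority-vote idea can be sketched roughly as below. This is an illustration only, not the code in this PR: the names `inferType` and `inferSchema`, the set of type labels, the rule of skipping empty values, and the string-parsing heuristics are all assumptions.

```javascript
// Hypothetical sketch of majority-vote type inference; not the stdlib implementation.

// Guess a type label for one value. Returns null for missing values,
// which are skipped rather than counted (an assumption of this sketch).
function inferType(value) {
  if (value == null) return null;
  if (typeof value === "number") return "number";
  if (typeof value === "boolean") return "boolean";
  if (value instanceof Date) return "date";
  if (typeof value === "string") {
    const s = value.trim();
    if (s === "") return null;
    if (/^(true|false)$/i.test(s)) return "boolean";
    if (!isNaN(Number(s))) return "number";
    // Only treat ISO-like strings as dates (assumed heuristic).
    if (/^\d{4}-\d{2}-\d{2}/.test(s) && !isNaN(Date.parse(s))) return "date";
    return "string";
  }
  return "other";
}

// Count types per column over the first `sampleSize` rows and
// pick the most frequent type for each column.
function inferSchema(rows, sampleSize = 100) {
  const counts = new Map(); // column name -> Map(type -> count)
  for (const row of rows.slice(0, sampleSize)) {
    for (const [column, value] of Object.entries(row)) {
      if (!counts.has(column)) counts.set(column, new Map());
      const type = inferType(value);
      if (type === null) continue;
      const columnCounts = counts.get(column);
      columnCounts.set(type, (columnCounts.get(type) || 0) + 1);
    }
  }
  return Array.from(counts, ([name, columnCounts]) => {
    let best = "string"; // fallback when a column has no usable values
    let max = 0;
    for (const [type, n] of columnCounts) {
      if (n > max) {
        best = type;
        max = n;
      }
    }
    return {name, type: best};
  });
}
```

With this sketch, a column holding `"1"`, `"2"`, and `"n/a"` would be labelled `number`, since two of the three sampled values parse as numbers. Ties go to whichever type was seen first.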

Note that the data in the screenshots has not been coerced yet. This is an in-between point: we've inferred the types but not yet applied them.

From a CSV file: (screenshot: Screen Shot 2023-01-19 at 3.51.20 PM)

From a JSON file: (screenshot: Screen Shot 2023-01-19 at 3.36.49 PM)

mkfreeman commented 1 year ago

This is looking great! For reference, here's how we do the random sampling for getting string lengths: it includes the first 20 rows (because they are what the user sees, which is perhaps not necessary here) and randomly samples 100 values, using a seed so the random values are always the same. https://github.com/observablehq/observablehq/blob/main/notebook-next/src/worker/computeSummaries.js#L86
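A rough sketch of that kind of seeded sampling might look like the following. It is not the code at the link above: `sampleRows`, its option names, the default seed, the choice of the mulberry32 generator, and drawing the random part without replacement from the rows after the head are all assumptions made for illustration.

```javascript
// Hypothetical sketch of deterministic sampling; not the linked implementation.

// mulberry32: a small seedable PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    let t = (a += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Keep the first `head` rows, then add up to `size` rows drawn at random
// (without replacement) from the remainder. The same seed always yields
// the same sample.
function sampleRows(rows, {head = 20, size = 100, seed = 42} = {}) {
  const random = mulberry32(seed);
  const sample = rows.slice(0, head);
  // Indices of the rows that were not already taken by the head.
  const indices = Array.from(
    {length: rows.length - sample.length},
    (_, i) => i + sample.length
  );
  const n = Math.min(size, indices.length);
  // Partial Fisher-Yates shuffle: only the first n positions are settled.
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(random() * (indices.length - i));
    [indices[i], indices[j]] = [indices[j], indices[i]];
    sample.push(rows[indices[i]]);
  }
  return sample;
}
```

Fixing the seed means the inferred schema would not change between runs on the same data, which is the property described above. For type inference the head could be dropped by passing `{head: 0}`.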

libbey-observable commented 1 year ago

@mbostock Thanks again for the valuable feedback – the issues you mentioned have been addressed in https://github.com/observablehq/stdlib/pull/346. Closing this PR in favor of that.