observablehq / stdlib

The Observable standard library.
https://observablehq.com/@observablehq/standard-library
ISC License
966 stars 83 forks source link

XLSX support with ExcelJS #248

Closed visnup closed 3 years ago

visnup commented 3 years ago

Alternative to #215

Handles dates correctly, using UTC. Based largely on @Fil's previous exploration.

Screen Shot 2021-09-10 at 3 45 19 PM

The main goal of this PR is to give people a way to extract data out of xlsx files quickly, efficiently, and correctly. To do that, we assume people should be able to visually recognize and find the data they want to extract, leveraging any previous familiarity with the xlsx file. We don't want to spend much effort on preserving styling or presentation (widths, fonts, value formatting, merged cells, frozen panes) or features used during building a spreadsheet like formula definitions (only results are extracted). At the same time, we might decide some presentation features are worth preserving if we think it would help people trust and recognize the extracted contents of the spreadsheet (number formatting I'd guess could fall into this).

Also, we should make it easy to maintain extracting new or updated data from an updated or mutable file, which implies the importance of the unbounded range feature ("12:" to mean extract starting at row 12 to the end).

The extracted data should work well with the rest of the downstream toolchains in Observable. So, plain JavaScript values (NaN) over descriptive objects ({ error: "#DIV/0" }).

The Workbook API hopefully will be reusable to represent a Google Sheet just as well in the future.

The assumed, frequent workflow of this API is that a user would pass all of the data first to Inputs.table or similar for exploratory recognition, then filter it down using the range and headers options described below.

Workbook.sheet(name, { range, headers }): Record<string, any>[]

Returns an array of objects representing the contents of cells in a specific sheet of the workbook. An example return value may look something like:

[ { A: "Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network." },
  { A: "Region", B: "Island", C: "Date Egg", D: "Culmen Length (mm)" },
  { A: "Anvers", B: "Torgersen", C: 2007-11-11, D: 39.1 },
  ...
]

Empty cells are skipped: objects will not include fields or values for them. Empty rows are not skipped assuming they'll aid in data recognition. Values are coerced to their JavaScript types: numbers, strings, Date objects. Formula results are included, but formula definitions ignored. Row numbers from the source sheet are included to assist with range specification and recognition.

name: string | number is a sheet name to get data for. If it's a string, it must match a name in Workbook.sheetNames. You can pass a zero-indexed number to get the corresponding sheet in order of Workbook.sheetNames. For example, sheet(0) is the first sheet.

range: string specifies a single rectangular range of cells to extract from the sheet as an Excel-based representation of a range, e.g. "B4:L123" to mean from cell B4 in the top left to L123 in the bottom right, inclusive. By default if no range is specified, the entire sheet is extracted.

Similar to Excel, the row or column part of the start or end may be omitted to mean the entire row or column, e.g. "4:123" to mean rows 4 through 123 inclusive. Extending the standard syntax, you may omit a start or end specifier to mean "A1" or last column and last row, e.g. "4:" to mean row 4 to the end of the sheet.

Union "A1:B3,D1:G3" and intersection "A1:C3 B2:D4" specifiers are not supported.

headers: boolean will treat the first extracted row as column headers and use their values as field names for returned objects. The default is false. If a value doesn't exist in the header row for a value, column names (A-ZZ) will be used instead. Underscores (_) are appended if field names are repeated.

With { range: "2:", headers: true }, the above penguins data would be:

[ { Region: "Anvers", Island: "Torgersen", "Date Egg": 2007-11-11, "Culmen Length (mm)": 39.1 },
  ...
]
visnup commented 3 years ago

Should fileAttachment.xslx() optionally take the same arguments as ExcelWorkbook.sheet(name, options) as a kind of shorthand? And if you pass them, it calls .sheet for you and returns that value? It would make the case of extracting a specific sheet and range a one-liner…

visnup commented 3 years ago

The big thing: documentation and context/motivations for this API are needed for this review. For example: How is the range option specified? Why do we support both a string form and an object form of ranges? How is the header option implemented (i.e., how do we define header rows)? How are formulas, rich text, and hyperlinked values represented, and why? etc.

Good point. I took a first stab at motivation and goals and documentation of sheet() in the description. Will follow up on the other comments later.

visnup commented 3 years ago

@mbostock @Fil ok, this is ready for re-review. should be pretty close or final?

visnup commented 3 years ago

To add an idea while talking to @mbostock: in the future we could offer another option raw: true that would return cell values as Objects, but with valueOf and toString methods which would coerce them back to the more primitive values we're returning here. That would give people an intermediate method of getting at more of the stored information without having to use ExcelJS or SheetJS directly.

visnup commented 3 years ago

One last question: I wonder if we should clean up the cells we use as headers?

Yeah I think we could trim them? Unsure about weird spacing inside them though? Or awkward punctuation too...

Fil commented 3 years ago

my preference would be to replace all \n \r by spaces, and trim

visnup commented 3 years ago

my preference would be to replace all \n \r by spaces, and trim

What about combine multiple whitespace characters inside the string into a single space?

visnup commented 3 years ago

Does d3-dsv do any type of trimming?

Fil commented 3 years ago

no… ok, then 8-)

visnup commented 3 years ago

no… ok, then 8-)

yeah ok, I think the default response is to encourage people to fix sources upstream and clean up those things in the xlsx files, then everyone benefits. I noticed a few values which could use trimming too and was tempted to do it in stdlib, so it's a slippery slope.

visnup commented 3 years ago

Can add later, but an idea re: the "#" field: I've been tempted to add a widths property on the returned array for Inputs.table to use. We could do it based on the actual widths in the spreadsheet too.

mootari commented 3 years ago

my preference would be to replace all \n \r by spaces, and trim

Why do you want to remove line breaks? Personally I've used them in headers quite a few times to keep long headers more readable/organized. (Sorry if this was already discussed.)

Edit: oops, look like this idea was already dropped in a later reply.