Closed visnup closed 3 years ago
Should fileAttachment.xslx()
optionally take the same arguments as ExcelWorkbook.sheet(name, options)
as a kind of shorthand? And if you pass them, it calls .sheet
for you and returns that value? It would make the case of extracting a specific sheet and range a one-liner…
The big thing: documentation and context/motivations for this API are needed for this review. For example: How is the range option specified? Why do we support both a string form and an object form of ranges? How is the header option implemented (i.e., how do we define header rows)? How are formulas, rich text, and hyperlinked values represented, and why? etc.
Good point. I took a first stab at motivation and goals and documentation of sheet()
in the description. Will follow up on the other comments later.
@mbostock @Fil ok, this is ready for re-review. should be pretty close or final?
To add an idea while talking to @mbostock: in the future we could offer another option raw: true
that would return cell values as Objects, but with valueOf
and toString
methods which would coerce them back to the more primitive values we're returning here. That would give people an intermediate method of getting at more of the stored information without having to use ExcelJS or SheetJS directly.
One last question: I wonder if we should clean up the cells we use as headers?
Yeah I think we could trim them? Unsure about weird spacing inside them though? Or awkward punctuation too...
my preference would be to replace all \n \r by spaces, and trim
my preference would be to replace all \n \r by spaces, and trim
What about combine multiple whitespace characters inside the string into a single space?
Does d3-dsv do any type of trimming?
no… ok, then 8-)
no… ok, then 8-)
yeah ok, I think the default response is to encourage people to fix sources upstream and clean up those things in the xlsx files, then everyone benefits. I noticed a few values which could use trimming too and was tempted to do it in stdlib, so it's a slippery slope.
Can add later, but an idea re: the "#" field: I've been tempted to add a widths
property on the returned array for Inputs.table
to use. We could do it based on the actual widths in the spreadsheet too.
my preference would be to replace all \n \r by spaces, and trim
Why do you want to remove line breaks? Personally I've used them in headers quite a few times to keep long headers more readable/organized. (Sorry if this was already discussed.)
Edit: oops, look like this idea was already dropped in a later reply.
Alternative to #215
Handles dates correctly, using UTC. Based largely on @Fil's previous exploration.
The main goal of this PR is to give people a way to extract data out of xlsx files quickly, efficiently, and correctly. To do that, we assume people should be able to visually recognize and find the data they want to extract, leveraging any previous familiarity with the xlsx file. We don't want to spend much effort on preserving styling or presentation (widths, fonts, value formatting, merged cells, frozen panes) or features used during building a spreadsheet like formula definitions (only results are extracted). At the same time, we might decide some presentation features are worth preserving if we think it would help people trust and recognize the extracted contents of the spreadsheet (number formatting I'd guess could fall into this).
Also, we should make it easy to maintain extracting new or updated data from an updated or mutable file, which implies the importance of the unbounded range feature ("12:" to mean extract starting at row 12 to the end).
The extracted data should work well with the rest of the downstream toolchains in Observable. So, plain JavaScript values (
NaN
) over descriptive objects ({ error: "#DIV/0" }
).The
Workbook
API hopefully will be reusable to represent a Google Sheet just as well in the future.The assumed, frequent workflow of this API is that a user would pass all of the data first to
Inputs.table
or similar for exploratory recognition, then filter it down using therange
andheaders
options described below.Workbook.sheet(name, { range, headers }): Record<string, any>[]
Returns an array of objects representing the contents of cells in a specific sheet of the workbook. An example return value may look something like:
Empty cells are skipped: objects will not include fields or values for them. Empty rows are not skipped assuming they'll aid in data recognition. Values are coerced to their JavaScript types: numbers, strings, Date objects. Formula results are included, but formula definitions ignored. Row numbers from the source sheet are included to assist with range specification and recognition.
name: string | number is a sheet name to get data for. If it's a string, it must match a name in
Workbook.sheetNames
. You can pass a zero-indexed number to get the corresponding sheet in order ofWorkbook.sheetNames
. For example,sheet(0)
is the first sheet.range: string specifies a single rectangular range of cells to extract from the sheet as an Excel-based representation of a range, e.g.
"B4:L123"
to mean from cell B4 in the top left to L123 in the bottom right, inclusive. By default if no range is specified, the entire sheet is extracted.Similar to Excel, the row or column part of the start or end may be omitted to mean the entire row or column, e.g.
"4:123"
to mean rows 4 through 123 inclusive. Extending the standard syntax, you may omit a start or end specifier to mean "A1" or last column and last row, e.g."4:"
to mean row 4 to the end of the sheet.Union
"A1:B3,D1:G3"
and intersection"A1:C3 B2:D4"
specifiers are not supported.headers: boolean will treat the first extracted row as column headers and use their values as field names for returned objects. The default is false. If a value doesn't exist in the header row for a value, column names (A-ZZ) will be used instead. Underscores (_) are appended if field names are repeated.
With
{ range: "2:", headers: true }
, the above penguins data would be: