polydbms / sheetreader-duckdb

MIT License
38 stars 3 forks

Feature request: clean column names #54

Open kylelundstedt opened 1 day ago

kylelundstedt commented 1 day ago

Great work on the new (and more logically named) extension for reading in Excel files!

As a recovering R coder, I've used the R package janitor, which offers functions similar to sheetreader's for reading in "cleaned up" Excel files. For example, its clean_names function takes the header row from an Excel worksheet and:

  • Parses letter cases and separators to a consistent format.
  • Defaults to snake_case, but other cases like camelCase are available
  • Handles special characters and spaces, including transliterating characters like œ to oe.
  • Appends numbers to duplicated names
  • Converts “%” to “percent” and “#” to “number” to retain meaning
  • Spacing (or lack thereof) around numbers is preserved
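
To make the behaviour above concrete, here is a rough Python sketch of what such a cleaner might do. It is not janitor's actual implementation, and it is not an existing sheetreader API: the function names `clean_name` and `clean_names`, the small replacement tables, and the `"x"` fallback for empty headers are all assumptions made for illustration. Only snake_case output is covered.

```python
import re
import unicodedata

# Hypothetical replacement tables; janitor's real rules are more extensive.
SYMBOLS = {"%": " percent ", "#": " number "}
LIGATURES = {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE", "ß": "ss"}


def clean_name(name: str) -> str:
    """Convert one header cell to a snake_case identifier (sketch only)."""
    # Spell out symbols and ligatures so their meaning is not lost.
    for src, dst in {**SYMBOLS, **LIGATURES}.items():
        name = name.replace(src, dst)
    # Strip accents: decompose characters, then drop anything non-ASCII.
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Split camelCase boundaries: "unitPrice" becomes "unit_Price".
    # Letter/digit boundaries are left alone, so "Sales2019" stays together.
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name)
    # Collapse every run of non-alphanumerics into a single underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_").lower()
    # Assumed fallback for headers that clean down to nothing.
    return name or "x"


def clean_names(names):
    """Clean a whole header row, appending _2, _3, ... to repeated names."""
    seen = {}
    cleaned = []
    for raw in names:
        base = clean_name(raw)
        seen[base] = seen.get(base, 0) + 1
        # Note: this simple scheme does not check whether "base_2" already
        # exists as a name in its own right.
        cleaned.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return cleaned


header = ["First Name", "% Complete", "Order #", "unitPrice", "cœur", "First Name"]
print(clean_names(header))
```

For the header row above, this produces `first_name`, `percent_complete`, `order_number`, `unit_price`, `coeur`, and `first_name_2`.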

My specific request for sheetreader-duckdb is a clean_names equivalent that converts horrible Excel header names into nicely formatted DuckDB column names.

The ability to clean column names on import is particularly useful; my experience has been that it is really clunky to rename a large number of DuckDB columns post-import.
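
For illustration, post-import renaming might look roughly like the sketch below: one `ALTER TABLE ... RENAME COLUMN` statement per column, generated from a mapping of old to new names. The helper names `quote_ident` and `rename_statements`, the table name `sales`, and the example columns are hypothetical; the mapping itself would still have to be written by hand or produced by some separate cleaning step.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def rename_statements(table: str, renames: dict) -> list:
    """Build one RENAME COLUMN statement per column whose name changes."""
    return [
        f"ALTER TABLE {quote_ident(table)} "
        f"RENAME COLUMN {quote_ident(old)} TO {quote_ident(new)};"
        for old, new in renames.items()
        if old != new  # skip columns that are already clean
    ]


# Hypothetical table and columns, for illustration only.
for stmt in rename_statements("sales", {"% Complete": "percent_complete", "id": "id"}):
    print(stmt)
```

With a wide worksheet this means one statement per column after every import, which is the overhead that cleaning names at read time would avoid.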