nacnudus / unpivotr

Unpivot complex and irregular data layouts in R
https://nacnudus.github.io/unpivotr/

New inverted grammar, starting with header cells #31

Open nacnudus opened 4 years ago

nacnudus commented 4 years ago

The current unpivotr grammar starts from the point of view of the data cells and searches for their associated headers. This design imitated databaker, because it is useful in the most common case (in my experience), where:

  1. The header cells surround the data cells.
  2. There are more distinct headers than you would care to hard-code into a script.
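
For contrast with what follows, here is a minimal base-R sketch of the data-cell point of view. It is not unpivotr's implementation (the real entry point is `behead()`); the toy sheet and the helper `nearest_north()` are hypothetical and exist only for illustration.

```r
# Toy sheet, one row per cell: a header row above two rows of data (hypothetical).
cells <- data.frame(
  row = c(1, 1, 2, 2, 3, 3),
  col = c(1, 2, 1, 2, 1, 2),
  chr = c("Species", "Sepal.Length", "setosa", "5.1", "virginica", "6.3"),
  stringsAsFactors = FALSE
)

headers <- cells[cells$row == 1, ]
data_cells <- cells[cells$row > 1, ]

# Hypothetical helper: from one data cell, look north (same column, smaller
# row number) and return the text of the nearest header, or NA if none exists.
nearest_north <- function(row, col, headers) {
  candidates <- headers[headers$col == col & headers$row < row, ]
  if (nrow(candidates) == 0L) return(NA_character_)
  candidates$chr[which.max(candidates$row)]
}

# The search starts at each data cell, not at the headers.
data_cells$header <- mapply(
  nearest_north, data_cells$row, data_cells$col,
  MoreArgs = list(headers = headers)
)
```

Each data cell ends up labelled with the header above it, which works well when condition (1) holds.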

At long last, there is an example of a consistent schema that breaks (1) and doesn't suffer from (2).

Untidy data

image

Tidy version

image

Thoughts

  1. Locate each type of header by filtering, e.g. character == "Species:". Error if not unique (see step 4 for when whole tables repeat, as in the example).
  2. Describe the domain of the header over related data cells by its direction and limit, e.g. direction = "W" and limit = 1 or limit = Inf. Unlike the existing grammar, the direction is from the point of view of the header cell, rather than the data cells.
  3. Given a set of headers so described, unpivotr would resolve the data cells to the matching headers.
  4. If the whole table repeats, as in the example above, the same technique would apply as now -- identify a corner cell of each table, nest, and unpivot one at a time.
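
Steps 1 to 3 could be sketched in base R roughly as follows. Nothing here is existing unpivotr API: `locate_header()`, `header_domain()`, `resolve_headers()`, the `spec` table, and the toy sheet are all hypothetical names and data, and the layout is an assumption rather than the one in the images above.

```r
# Toy sheet, one row per cell (hypothetical): each header sits to the east
# of the cells it describes.
cells <- data.frame(
  row = c(1, 1, 2, 2, 2, 2),
  col = c(1, 2, 1, 2, 3, 4),
  chr = c("setosa", "Species:", "5.1", "4.9", "4.7", "Sepal.Length:"),
  stringsAsFactors = FALSE
)

# Step 1: locate a header by filtering; error unless exactly one cell matches.
locate_header <- function(cells, text) {
  header <- cells[cells$chr == text, , drop = FALSE]
  if (nrow(header) != 1L) {
    stop("Expected exactly one header cell matching '", text,
         "', found ", nrow(header))
  }
  header
}

# Step 2: the domain of a header, seen from the header cell. Returns a
# logical vector marking the cells within `limit` steps in `direction`.
header_domain <- function(cells, header, direction, limit = Inf) {
  d_row <- cells$row - header$row
  d_col <- cells$col - header$col
  dist <- switch(direction,
    N = ifelse(d_col == 0 & d_row < 0, -d_row, NA),
    S = ifelse(d_col == 0 & d_row > 0, d_row, NA),
    W = ifelse(d_row == 0 & d_col < 0, -d_col, NA),
    E = ifelse(d_row == 0 & d_col > 0, d_col, NA),
    stop("Unknown direction: ", direction)
  )
  !is.na(dist) & dist <= limit
}

# Step 3: resolve the cells in each header's domain to that header.
resolve_headers <- function(cells, spec) {
  out <- lapply(seq_len(nrow(spec)), function(i) {
    header <- locate_header(cells, spec$text[i])
    in_domain <- header_domain(cells, header, spec$direction[i], spec$limit[i])
    data <- cells[in_domain, c("row", "col", "chr"), drop = FALSE]
    data$header <- rep(spec$text[i], nrow(data))
    data
  })
  do.call(rbind, out)
}

# One row per header type: how to find it, and how far its domain reaches.
spec <- data.frame(
  text = c("Species:", "Sepal.Length:"),
  direction = c("W", "W"),
  limit = c(1, Inf),
  stringsAsFactors = FALSE
)

resolved <- resolve_headers(cells, spec)
```

In this toy case `resolved` pairs "setosa" with "Species:" and the three numeric cells with "Sepal.Length:". For step 4, the same functions would be applied to each nested table in turn.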

jl5000 commented 4 years ago

Do we know if there are any other datasets with this structure or if it's an evil one-off? I've never seen a structure like this before.

nacnudus commented 4 years ago

That's a reasonable point, although it isn't how nerd-sniping works :smile:

danstrobridge-Weston commented 3 years ago

I often get this sort of semi-structured format when working with spreadsheets or text files generated by exporting pivoted tables from PDF. I'm eager to test the readr::melt functionality for dealing with it on my next project that can afford to pay me for some development time.