[feature request] Input (and maybe) output format handling multi-valued keys

johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

https://miller.readthedocs.io

[feature request] Input (and maybe) output format handling multi-valued keys #635

Open trantor opened 3 years ago

trantor commented 3 years ago

Hello @johnkerl and everyone else.

In recent times I've sort of twisted Miller in order to use it to parse LDIF representations of data from an LDAP directory.

LDIF has its quirks: multi-line values are presented as base64-encoded strings, and there are specific ways to encode characters outside a certain ASCII range in the keys. Even so, I've managed to make it work by restricting the use of Miller to the keys whose values are "simple" and using our trusty sed to substitute text so that the input becomes compatible with the XTAB format.

One thing I cannot work around, however, is the fact that objects in an LDAP directory can have multiple instances of the same key, each with a different value. One person might have multiple addresses, for instance. As things stand, when Miller reads XTAB input it keeps only the last value found for a key within a record. What I'd like is a way to handle multi-valued objects much in the way JSON arrays are handled, or by creating multi-valued columns with internal separators, à la nest, you might say. It might be a new data format or (maybe) an option to automate the process for all data formats wherever viable.

Does that make sense to you?

johnkerl commented 3 years ago

@trantor it makes total sense! :)

There is also https://github.com/johnkerl/miller/issues/225

So I think there could be three options for duplicate keys:

Last-found (the current behavior, and the default)
Modify keys as in https://github.com/johnkerl/miller/issues/225 -- e.g. on first x=a and then x=b, set x_2=b instead or somesuch
Concatenate values as in this issue -- e.g. on first x=a and then x=b, either produce x=a|b or x=[a,b] (array)

Maybe main Miller options:

mlr --on-dup-key overwrite (the default)
mlr --on-dup-key rename-key (x_2 etc.)
mlr --on-dup-key concatenate-with '|' (x=a|b)
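
To make the three proposed behaviours concrete, here is a minimal Python sketch of what a record reader could do with repeated keys. It is not Miller's implementation: the helper `build_record`, its parameters, and the exact `_2` suffix scheme are assumptions for illustration, with policy names borrowed from the options floated above.

```python
POLICIES = ("overwrite", "rename-key", "concatenate-with")


def build_record(pairs, on_dup_key="overwrite", sep="|"):
    """Build one record from (key, value) pairs under a duplicate-key policy.

    Hypothetical helper: the policy names mirror the options proposed in
    this thread, not any confirmed Miller flag.
    """
    if on_dup_key not in POLICIES:
        raise ValueError(f"unknown policy: {on_dup_key!r}")

    record = {}
    counts = {}  # how many times each key has been seen so far
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + 1
        if counts[key] == 1 or on_dup_key == "overwrite":
            # First occurrence, or last-found-wins.
            record[key] = value
        elif on_dup_key == "rename-key":
            # Second occurrence becomes key_2, third key_3, and so on.
            record[f"{key}_{counts[key]}"] = value
        else:
            # concatenate-with: append to the value collected so far.
            record[key] = f"{record[key]}{sep}{value}"
    return record


pairs = [("x", "a"), ("y", "1"), ("x", "b")]
print(build_record(pairs))                      # {'x': 'b', 'y': '1'}
print(build_record(pairs, "rename-key"))        # {'x': 'a', 'y': '1', 'x_2': 'b'}
print(build_record(pairs, "concatenate-with"))  # {'x': 'a|b', 'y': '1'}
```

The sketch ignores the case where a renamed key such as x_2 collides with a field literally named x_2 in the input; a real reader would need to decide how to handle that.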

johnkerl commented 3 years ago

> It might be a new data format or (maybe) an option to automate the process for all data formats wherever viable.

And I am thinking the latter -- it would be done at record-scanning time for each file format.

johnkerl commented 2 years ago

@trantor I believe this is resolved by https://github.com/johnkerl/miller/pull/794 -- if I'm mistaken please let me know and we can re-open! :)

trantor commented 2 years ago

@johnkerl hey there. I've been absent for what appears to be a long while and, accordingly, I see that an enormous number of new features have been introduced. Great work! I am commenting here since this relates to the old issue, but I can open a new one if that makes more sense.

I've tried out Miller 6 with the feature mentioned in #794; however, it doesn't entirely fit my intended target. My objective was XTAB -> JSON, abusing (sort of) XTAB to handle LDIF representations of LDAP objects, where one can have multiple instances of the same key (e.g. to represent multiple e-mail addresses of an individual). That would play very nicely with the new flatten/unflatten features to get JSON arrays out, since those are supported now. Right now, however, it doesn't work out of the box, since the first instance of the key is not renamed as key_1 (and the flatsep is not the default ., but that's (maybe) minor). I can do a manual rename if it's a single key which I already know will appear multiple times, but in the case of multiple and unforeseen keys it's not very manageable.

On top of that, I've crashed against the (arbitrary?) hard limit of 1000 occurrences of duplicate keys set here: https://github.com/johnkerl/miller/blob/46d013d44fdf27bd036b6584aaf3fbe87bbd9b96/internal/pkg/mlrval/mlrmap_accessors.go#L77 which sort of defeats the purpose.

Do you think that these "problems" might be addressable? Thanks again and good job regardless :smile:
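
For illustration, the behaviour asked for here could be sketched in Python roughly as follows. This is only a sketch under stated assumptions, not how Miller works: `number_duplicates` and `unflatten_arrays` are hypothetical helpers, `.` is assumed as the separator, and the sample data is invented. Every occurrence of a repeated key, including the first, gets a numeric suffix starting at 1, with no cap on the count, and the numbered fields are then folded into arrays.

```python
import json


def number_duplicates(pairs, flatsep="."):
    """Rename every occurrence of a repeated key to key<flatsep>N, N from 1.

    Keys that occur only once in the record are left untouched.
    """
    totals = {}
    for key, _ in pairs:
        totals[key] = totals.get(key, 0) + 1

    seen = {}
    record = {}
    for key, value in pairs:
        if totals[key] == 1:
            record[key] = value
        else:
            seen[key] = seen.get(key, 0) + 1
            record[f"{key}{flatsep}{seen[key]}"] = value
    return record


def unflatten_arrays(record, flatsep="."):
    """Fold key<flatsep>1, key<flatsep>2, ... fields into lists.

    Simplification: values are appended in record order, which matches
    the numbering produced by number_duplicates above.
    """
    out = {}
    for name, value in record.items():
        base, found, index = name.rpartition(flatsep)
        if found and index.isdigit():
            out.setdefault(base, []).append(value)
        else:
            out[name] = value
    return out


# Invented sample: one LDAP-like entry with a repeated "mail" key.
pairs = [
    ("cn", "Ada"),
    ("mail", "ada@example.org"),
    ("mail", "ada@example.com"),
]
flat = number_duplicates(pairs)
print(flat)
# {'cn': 'Ada', 'mail.1': 'ada@example.org', 'mail.2': 'ada@example.com'}
print(json.dumps(unflatten_arrays(flat)))
# {"cn": "Ada", "mail": ["ada@example.org", "ada@example.com"]}
```

Counting occurrences in a first pass is what lets the first instance be renamed too, which is the piece a manual rename cannot cover when the repeated keys are not known in advance.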

johnkerl commented 2 years ago

👀