Open trantor opened 3 years ago
@trantor it makes total sense! :)
There is also https://github.com/johnkerl/miller/issues/225
So I think there could be three options for duplicate keys:

- `x=a` and then `x=b`, set `x_2=b` instead or somesuch
- `x=a` and then `x=b`, either produce `x=a|b` or `x=[a,b]` (array)

Maybe main Miller options:

- `mlr --on-dup-key overwrite` (the default)
- `mlr --on-dup-key rename-key` (`x_2` etc.)
- `mlr --on-dup-key concatenate-with '|'` (`x=a|b`)
> It might be a new data format or (maybe) an option to automate the process for all data formats wherever viable.
And I am thinking the latter -- it would be done at record-scanning time for each file format.
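
To make the three behaviours concrete, here is a minimal Python sketch of what applying a duplicate-key policy at record-scanning time could look like. It is not Miller's implementation: `build_record` and its `policy` and `sep` parameters are hypothetical names, and the policy strings simply mirror the option values proposed above.

```python
from typing import Dict, Iterable, Tuple


def build_record(
    pairs: Iterable[Tuple[str, str]],
    policy: str = "overwrite",
    sep: str = "|",
) -> Dict[str, str]:
    """Build one record from key/value pairs, resolving duplicate keys.

    Hypothetical policies (named after the options proposed in this thread):
      overwrite   - the last value wins
      rename-key  - second and later occurrences become key_2, key_3, ...
      concatenate - values are joined with `sep`
    """
    record: Dict[str, str] = {}
    counts: Dict[str, int] = {}
    for key, value in pairs:
        if key not in counts:
            # First occurrence is stored as-is under every policy.
            counts[key] = 1
            record[key] = value
            continue
        counts[key] += 1
        if policy == "overwrite":
            record[key] = value
        elif policy == "rename-key":
            # Assumption: no input key already looks like "x_2".
            record[f"{key}_{counts[key]}"] = value
        elif policy == "concatenate":
            record[key] = f"{record[key]}{sep}{value}"
        else:
            raise ValueError(f"unknown policy: {policy}")
    return record
```

Because the policy is applied while the pairs are being read, the same function could in principle sit behind any key-value input format, which is the point of doing it at scanning time rather than per format.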
@trantor I believe this is resolved by https://github.com/johnkerl/miller/pull/794 -- if I'm mistaken please let me know and we can re-open! :)
@johnkerl hey there. I've been absent for what appears to be a long while and, accordingly, I see an enormous amount of new features having been introduced. Great work!
I am commenting here since it relates to this old issue, but I can open a new one if it makes sense.
I've tried out Miller 6 with the feature mentioned in #794; however, it doesn't entirely make sense for my intended target.
My objective was XTAB -> JSON, abusing (sort of) XTAB to handle LDIF representations of LDAP objects, where one could have multiple instances of the same key (e.g. to represent multiple e-mail addresses of an individual, for instance).
That would play very nicely with the new flatten/unflatten features to get JSON arrays out, since they are supported now, however right now it doesn't work out of the box since the first instance of the key is not renamed as `key_1` (and the flatsep is not the default `.`, but that's (maybe) minor). I can do a manual rename if it's a single key which I already know will appear multiple times, but in the case of multiple and unforeseen ones it's not very manageable.
On top of that I've crashed against the (arbitrary?) hard limit of 1000 occurrences of duplicate keys set here https://github.com/johnkerl/miller/blob/46d013d44fdf27bd036b6584aaf3fbe87bbd9b96/internal/pkg/mlrval/mlrmap_accessors.go#L77
which sort of defeats the purpose.
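
As an illustration of the behaviour I have in mind, here is a rough Python sketch, not Miller code. The helper names `number_occurrences` and `collect_arrays` are made up for this example, and `flatsep="."` is assumed as the separator. Every occurrence of a repeated key is numbered, including the first, with no cap on the count, so the numbered keys can then be gathered into arrays.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple, Union


def number_occurrences(
    pairs: List[Tuple[str, str]], flatsep: str = "."
) -> Dict[str, str]:
    """Rename repeated keys to key.1, key.2, ... (first occurrence included)."""
    totals = Counter(key for key, _ in pairs)
    seen: Counter = Counter()
    record: Dict[str, str] = {}
    for key, value in pairs:
        if totals[key] == 1:
            record[key] = value  # unique keys are left untouched
        else:
            seen[key] += 1
            record[f"{key}{flatsep}{seen[key]}"] = value
    return record


def collect_arrays(
    record: Dict[str, str], flatsep: str = "."
) -> Dict[str, Union[str, List[str]]]:
    """Gather key.1, key.2, ... into a list stored under key."""
    out: Dict[str, Union[str, List[str]]] = {}
    for key, value in record.items():
        base, sep, index = key.rpartition(flatsep)
        if sep and index.isdigit():
            out.setdefault(base, []).append(value)
        else:
            out[key] = value
    return out


if __name__ == "__main__":
    # Made-up sample data in the spirit of an LDAP entry.
    pairs = [("cn", "Jane"), ("mail", "jane@example.org"), ("mail", "j@example.org")]
    print(json.dumps(collect_arrays(number_occurrences(pairs))))
```

The sketch assumes that keys do not already end in the separator followed by digits; a real implementation would have to decide how to treat such keys.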
Do you think that these "problems" might be addressable?
Thanks again and good job regardless :smile:
Hello @johnkerl and everyone else.
In recent times I've sort of twisted Miller in order to use it to parse LDIF representations of data from an LDAP directory.
Apart from the fact that such a format presents, for instance, multi-line values as base64-encoded strings and that there are specific ways to encode characters outside a certain ASCII range in the keys, I've managed to make it work restricting the use of Miller only on the keys whose values are "simple" and using our trusty `sed` to substitute text around to make the input compatible with the XTAB format.

One thing I cannot work around, however, is the fact that objects in an LDAP directory can have multiple instances of the same key with different values each. One person might have multiple addresses, for instance. The way Miller processes XTAB input now leads to it keeping the last value found for a key within a record. What I'd like is for a way to handle multi-valued objects much in the way JSON arrays are handled, or by creating multi-valued columns with internal separators, à la `nest`, you might say. It might be a new data format or (maybe) an option to automate the process for all data formats wherever viable.

Does that make sense to you?