vectordotdev / vrl

Vector Remap Language
Mozilla Public License 2.0

New `to_hive_partition` Remap function #139

Open binarylogic opened 3 years ago

binarylogic commented 3 years ago

Creating Hive partition strings is very common when writing to file-like storage (such as aws_s3). Unfortunately, building these strings by hand is fraught with foot-guns. To protect users from these issues, we should offer a function that makes this task easy.

Examples

Given this event for all examples:

{
    "timestamp": "2021-01-14T21:26:45.433667Z",
    "application_id": 2,
    "environment": "production"
}

Array

And this Remap script:

to_hive_partition([.environment, .application_id, .timestamp], limit: 256)

Would produce this string:

environment=production/application_id=2/timestamp_year=2021/timestamp_month=01/timestamp_day=14/

Notice that:

  1. The keys reflect the path names (curious if this is possible)
  2. The timestamp is opinionated: truncated to the day and split into 3 partitions

Map

And this Remap script:

to_hive_partition({"env": .environment, "app": .application_id, "ts": .timestamp}, limit: 256)

Would produce this string:

env=production/app=2/ts_year=2021/ts_month=01/ts_day=14/

Notice that the map keys are used as the names

Single value

And this Remap script:

to_hive_partition(.timestamp, limit: 256)

Would produce this string:

timestamp_year=2021/timestamp_month=01/timestamp_day=14/

Requirements

cc @jszwedko since he had the pleasure of creating such partition strings for the benchmarking work.

binarylogic commented 3 years ago

Blocking this with needs labels until we can firm up the requirements. I'm expecting @jszwedko to suggest changes 😁 .

jszwedko commented 3 years ago

😄

I think the requirements seem mostly good. A few missing ones that I'm aware of:

We can make the timestamp layout opinionated by default, but I think it'd be useful to let users configure that. Maybe they'd like the timestamp to go first, for example, or they want to segment by hour.
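One way a configurable layout could look, sketched in Python rather than VRL. The `granularity` parameter and the `timestamp_partitions` helper are assumptions made for illustration only; nothing in this issue settles on a name or shape for the option. Putting the timestamp first would then simply be a matter of where the caller places these segments.

```python
from datetime import datetime, timezone

# Hypothetical granularity levels, coarsest first, with their strftime codes.
_LEVELS = [("year", "%Y"), ("month", "%m"), ("day", "%d"), ("hour", "%H")]


def timestamp_partitions(name: str, ts: datetime, granularity: str = "day") -> list:
    """Split `ts` into `name_<unit>=<value>` segments down to `granularity`."""
    units = [unit for unit, _ in _LEVELS]
    if granularity not in units:
        raise ValueError(f"unknown granularity: {granularity!r}")
    depth = units.index(granularity) + 1
    return [f"{name}_{unit}={ts.strftime(code)}" for unit, code in _LEVELS[:depth]]


ts = datetime(2021, 1, 14, 21, 26, 45, tzinfo=timezone.utc)

# Default layout: truncated to the day.
by_day = "/".join(timestamp_partitions("timestamp", ts)) + "/"

# Opt-in finer layout: one extra segment for the hour.
by_hour = "/".join(timestamp_partitions("timestamp", ts, granularity="hour")) + "/"
```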

StephenWakely commented 3 years ago

As far as I know, the Array part isn't possible at the moment.

By the time the function receives the array, all the path names have been evaluated, so we only have access to their values and not the path names. The array may not even have path names. So, for example, to_hive_partition([to_hive_partition(), sha2("blah"), 34]) would compile fine.
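A rough Python analogy of the same problem (this is not how the VRL compiler is written, just an illustration of arguments being evaluated before the call; `received_arguments` is a made-up stand-in for a function body):

```python
event = {
    "timestamp": "2021-01-14T21:26:45.433667Z",
    "application_id": 2,
    "environment": "production",
}


def received_arguments(values: list) -> list:
    """Stand-in for a function body: it only ever sees evaluated values."""
    return values


# The list elements are evaluated before the call, so the callee gets plain
# values with no record of the paths (or literals) that produced them.
seen = received_arguments([event["environment"], event["application_id"], 34])
```

Inside the function, `"production"` and `2` are indistinguishable from literals like `34`, so there is nothing to derive a partition name from.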

vladimir-dd commented 3 years ago

Since the proposed design 1) doesn't enforce the field order with the current implementation based on BTreeMap (this can be fixed by switching to IndexMap) and 2) isn't flexible around timestamps, we suggest an alternative solution, based on the template syntax we already support, which addresses both issues:

to_hive_partition("env={{environment}}/app_id={{ application_id }}/year=%Y/month=%m/day=%d/")
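The ordering concern in point 1 can be illustrated with a small Python sketch. It only mimics the iteration order of the two container types (key-sorted for a BTreeMap, insertion order for an IndexMap) and uses made-up field values.

```python
fields = {"env": "production", "app": 2, "ts_day": "14"}

# Insertion order, as an IndexMap-like container would iterate.
insertion_order = "/".join(f"{k}={v}" for k, v in fields.items()) + "/"

# Key-sorted order, as a BTreeMap-like container would iterate.
sorted_order = "/".join(f"{k}={fields[k]}" for k in sorted(fields)) + "/"
```

With key-sorted iteration, `app` moves ahead of `env`, which changes the partition hierarchy. A template string like the one above sidesteps this because the order is written out literally.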

This syntax should already be familiar to users. On the other hand, it becomes so flexible that one can question the need for this specific function at all. So alternatively we could create a common function (e.g. template()) for all possible use cases and add a url_safe parameter to make sure the result is URL-encoded:

template("env={{environment}}/app_id={{ application_id }}/year=%Y/month=%m/day=%d/", url_safe: true, limit: 256)
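A minimal Python sketch of what such a function might do, under several assumptions: `render_template` is a hypothetical name, the timestamp is passed in explicitly rather than read from the event, `url_safe` is taken to mean percent-encoding of the substituted field values only, and `limit` is not modeled because its meaning has not been defined here.

```python
import re
from datetime import datetime, timezone
from urllib.parse import quote

# Matches {{field}} and {{ field }} placeholders.
_FIELD = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, event: dict, ts: datetime, url_safe: bool = False) -> str:
    """Hypothetical stand-in for the proposed `template()` function."""

    def substitute(match):
        value = str(event[match.group(1)])
        # Percent-encode only the substituted value, so the literal "/" and "="
        # in the template keep their structural meaning.
        return quote(value, safe="") if url_safe else value

    # Expand strftime specifiers first so that a "%" introduced by
    # percent-encoding is never mistaken for a time specifier.
    expanded = ts.strftime(template)
    return _FIELD.sub(substitute, expanded)


example = render_template(
    "env={{environment}}/app_id={{ application_id }}/year=%Y/month=%m/day=%d/",
    {"environment": "prod/eu west", "application_id": 2},
    datetime(2021, 1, 14, 21, 26, 45, tzinfo=timezone.utc),
    url_safe=True,
)
```

Encoding only the substituted values is a design choice of this sketch: it keeps a stray `/` in a field value from silently adding a partition level.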

cc @jszwedko

@binarylogic, @FungusHumungus, what do you think?

StephenWakely commented 3 years ago

Yes. This is probably not as convenient for the user as what the original issue requested, but it is a good way to do it that doesn't require making any changes to the VRL compiler. Plus, these functions can be used in a wide range of other scenarios too.

binarylogic commented 3 years ago

@vladimir-dd given that this was not as simple as I originally thought, why don't we table this and come back to it when we have firmer requirements?