[feature request] index name logstash pattern.

jeesim2 commented 8 years ago

Hello.

Thanks for developing such a cool utility. I have moved from logstash to esbulk. By the way, there is one small piece of functionality that I am missing.

We usually have log files that contain a date field, and we create indices using the logstash index pattern (e.g. logstash-2016.05.30). But in some (or many) cases the dates in a single file can be spread over several days, particularly when a rolling strategy based on the local date is enforced.

For example, event_20160530.json may contain these lines:

...
{"time":"2016-05-30T00:00:00.000+0900"} // <--- log 1 ( 05/29 15:00 in UTC )
{"time":"2016-05-30T10:00:00.000+0900"} // <--- log 2 ( 05/30 01:00 in UTC )
...

However, elasticsearch and kibana force conversion to UTC. So log 1 has to go to logstash-2016.05.29 and log 2 has to go to logstash-2016.05.30.

I know it is not a simple problem. But could you please consider a feature something like this?

-index logstash-{yyyy.MM.dd} -date_field time -date_field_pattern yyyy-MM-dd'T'hh:mm:ss.SSSZ
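To make the expected mapping concrete, here is a minimal Go sketch of the conversion. It is only an illustration: the helper name logstashIndex and the input layout are assumptions based on the two sample lines above, not part of esbulk.

```go
package main

import (
	"fmt"
	"time"
)

// inputLayout is the Go reference-time layout for values such as
// "2016-05-30T00:00:00.000+0900" (assumed from the sample lines).
const inputLayout = "2006-01-02T15:04:05.000Z0700"

// logstashIndex is a hypothetical helper: it parses ts, converts it to UTC
// and returns a logstash-style daily index name.
func logstashIndex(ts string) (string, error) {
	t, err := time.Parse(inputLayout, ts)
	if err != nil {
		return "", fmt.Errorf("parsing %q: %w", ts, err)
	}
	return "logstash-" + t.UTC().Format("2006.01.02"), nil
}

func main() {
	for _, ts := range []string{
		"2016-05-30T00:00:00.000+0900", // log 1
		"2016-05-30T10:00:00.000+0900", // log 2
	} {
		name, err := logstashIndex(ts)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(ts, "->", name)
	}
}
```

With these inputs, log 1 maps to logstash-2016.05.29 and log 2 to logstash-2016.05.30.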

miku commented 8 years ago

Thanks for the suggestion, indeed, I haven't had such a use case, but it looks interesting. If I understand correctly, a single file's records should end up in different indices, based on a certain property of the record.

I'm not sure yet, whether this is a genuine fast indexing use case or more a preprocessing thing (split input file on correct boundaries, then index). Let me think about it.

jeesim2 commented 8 years ago

@miku Thanks for the response. As you know, logstash-input-file was not designed for reading large, complete files (https://github.com/logstash-plugins/logstash-input-file/issues/78). So in many cases esbulk can be an alternative, including for me.

whether this is a genuine fast indexing use case

I agree that bulk processing's first goal is fast indexing.

split input file on correct boundaries, then index

I have also considered doing some preprocessing to split the input into one file per date, but as I would have to do that every day, it is a bit of a burden.

Cheers, Jihun

miku commented 8 years ago

Just a quick update: I implemented a first version of dynamic date support - here's a short screencast.

For a given file like this:

$ cat fixtures/dynamic-1.ldj
{"time":"2016-05-01", "name": "a"}
{"time":"2016-05-02", "name": "b"}
{"time":"2016-05-03", "name": "c"}

One can use the golang-style date spec to set a date field and a date field layout:

$ esbulk -verbose -index test-{2006-01-02} -date-field time \
         -date-field-layout 2006-01-02 fixtures/dynamic-1.ldj

The result would be three indices: test-2016-05-01, test-2016-05-02, test-2016-05-03 with one document each.

Another example:

$ cat fixtures/dynamic-2.ldj
{"time":"2016-05-30T10:00:00.000+0900", "name": "a"}
{"time":"2016-05-30T00:00:00.000+0900", "name": "b"}

$ esbulk -verbose -index test-{2006-01-02} -date-field time \
         -date-field-layout 2006-01-02T15:04:05Z0700 fixtures/dynamic-2.ldj

The result would be two indices: test-2016-05-29 and test-2016-05-30, due to conversion to UTC.
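The behaviour shown in these two examples can be sketched in Go roughly as follows. This is an illustrative reconstruction, not esbulk's actual code: the helper indexFor and its parameters are hypothetical, and it assumes the part of the index name in braces is a Go date layout and that dates are converted to UTC before formatting.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// indexFor is a hypothetical helper. It decodes one JSON document, parses
// doc[field] with layout, converts the result to UTC and replaces the {...}
// part of pattern with the date formatted by the layout inside the braces.
func indexFor(line, pattern, field, layout string) (string, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal([]byte(line), &doc); err != nil {
		return "", err
	}
	value, ok := doc[field].(string)
	if !ok {
		return "", fmt.Errorf("field %q is missing or not a string", field)
	}
	t, err := time.Parse(layout, value)
	if err != nil {
		return "", err
	}
	open, end := strings.Index(pattern, "{"), strings.Index(pattern, "}")
	if open < 0 || end < open {
		return pattern, nil // no date placeholder: use the name as is
	}
	return pattern[:open] + t.UTC().Format(pattern[open+1:end]) + pattern[end+1:], nil
}

func main() {
	lines := []string{
		`{"time":"2016-05-30T10:00:00.000+0900", "name": "a"}`,
		`{"time":"2016-05-30T00:00:00.000+0900", "name": "b"}`,
	}
	for _, line := range lines {
		name, err := indexFor(line, "test-{2006-01-02}", "time", "2006-01-02T15:04:05Z0700")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(name)
	}
}
```

Go's time.Parse accepts the fractional seconds in the input even though the layout does not mention them, which is why the layout from the command line above is enough for values like 2016-05-30T10:00:00.000+0900.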

Just a few points that make this feature kind of difficult, at least with the current overall implementation:

For the dynamic index feature, we have to parse each document as JSON. With a slower elasticsearch, this might not be a big issue - but it should be benchmarked.
What to do with time zones? As it is implemented now, a value like 2016-05-30T10:00:00.000+0900 will be parsed as a date with a time zone. As I understand it, it would be better for kibana to convert these dates to UTC. Maybe there is a need for another option, like -convert-to-utc or something like that.

Here's another screencast, showing UTC conversion.

The code for all this is in https://github.com/miku/esbulk/tree/issue-1, feel free to check it out and test it. I am still a bit hesitant to include this, but if you think it would be useful, I will certainly consider it.

miku commented 8 years ago

I'm afraid I cannot implement this at the moment. It would add yet another two flags and I cannot think of an easy way to support this for now.

jeesim2 commented 8 years ago

@miku thank you for the feedback!

miku commented 6 years ago

For the sake of completeness: Elasticsearch has an ingest processor type that can route documents to an index based on a date:

https://www.elastic.co/guide/en/elasticsearch/reference/current/date-index-name-processor.html

The purpose of this processor is to point documents to the right time based index based on a date or timestamp field in a document by using the date math index name support.

miku / esbulk

[feature request] index name logstash pattern. #2