quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

About Query Criteria Time Interval #3165

Open yangshike opened 1 year ago

yangshike commented 1 year ago

I want to query data for a given time interval, but I don't know how to write the query statement. My config:

For example, I want to query msgTime between '2023-01-01 00:00:00' and '2023-01-31 23:59:59':

curl "http://127.0.0.1:7280/api/v1/wework/search?query=msgContent:sss+AND+msgTime:??'"

fmassot commented 1 year ago

@yangshike you should use the start_timestamp and end_timestamp query parameters, which are integers expressed in seconds. More details in the docs: https://quickwit.io/docs/reference/rest-api#parameters-1

yangshike commented 1 year ago

But my msgTime format is '%Y-%m-%d %H:%M:%S', and I also need this format in storage. Is there another way to generate a timestamp field?

when I change the config file to:

I get this error:

fmassot commented 1 year ago

You don't need to change the format.

Your curl should be like this:

curl "http://127.0.0.1:7280/api/v1/wework/search?query=*&start_timestamp=1681154472"

start_timestamp and end_timestamp are specific parameters valid only for the timestamp field; they enable split pruning based on timestamps, see the docs here.

fmassot commented 1 year ago

@evanxg852000 I think we need a guide on the timestamp field pruning.

PSeitz commented 1 year ago

The date format to query should be rfc3339. The queryparser can handle ranges in the format msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z}

Inclusive bounds are [], exclusive are {}

Range queries need to have a fast field (which is the case here)
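The curl failures later in this thread come from the brackets and spaces in this range syntax, which must be percent-encoded when the query is passed as a URL parameter. A minimal sketch with Python's standard library (endpoint and field name taken from this thread; this is just one way to build the URL):

```python
from urllib.parse import quote

# Range query exactly as the query language expects it.
query = "msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]"

# Percent-encode every reserved character so curl/HTTP accept the URL.
encoded = quote(query, safe="")

url = f"http://127.0.0.1:7280/api/v1/wework/search?query={encoded}"
print(url)
```

With curl itself, `-G` together with `--data-urlencode "query=..."` achieves the same effect.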

evanxg852000 commented 1 year ago

I would just like to add that, for internal storage, the values all end up as timestamps; the details are here.

@yangshike we will add a guide on this, and the query parser's limitation of supporting only one format when querying will possibly be lifted when the new query pipeline lands.

yangshike commented 1 year ago

> The date format to query should be rfc3339. The queryparser can handle ranges in the format msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z}
>
> Inclusive bounds are [], exclusive are {}
>
> Range queries need to have a fast field (which is the case here)

curl "http://127.0.0.1:7280/api/v1/wework/search?query=msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]"
curl: (3) bad range in URL position 64: http://127.0.0.1:7280/api/v1/wework/search?query=msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]

yangshike commented 1 year ago

> I would just like to add that for the internal storage of the value, they all end up in timestamps, and the detail are here
>
>   • start_timestamp and end_timestamp are not part of your index fields, these are metadata fields to help pruning split. They cannot be used in the query as fields.
>   • The datetime field you can actually use in your query is msgTime and the only format supported in the query parser as of now is rfc3339
>
> @yangshike we will add a guide on this and possibly the limitation on query parser supporting only one format when querying will be lifted when the new query pipeline lands.

The pruning of time intervals is very important. I would like to know how to prune by time intervals when generating index data, and whether adding a time interval to my query can reduce the amount of scanned data.

yangshike commented 1 year ago

> The date format to query should be rfc3339. The queryparser can handle ranges in the format msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z} Inclusive bounds are [], exclusive are {} Range queries need to have a fast field (which is the case here)

curl "http://127.0.0.1:7280/api/v1/wework/search?query=msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]"
curl: (3) bad range in URL position 64: http://127.0.0.1:7280/api/v1/wework/search?query=msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]

The docs (https://quickwit.io/docs/reference/query-language) say: "Range queries can only be executed on fields with a fast field. Currently only fields of type ip are supported."

PSeitz commented 1 year ago

The query needs to be properly escaped. Using a JSON query body is easier because it avoids curl escaping issues:

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]"}' http://127.0.0.1:7280/api/v1/wework/search

> The docs (https://quickwit.io/docs/reference/query-language) say: "Range queries can only be executed on fields with a fast field. Currently only fields of type ip are supported."

Thanks for pointing that out, that documentation is outdated and needs to be updated. Datetime and numeric fields are supported.

yangshike commented 1 year ago

> The date format to query should be rfc3339. The queryparser can handle ranges in the format msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z} Inclusive bounds are [], exclusive are {} Range queries need to have a fast field (which is the case here)
>
> curl "http://127.0.0.1:7280/api/v1/wework/search?query=msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]"
> curl: (3) bad range in URL position 64: http://127.0.0.1:7280/api/v1/wework/search?query=msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]

> The query needs to be properly escaped. JSON as query body is easier to avoid curl escaping:
>
> curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]"}' http://127.0.0.1:7280/api/v1/wework/search

Thanks, it works!

yangshike commented 1 year ago

I have another question. These two queries take almost the same amount of time:

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy"}' http://127.0.0.1:7280/api/v1/wework/search
curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgContent:yyyy"}' http://127.0.0.1:7280/api/v1/wework/search

Is time pruning ineffective in improving query speed?

evanxg852000 commented 1 year ago

> The pruning of time intervals is very important. I would like to know how to prune by time intervals when generating index data, or if adding time intervals during my query can reduce the amount of scanned data

Yes, the time pruning that is designed to reduce the amount of scanned data (splits) is the one driven by start_timestamp and end_timestamp, and those values should be specified outside of the query, as request parameters.

It also depends on the number of splits you have in relation to your query. Example: say you query with name:foo and your index has 10 splits. All 10 splits will be scanned. But if you apply timestamp pruning with start_timestamp and end_timestamp, you might end up scanning only 1 split, assuming your splits' timestamp ranges are disjoint. If they overlap, then maybe 3 or 4; in most cases it is expected to be fewer than the overall 10.
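The pruning described above can be sketched as follows; the split metadata layout here is hypothetical, but the interval-overlap test is the core idea:

```python
# Hypothetical split metadata: each split records the min/max timestamp
# of the documents it contains.
splits = [
    {"id": "split-1", "min_ts": 1680800000, "max_ts": 1680850000},
    {"id": "split-2", "min_ts": 1680850001, "max_ts": 1680900000},
    {"id": "split-3", "min_ts": 1680900001, "max_ts": 1680950000},
]

def prune(splits, start_ts, end_ts):
    """Keep only splits whose [min_ts, max_ts] overlaps [start_ts, end_ts]."""
    return [s for s in splits if s["max_ts"] >= start_ts and s["min_ts"] <= end_ts]

# With the query window from this thread, only split-2 needs to be scanned.
print([s["id"] for s in prune(splits, 1680879600, 1680890400)])  # ['split-2']
```

With disjoint splits only one split survives here; with overlapping split ranges more of them would, which matches the "3 or 4" estimate above.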

yangshike commented 1 year ago

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy"}' http://127.0.0.1:7280/api/v1/wework/search

Based on this query, how can I add start_timestamp and end_timestamp?

evanxg852000 commented 1 year ago

If my conversion is correct, it should be:

2023-04-07T15:00:00Z -> 1680879600 seconds
2023-04-07T18:00:00Z -> 1680890400 seconds

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy"}' http://127.0.0.1:7280/api/v1/wework/search?start_timestamp=1680879600&end_timestamp=1680890400

The time pruning on splits (start_timestamp and end_timestamp) is applied first, then the search for msgTime is done.

Note: you cannot just use msgTime:*; you still have to restrict your field msgTime, since a split might contain a much larger range and therefore documents outside of [2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z].
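The epoch conversion above can be double-checked with Python's standard library (a quick verification sketch, not part of Quickwit):

```python
from datetime import datetime

def rfc3339_to_epoch(value: str) -> int:
    # fromisoformat() does not accept a trailing 'Z' before Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    return int(datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp())

print(rfc3339_to_epoch("2023-04-07T15:00:00Z"))  # 1680879600
print(rfc3339_to_epoch("2023-04-07T18:00:00Z"))  # 1680890400
```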

PSeitz commented 1 year ago

I would not mix POST and GET parameters; the query should be:

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]", "start_timestamp":1680879600, "end_timestamp":1680890400}' http://127.0.0.1:7280/api/v1/wework/search

yangshike commented 1 year ago

> Note: you cannot just use msgTime:*, you still have to restrict your field msgTime since a split might contain a much larger range, therefore, documents that are outside of [2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z]

I think if my msgTime interval is the same as the start_timestamp and end_timestamp interval, then there is no need for msgTime in the query criteria.

yangshike commented 1 year ago

> I would not mix post and get parameters, the query should be: curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]", "start_timestamp":1680879600, "end_timestamp":1680890400}' http://127.0.0.1:7280/api/v1/wework/search

Thank you very much, it works!

yangshike commented 1 year ago

Here comes a new problem!!

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy","start_timestamp":1680850800,"end_timestamp":1680861600}' "http://127.0.0.1:7280/api/v1/wework/search"
the response: { "num_hits": 0 }

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy"}' "http://127.0.0.1:7280/api/v1/wework/search"
the response: { "num_hits": 2 }

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgContent:yyyy","start_timestamp":1680850800,"end_timestamp":1680861600}' "http://127.0.0.1:7280/api/v1/wework/search"
the response: { "num_hits": 0 }

These are the msgTime values in my documents:
{ "msgSeq": 1206588, "msgTime": "2023-04-07T15:13:20Z", "msgType": "text" }
{ "msgSeq": 1210152, "msgTime": "2023-04-07T17:54:59Z", "msgType": "text" }

PSeitz commented 1 year ago

end_timestamp 1680861600 = Apr 07 2023 10:00:00, so no results is correct. Timestamps have no timezone (they are UTC/GMT+0).

I used this tool: https://www.epochconverter.com/

yangshike commented 1 year ago

Shouldn't the time zone come from the operating system's environment variables? Is it normal for me to get this time on my operating system?

I think the problem lies in metastore.json: the timestamp time zone generated by Quickwit during indexing is incorrect. Do I need to set the time zone for Quickwit?

clickhouse01 :) select toDateTime(1680861600);

SELECT toDateTime(1680861600)

Query id: e897094b-b9ba-4eff-bfa5-fa3114040401

┌─toDateTime(1680861600)─┐
│    2023-04-07 18:00:00 │
└────────────────────────┘

1 row in set. Elapsed: 0.001 sec.

PSeitz commented 1 year ago

There's no timezone handling in Quickwit yet, so it will not take the timezone of your OS. Queries need to be expressed in the UTC/GMT+0 timezone.
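Given that, the safe pattern is to convert any local wall-clock time to UTC before building the query or the epoch parameters. A sketch with a fixed GMT+8 offset (the timezone implied by this thread):

```python
from datetime import datetime, timedelta, timezone

# 18:00 wall-clock time in GMT+8.
local = datetime(2023, 4, 7, 18, 0, 0, tzinfo=timezone(timedelta(hours=8)))

# Normalize to UTC before talking to Quickwit.
utc = local.astimezone(timezone.utc)
print(utc.isoformat())       # 2023-04-07T10:00:00+00:00
print(int(utc.timestamp()))  # 1680861600
```

This reproduces the mismatch above: 1680861600 reads as 18:00 in GMT+8 but is 10:00 in UTC.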

yangshike commented 1 year ago

I have temporarily resolved this issue: there is a timestamp field in my source data, and I have set it as the timestamp field: timestamp_field: msgTimestamp

The previous timestamp field in the config file was: timestamp_field: msgTime

but my msgTime looks like this: "msgTime": "2023-04-07T17:54:59Z"

my config file:

Quickwit needs to convert msgTime to a timestamp (what time zone is selected during this process?).

I think the problem is that Quickwit chose the wrong time zone when converting msgTime to a timestamp.

PSeitz commented 1 year ago

Since there's no timezone handling in quickwit, indexed data will be interpreted as UTC.

I'm not sure if the VRL feature can be used to normalize the indexed data to UTC ? @guilload

guilload commented 1 year ago

It does not seem that there is anything wrong with parsing the field msgTime in Quickwit. 2023-04-07T17:54:59Z is a valid RFC 3339 datetime. The suffix Z indicates that the timezone is UTC. I ran the unit tests for the RFC 3339 datetime parser, and it passed.

let date_time = parse_rfc3339("2023-04-07T17:54:59Z").unwrap();
assert_eq!(date_time, datetime!(2023-04-07 17:54:59 UTC));

I believe the issue is not passing UTC timestamps to Quickwit in the query. If I use your ClickHouse example, 2023-04-07 18:00:00 in RFC 3339 format (UTC) is 2023-04-07T18:00:00Z. The corresponding timestamp is 1680890400, not 1680861600.

let date_time = parse_rfc3339("2023-04-07T18:00:00Z").unwrap();
assert_eq!(date_time, datetime!(2023-04-07 18:00:00 UTC));
assert_eq!(date_time.unix_timestamp(), 1680890400);

yangshike commented 1 year ago

Okay, it could be that there was a problem with the source data conversion. My msgTime in Kafka is as follows: msgTime: "2023-04-07 17:54:59"

But after saving into Quickwit, the msgTime automatically changes to: msgTime: "2023-04-07T17:54:59Z"

How can I save it converted to the time zone I want?

guilload commented 1 year ago

Ok, so in that case, it will use the second parser and assume the timezone is UTC. Unfortunately, at the moment, Quickwit does not support specifying a timezone manually.

The solution that @PSeitz mentioned is to add a transform parameter to your source to append the correct time zone.

source_id: ...
source_type: kafka
...
transform:
  script: |
    .msgTime = join!([.msgTime, "+08:00"], separator: " ")

Under the hood, it is using VRL.

Don't forget to modify the datetime parser format to "%Y-%m-%d %H:%M:%S %z".
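What the transform accomplishes can be sketched in Python: append the offset to the raw Kafka value, then parse with a %z-aware format so the value is anchored to the right timezone (illustrative only; in Quickwit the join happens in VRL and the parsing in the datetime field parser):

```python
from datetime import datetime

raw = "2023-04-07 17:54:59"    # msgTime as it arrives from Kafka
with_tz = raw + " +08:00"      # what the VRL join! appends

# Same format string as the parser config mentioned above.
parsed = datetime.strptime(with_tz, "%Y-%m-%d %H:%M:%S %z")
print(parsed.isoformat())      # 2023-04-07T17:54:59+08:00
```

Without the appended offset, the same string would be interpreted as UTC and end up eight hours off.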

yangshike commented 1 year ago

Okay, thank you. This issue has been temporarily resolved.

There should be some performance loss from the transform script, right?

However, it would be best if one could manually specify the time zone in Quickwit.

yangshike commented 1 year ago

> if my conversion is correct it should be: 2023-04-07T15:00:00Z -> 1680879600 seconds 2023-04-07T18:00:00Z -> 1680890400 seconds
>
> curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy"}' http://127.0.0.1:7280/api/v1/wework/search?start_timestamp=1680879600&end_timestamp=1680890400
>
> The time pruning on splits (starttimestamp and end timestamp) is applied first, then the search for msgTime is done.
>
> Note: you cannot just use msgTime:*, you still have to restrict your field msgTime since a split might contain a much larger range, therefore, documents that are outside of [2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z]

I tested it: using only start_timestamp and end_timestamp is about twice as fast as using start_timestamp and end_timestamp plus the msgTime:["xx" TO "xx"] range in the query.

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy","start_timestamp":168080800,"end_timestamp":1680861600}' "http://127.0.0.1:7280/api/v1/wework/search"
"elapsed_time_micros": 248844

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgContent:yyyy","start_timestamp":1680850800,"end_timestamp":1680861600}' "http://127.0.0.1:7280/api/v1/wework/search"
"elapsed_time_micros": 82504

fulmicoton commented 1 year ago

That's correct. We do not handle pruning when the range is supplied in the user query.

yangshike commented 1 year ago

> curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy","start_timestamp":168080800,"end_timestamp":1680861600}' "http://127.0.0.1:7280/api/v1/wework/search"

I also provided start_timestamp and end_timestamp for this query. Shouldn't it prune first?

fulmicoton commented 1 year ago

Ah yes, then pruning must have kicked in.

Post-pruning, the way the range is enforced is different in the two cases.

In the first query it runs as a range query. In the second query it is a "post filter" at collection level today.

It might be worth investigating which one is best and making sure the best implementation kicks in where it makes sense.