Open yangshike opened 1 year ago
@yangshike you should use the start_timestamp and end_timestamp query parameters, which are integers expressed in seconds. More details in the docs: https://quickwit.io/docs/reference/rest-api#parameters-1
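For reference, a quick way to derive those integer second values from RFC 3339 datetimes (a sketch in Python; the specific datetimes are the ones discussed later in this thread):

```python
from datetime import datetime

def to_epoch_seconds(rfc3339: str) -> int:
    """Convert an RFC 3339 UTC datetime string to integer Unix seconds."""
    # Python < 3.11 fromisoformat() does not accept a trailing 'Z', so normalize it.
    dt = datetime.fromisoformat(rfc3339.replace("Z", "+00:00"))
    return int(dt.timestamp())

# Values usable as start_timestamp / end_timestamp query parameters:
print(to_epoch_seconds("2023-04-07T15:00:00Z"))  # 1680879600
print(to_epoch_seconds("2023-04-07T18:00:00Z"))  # 1680890400
```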
But my msgTime format is '%Y-%m-%d %H:%M:%S', and I also need this format in storage. Is there any other way to generate a timestamp field?
When I change the config file to:

- name: msgTime
  type: datetime
  input_formats:

I get this error:

curl "http://127.0.0.1:7280/api/v1/wework/search?query=msgTime:%3C=1681154472"
{"message":"Invalid query: The date field has an invalid format"}

curl "http://127.0.0.1:7280/api/v1/wework/search?query=start_timestamp:%3C=1681154472"
{"message":"Invalid query: Unknown field start_timestamp"}
You don't need to change the format.
Your curl should be like this:
curl "http://127.0.0.1:7280/api/v1/wework/search?query=*&start_timestamp=1681154472"
start_timestamp and end_timestamp are specific parameters that only apply to the timestamp field; they enable split pruning based on timestamps, see the docs here.
@evanxg852000 I think we need a guide on the timestamp field pruning.
The date format to query should be RFC 3339. The query parser can handle ranges in the format msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z}.
Inclusive bounds are [], exclusive bounds are {}.
Range queries need the field to be a fast field (which is the case here).
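To illustrate the bound semantics (a hedged Python sketch, not Quickwit's actual implementation): [A TO B] includes both endpoints, while a closing } excludes the upper bound. RFC 3339 UTC strings in the same format compare correctly as plain strings, which the sketch relies on:

```python
def in_range(value: str, low: str, high: str,
             incl_low: bool = True, incl_high: bool = True) -> bool:
    """Check a value against a range bound; RFC 3339 UTC strings of the same
    format compare correctly lexicographically."""
    ok_low = value >= low if incl_low else value > low
    ok_high = value <= high if incl_high else value < high
    return ok_low and ok_high

# msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z} -> inclusive low, exclusive high
assert in_range("2002-10-02T15:00:00Z",
                "2002-10-02T15:00:00Z", "2002-10-02T18:00:00Z", True, False)
assert not in_range("2002-10-02T18:00:00Z",
                    "2002-10-02T15:00:00Z", "2002-10-02T18:00:00Z", True, False)
```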
I would just like to add that for internal storage, the values all end up as timestamps; the details are here.
start_timestamp and end_timestamp are not part of your index fields; they are metadata fields that help prune splits and cannot be used as fields in the query. The datetime field you can actually use in your query is msgTime, and the only format the query parser supports as of now is RFC 3339.
@yangshike we will add a guide on this, and the query parser limitation of supporting only one format when querying will possibly be lifted when the new query pipeline lands.
The date format to query should be rfc3339. The queryparser can handle ranges in the format
msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z}
Inclusive bounds are [], exclusive are {}
Range queries need to have a fast field (which is the case here)
curl "http://127.0.0.1:7280/api/v1/wework/search?query=msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]"
curl: (3) bad range in URL position 64: http://127.0.0.1:7280/api/v1/wework/search?query=msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]
The pruning of time intervals is very important. I would like to know how to prune by time intervals when generating index data, or if adding time intervals during my query can reduce the amount of scanned data
The query needs to be properly escaped. Using a JSON body is easier, as it avoids curl escaping issues:
curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]"}' http://127.0.0.1:7280/api/v1/wework/search
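The "curl: (3) bad range" error comes from curl interpreting the unencoded brackets as a URL glob range. If you do want to keep the GET form, percent-encoding the query parameter also works (a sketch in Python; the endpoint is the one from this thread):

```python
from urllib.parse import quote

query = "msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]"
# Encode everything (safe="") so brackets, colons, and spaces are all escaped.
encoded = quote(query, safe="")
url = "http://127.0.0.1:7280/api/v1/wework/search?query=" + encoded
print(encoded)
# msgTime%3A%5B2002-10-02T15%3A00%3A00Z%20TO%202002-10-02T18%3A00%3A00Z%5D
```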
the documents: https://quickwit.io/docs/reference/query-language : Range queries Range queries can only be executed on fields with a fast field. Currently only fields of type ip are supported.
Thanks for pointing that out, that documentation is outdated and needs to be updated. Datetime and numeric fields are supported.
The query needs to be properly escaped. JSON as query body is easier to avoid curl escaping:
curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]"}' http://127.0.0.1:7280/api/v1/wework/search
thanks, it works ok!
I have another question.

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy"}' http://127.0.0.1:7280/api/v1/wework/search
curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgContent:yyyy"}' http://127.0.0.1:7280/api/v1/wework/search

These two queries take almost the same amount of time. Is time pruning ineffective at improving query speed?
The pruning of time intervals is very important. I would like to know how to prune by time intervals when generating index data, or if adding time intervals during my query can reduce the amount of scanned data
Yes, the time pruning designed to reduce the amount of scanned data (splits) is the one driven by start_timestamp and end_timestamp, and those values should be specified outside of the query, as request parameters.
It also depends on the number of splits you have in relation to your query.
Example: let's say you query with name:foo and your index has 10 splits. All 10 splits will be scanned. But if you apply timestamp pruning with start_timestamp and end_timestamp, you might end up scanning only 1 split, assuming your splits' timestamp ranges are disjoint. If they overlap, then maybe 3 or 4; in most cases it is expected to be less than the overall 10.
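The split-pruning idea above can be sketched like this (hypothetical split metadata and numbers; Quickwit's real implementation differs):

```python
from typing import List, Tuple

# Each split carries (min_timestamp, max_timestamp) metadata, in seconds.
# Here: 10 disjoint splits of 10000 seconds each (illustrative values only).
SPLITS: List[Tuple[int, int]] = [
    (1680800000 + i * 10000, 1680800000 + (i + 1) * 10000) for i in range(10)
]

def prune(splits, start_ts, end_ts):
    """Keep only splits whose [min, max] range overlaps [start_ts, end_ts]."""
    return [s for s in splits if s[0] <= end_ts and s[1] >= start_ts]

# A narrow time window touches only 1 split instead of all 10.
selected = prune(SPLITS, 1680845000, 1680846000)
print(len(selected))  # 1
```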
curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy"}' http://127.0.0.1:7280/api/v1/wework/search
Based on this query, how can I add start_timestamp and end_timestamp?
If my conversion is correct it should be: 2023-04-07T15:00:00Z -> 1680879600 seconds, 2023-04-07T18:00:00Z -> 1680890400 seconds.
curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy"}' http://127.0.0.1:7280/api/v1/wework/search?start_timestamp=1680879600&end_timestamp=1680890400
The time pruning on splits (start_timestamp and end_timestamp) is applied first, then the search on msgTime is done.
Note: you cannot just use msgTime:*; you still have to restrict the msgTime field, since a split might span a much larger range and therefore contain documents outside of [2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z].
I would not mix post and get parameters, the query should be:
curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]", "start_timestamp":1680879600, "end_timestamp":1680890400}' http://127.0.0.1:7280/api/v1/wework/search
Note: you cannot just use msgTime:*; you still have to restrict the msgTime field, since a split might span a much larger range and therefore contain documents outside of [2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z].
I think if my msgTime interval is the same as the start_timestamp and end_timestamp interval, then there is no need for msgTime in the query criteria.
I would not mix post and get parameters, the query should be:
curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2002-10-02T15:00:00Z TO 2002-10-02T18:00:00Z]", "start_timestamp":1680879600, "end_timestamp":1680890400}' http://127.0.0.1:7280/api/v1/wework/search
Thank you very much, it works ok!
Here comes a new problem!

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy","start_timestamp":1680850800,"end_timestamp":1680861600}' "http://127.0.0.1:7280/api/v1/wework/search"
the response: { "num_hits": 0 }

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy"}' "http://127.0.0.1:7280/api/v1/wework/search"
the response: { "num_hits": 2 }

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgContent:yyyy","start_timestamp":1680850800,"end_timestamp":1680861600}' "http://127.0.0.1:7280/api/v1/wework/search"
the response: { "num_hits": 0 }
This is the msgTime in my documents:
{ "msgSeq": 1206588, "msgTime": "2023-04-07T15:13:20Z", "msgType": "text" }
{ "msgSeq": 1210152, "msgTime": "2023-04-07T17:54:59Z", "msgType": "text" }
end_timestamp 1680861600 = Apr 07 2023 10:00:00 in UTC, so no results is correct. There is no timezone handling; everything is UTC (GMT+0).
I used this tool: https://www.epochconverter.com/
Shouldn't the time zone come from the operating system's environment variables? Is it normal that I have to query with the OS's local time converted?
I think the problem lies in metastore.json: the timestamp time zone Quickwit generates during indexing is incorrect. Do I need to set the time zone for Quickwit?
clickhouse01 :) SELECT toDateTime(1680861600);

Query id: e897094b-b9ba-4eff-bfa5-fa3114040401

┌─toDateTime(1680861600)─┐
│    2023-04-07 18:00:00 │
└────────────────────────┘

1 row in set. Elapsed: 0.001 sec.
There's no timezone handling in Quickwit yet, so it will not take the timezone of your OS. Queries need to be expressed in the UTC/GMT+0 timezone.
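Since Quickwit interprets everything as UTC, timestamps computed in a local timezone must be converted before being used as query parameters. A hedged Python sketch, assuming a GMT+8 local zone as in this thread:

```python
from datetime import datetime, timezone, timedelta

local_tz = timezone(timedelta(hours=8))  # assumption: the local zone is GMT+8

# "2023-04-07 18:00:00" local (GMT+8) is 10:00:00 UTC -> 1680861600,
# which is why end_timestamp=1680861600 matched no documents in the
# 15:00-18:00 UTC window above.
local = datetime(2023, 4, 7, 18, 0, 0, tzinfo=local_tz)
print(int(local.timestamp()))  # 1680861600

# The value to pass for 18:00 UTC is different:
utc = datetime(2023, 4, 7, 18, 0, 0, tzinfo=timezone.utc)
print(int(utc.timestamp()))  # 1680890400
```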
I have temporarily worked around this issue: there is a numeric timestamp field in my source data, and I have set it as the timestamp field: timestamp_field: msgTimestamp.
The previous timestamp field in the config file was timestamp_field: msgTime, but my msgTime looks like this: "msgTime": "2023-04-07T17:54:59Z".
My config file:
Quickwit needs to convert msgTime to a timestamp (what time zone is selected during this process?). I think the problem is that Quickwit chose the wrong time zone when converting msgTime to a timestamp.
Since there's no timezone handling in quickwit, indexed data will be interpreted as UTC.
I'm not sure if the VRL feature can be used to normalize the indexed data to UTC ? @guilload
It does not seem that there is anything wrong with Quickwit's parsing of the msgTime field. 2023-04-07T17:54:59Z is a valid RFC 3339 datetime; the suffix Z indicates that the timezone is UTC. I ran the unit tests for the RFC 3339 datetime parser, and they pass:
let date_time = parse_rfc3339("2023-04-07T17:54:59Z").unwrap();
assert_eq!(date_time, datetime!(2023-04-07 17:54:59 UTC));
I believe the issue is that the timestamps passed to Quickwit in the query were not UTC. Taking your ClickHouse example, 2023-04-07 18:00:00 in RFC 3339 format (UTC) is 2023-04-07T18:00:00Z. The corresponding timestamp is 1680890400, not 1680861600.
let date_time = parse_rfc3339("2023-04-07T18:00:00Z").unwrap();
assert_eq!(date_time, datetime!(2023-04-07 18:00:00 UTC));
assert_eq!(date_time.unix_timestamp(), 1680890400);
Okay, it could be that there was a problem with the source data conversion. My msgTime in Kafka looks like this: msgTime: "2023-04-07 17:54:59".
But after being ingested into Quickwit, msgTime automatically becomes: msgTime: "2023-04-07T17:54:59Z".
How can I have it converted to the time zone I want at ingestion?
Ok, so in that case, it will use the second parser and assume the timezone is UTC. Unfortunately, at the moment, Quickwit does not support specifying a timezone manually.
The solution that @PSeitz mentioned is to add a transform parameter to your source to append the correct time zone.
source_id: ...
source_type: kafka
...
transform:
script: |
.msgTime = join!([.msgTime, "+08:00"], separator: " ")
Under the hood, it is using VRL.
Don't forget to modify the datetime parser format to "%Y-%m-%d %H:%M:%S %z".
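What the transform plus the %z parser format accomplish can be emulated in Python (a sketch; the +08:00 offset and the format string are the ones from the comment above):

```python
from datetime import datetime

raw = "2023-04-07 17:54:59"          # msgTime as it arrives from Kafka
with_tz = " ".join([raw, "+08:00"])  # what the VRL join! transform appends

# Parsed with "%Y-%m-%d %H:%M:%S %z", the value is timezone-aware, so
# converting to Unix seconds yields the correct UTC instant.
dt = datetime.strptime(with_tz, "%Y-%m-%d %H:%M:%S %z")
print(int(dt.timestamp()))  # 1680861299 (= 2023-04-07T09:54:59Z)
```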
Okay, thank you. This issue has been temporarily resolved.
There should be some performance cost to the transform script, right?
Still, it would be best if the time zone could be specified manually in Quickwit.
I tested it: using only start_timestamp and end_timestamp is about twice as fast as adding both the timestamp parameters and a msgTime:["xx" TO "xx"] range to the query.

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy","start_timestamp":168080800,"end_timestamp":1680861600}' "http://127.0.0.1:7280/api/v1/wework/search"
"elapsed_time_micros": 248844

curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgContent:yyyy","start_timestamp":1680850800,"end_timestamp":1680861600}' "http://127.0.0.1:7280/api/v1/wework/search"
"elapsed_time_micros": 82504
That's correct. We do not handle pruning when the range is supplied in the user query
curl -X POST -H 'Content-Type: application/json' -d '{"query": "msgTime:[2023-04-07T15:00:00Z TO 2023-04-07T18:00:00Z] AND msgContent:yyyy","start_timestamp":168080800,"end_timestamp":1680861600}' "http://127.0.0.1:7280/api/v1/wework/search"
I also provided start_timestamp and end_timestamp for this query. Shouldn't it be pruned first?
Ah yes, then pruning must have kicked in.
Post-pruning, the way the range is enforced differs between the two cases: in the first query it runs as a range query; in the second, it is applied as a "post filter" at collection level today.
It might be worth investigating which one is faster and making sure the better implementation kicks in where it makes sense.
I want to query data for a given time interval, but I don't know how to write the query statement. My config:
For example, I want to query msgTime between '2023-01-01 00:00:00' and '2023-01-31 23:59:59'.
curl "http://127.0.0.1:7280/api/v1/wework/search?query=msgContent:sss+AND+msgTime:??'"