3011

Check List

[x] New functionality includes testing.
[x] New functionality has been documented.
- [x] New functionality has javadoc added.
- [x] New functionality has a user manual doc added.
[x] API changes companion pull request created.
[x] Commits are signed per the DCO using --signoff.
[x] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

jduo commented 1 month ago

I have this almost hooked up. I loaded the students table which has name, gpa, and grad_year fields. When I issue this PPL query, it seems like it is using the schema from the implied ProjectOperator instead of using the schema from the TRENDLINE command, even though I overrode TrendlineOperator#schema() to just build a schema based on the computations list: { "query" : "source=students | TRENDLINE SMA(1, gpa) as foo " }

I get the following JSON result of null arrays: { "schema": [ { "name": "grad_year", "type": "long" }, { "name": "name", "type": "string" }, { "name": "gpa", "type": "float" } ], "datarows": [ [ null, null, null ], [ null, null, null ], [ null, null, null ] ], "total": 3, "size": 3 }

However, if I change the PPL to use an alias that happens to have the same name as the original field: { "query" : "source=students | TRENDLINE SMA(1, gpa) as gpa " } I get data back correctly for one of the array elements in each row.

Is it correct that ProjectOperator does not use the schema from its input?

jduo commented 1 month ago

I would expect the only field out of this schema to be the one computation in trendline ("foo"), rather than all 3 fields in the real index, but perhaps I'm mistaken here.

YANG-DB commented 1 month ago

@vamsi-amazon @penghuo can you please verify?

jduo commented 1 month ago

Possible design for trendline output schema:

If the field in the input is not in the trendline computations, it shows up unaltered.
If the field is used in trendline and the computation alias is the same as the field name, it gets replaced with the trendline computation.
If the field is used in trendline and the computation alias has a different name than the field name, it shows up as a new field in the result.
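
The three rules above can be sketched in a few lines of Java. This is only an illustration of the proposed semantics, not the plugin's code: the class `TrendlineSchemaSketch`, the method `outputFields`, and the "input"/"trendline" labels are all hypothetical names, and the sketch assumes that an alias matching any input field name replaces that field in place.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed trendline output-schema rules. */
final class TrendlineSchemaSketch {

  /**
   * Returns the output field names in order, each mapped to where its value
   * comes from: "input" (passed through) or "trendline" (computed).
   *
   * @param inputFields field names produced by the child operator, in order
   * @param aliases     aliases of the trendline computations, in order
   */
  static Map<String, String> outputFields(List<String> inputFields, List<String> aliases) {
    Map<String, String> output = new LinkedHashMap<>();
    // Rule 1: fields not touched by a computation pass through unaltered.
    for (String field : inputFields) {
      output.put(field, "input");
    }
    for (String alias : aliases) {
      // Rule 2: an alias equal to an existing field name replaces that field.
      // Re-putting an existing key in a LinkedHashMap keeps its position.
      // Rule 3: an alias with a new name is appended as a new field.
      output.put(alias, "trendline");
    }
    return output;
  }
}
```

With the students table from this thread, `outputFields(List.of("grad_year", "name", "gpa"), List.of("foo"))` would yield the three index fields followed by `foo`, while an alias of `gpa` would keep three fields and mark `gpa` as computed.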

jduo commented 3 weeks ago

Requesting reviews from @LantaoJin @MaxKsyunz Thanks

YANG-DB commented 3 weeks ago

@jduo did you manage to review the Spark trendline PR?

@jduo yes, I think it makes sense... @penghuo @dai-chen?

jduo commented 3 weeks ago

@YANG-DB, I used the PPL parser code from the Spark PR. The schema semantics seem to be the same AFAIK, but I haven't tried the Spark one out. The same goes for the handling of results when there aren't enough samples (returning NULL). @kt-eliatra?
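
For reference, the "not enough samples returns NULL" behavior for a simple moving average could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: `SimpleMovingAverageSketch` and `sma` are hypothetical names, values are plain doubles, and input rows are assumed to be non-null.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of SMA(n, field) with NULL until n samples are seen. */
final class SimpleMovingAverageSketch {

  /**
   * @param window number of samples to average (the first SMA argument)
   * @param values input values in row order; assumed to contain no nulls
   * @return one result per input row; null while fewer than window rows were seen
   */
  static List<Double> sma(int window, List<Double> values) {
    if (window < 1) {
      throw new IllegalArgumentException("window must be >= 1");
    }
    List<Double> result = new ArrayList<>(values.size());
    Deque<Double> recent = new ArrayDeque<>();
    double sum = 0.0;
    for (double value : values) {
      recent.addLast(value);
      sum += value;
      // Drop the oldest sample once the window is exceeded.
      if (recent.size() > window) {
        sum -= recent.removeFirst();
      }
      if (recent.size() < window) {
        result.add(null); // not enough samples yet
      } else {
        result.add(sum / window);
      }
    }
    return result;
  }
}
```

With a window of 1, as in the queries earlier in this thread, every row has enough samples, so the result simply mirrors the input values.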

jduo commented 3 weeks ago

The majority of the implementation is done. There's some more work left to support datetime types. Only simple moving average is implemented, not weighted moving average.

jduo commented 3 weeks ago

Datetime support has been added, so this is effectively code complete (only simple moving average is supported in this iteration).