teragrep / pth_10

Data Processing Language (DPL) translator for Apache Spark
GNU Affero General Public License v3.0
0 stars 2 forks source link

chart and stats don't sort results when grouped by _time #256

Closed ronja-ui closed 2 weeks ago

ronja-ui commented 3 months ago

Describe the bug chart and stats commands don't sort results correctly when they are grouped by _time column. However, timechart sorts results as expected.

Expected behavior Results should be sorted in a chronological order when they are grouped by _time column.

How to reproduce

%dpl
index=crud earliest=-3y
| spath
| chart avg(balance) by operation _time

or replace chart with stats, results are the same.

From graph UI settings, put _time in keys, operation in groups and balance in values.

For a comparison, see the same with timechart:

%dpl
index=crud earliest=-3y
| spath
| timechart avg(balance) by operation

Screenshots

Software version pth_10 version: 4.20.1 pth_06 version: 2.3.0 pth_03 version: 5.7.0

Desktop (please complete the following information if relevant):

Additional context

51-code commented 2 months ago

Works in timechart because it calls .orderBy("_time") for the dataset as the final operation in the command.

There doesn't seem to be any default sorting happening for chart or stats right now.

ronja-ui commented 2 months ago

Using sort with chart and stats seem to work. I think this should be the solution to this problem. :)

51-code commented 2 months ago

That indeed fixes the problem. The bug still remains that there should be sorting happening for the column specified in the "by" instruction. This is a problem with not just the _time but any other column as well. So the sorting should still happen inside the command.

Furthermore, the | chart avg(balance) by operation _time does not actually work as it should. The first argument of the "by" instruction should be how to divide the results in rows (called t_row_Parameter in PTH-03) and this works as expected. The second argument (called t_column_Parameter in PTH-03) has a completely different meaning and it should pivot the dataset so that the unique values in that column will be the new columns of the dataset. However, now the command actually puts them both in the same List of "divide by" instructions and operates both similarly. I will make a new issue about that bug.

51-code commented 2 months ago

chart somewhat sorted the results for other columns, but in descending order even though it should be ascending. _time column seemed to be completely random.

stats had no sorting at all.

No sorting seems to be a problem because these are both aggregation commands, so the sorting that would happen in BatchCollect (in dpf_02) is not active.

Implemented sorting for both commands, moving on to test in QA.

51-code commented 2 months ago

QA testing notes: Both commands sort properly now if only one BY column is used. For two (or multiple in case of stats) BY columns, only the last will be used for sorting. This was unexpected because SortOperation in BatchCollect actually takes in a list of SortByOperations (the columns to sort by), but it just keeps overriding the sorting for each of them so only the last one will have any effect.

What now:

Trying to use dataset.orderBy() with multiple columns in dpf_02, now it runs that function for each column separately.

51-code commented 2 months ago

Made a PR to DPF-02 that should fix the issue in the stats command.

51-code commented 2 months ago

stats sorts multiple columns correctly now, after the DPF-02 fix was merged. However, there is an issue in both stats and chart, where the sorting won't work correctly when certain commands like spath and eval are used before them. These commands change the datatype to String. This is the same issue that was already fixed for sort in #153. Only that this time the automatic sorting doesn't work, as it utilizes functions that are not available for structured streaming. This didn't matter with sort as the command does not operate in streaming mode.

Solution would be to either come up with a way to do the automatic sorting in DPF-02 in streaming mode, or just change chart and stats to sequential_only commands like sort. It would be smart to utilize the SortStep in chart and stats to prevent code duplication if the automatic sorting for streaming mode is not possible.

51-code commented 2 months ago

Internal PR created.