teragrep / pth_10

Data Processing Language (DPL) translator for Apache Spark
GNU Affero General Public License v3.0

timechart divideByInstruction generates only one column of data #237

Open 51-code opened 5 months ago

51-code commented 5 months ago

Describe the bug

The query index=abc OR index=xyz earliest=2021-10-22T03:00:00.0 | timechart count by index results in a dataset that has the columns "count" and "index". If both indexes have data for the same date, that day gets two rows: one for index abc and one for xyz.

Expected behavior

Resulting dataset should have as many columns as there are values in the index column (or generally, the values of the column specified in the BY clause). So in this case there should be the date column and then the columns abc and xyz. This also means that a time period is only present once in the data, no duplicates.
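
To make the expected shape concrete, here is a minimal sketch in plain Java (standard library only, no Spark) that reshapes rows of (time, index, count) into one row per time bucket with one column per index value. The class TimechartPivotSketch, the LongRow type, and the sample counts are made up for illustration and are not part of pth_10.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustration only; the class and method names are hypothetical, not pth_10 code.
final class TimechartPivotSketch {

    // One row of the "long" result from the bug report: a count per time bucket and index.
    static final class LongRow {
        final String time;
        final String index;
        final long count;

        LongRow(String time, String index, long count) {
            this.time = time;
            this.index = index;
            this.count = count;
        }
    }

    // Reshapes long rows into time bucket -> (index value -> count).
    // Each bucket appears once; a series with no data in a bucket gets 0.
    static Map<String, Map<String, Long>> pivot(List<LongRow> rows) {
        // Collect every distinct BY value first so all rows share the same columns.
        TreeSet<String> columns = new TreeSet<>();
        for (LongRow r : rows) {
            columns.add(r.index);
        }

        // TreeMap keeps buckets in ascending order (ISO-8601 strings sort chronologically).
        Map<String, Map<String, Long>> wide = new TreeMap<>();
        for (LongRow r : rows) {
            Map<String, Long> row = wide.get(r.time);
            if (row == null) {
                row = new LinkedHashMap<>();
                for (String c : columns) {
                    row.put(c, 0L);
                }
                wide.put(r.time, row);
            }
            row.merge(r.index, r.count, Long::sum);
        }
        return wide;
    }

    public static void main(String[] args) {
        List<LongRow> rows = Arrays.asList(
                new LongRow("2021-10-22", "abc", 3),
                new LongRow("2021-10-22", "xyz", 5),
                new LongRow("2021-10-23", "abc", 2));
        // Prints {2021-10-22={abc=3, xyz=5}, 2021-10-23={abc=2, xyz=0}}
        System.out.println(pivot(rows));
    }
}
```

Filling missing series with 0 is an assumption of this sketch; the issue does not say whether an absent value should be 0 or null.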

If there are more than 10 values in the index column, only the 10 largest indexes should get their own columns; the rest should be added together and presented in a single column named "OTHER".
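
The limit and OTHER behaviour could be sketched roughly as follows, again in plain Java. This assumes that "largest" means the largest total over all time buckets, which the issue does not spell out, and that the OTHER column is only added when something was actually folded into it. TimechartLimitSketch and applyLimit are hypothetical names, not pth_10 code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustration only; names are hypothetical and the ranking rule is an assumption.
final class TimechartLimitSketch {

    // Input and output: time bucket -> (series name -> value).
    // Keeps the `limit` series with the largest totals and sums the rest into "OTHER".
    static Map<String, Map<String, Long>> applyLimit(Map<String, Map<String, Long>> wide, int limit) {
        // Total of each series over all time buckets (assumed ranking criterion).
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> row : wide.values()) {
            for (Map.Entry<String, Long> e : row.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }

        // Largest total first; ties are broken by name so the result is deterministic.
        List<String> ranked = new ArrayList<>(totals.keySet());
        ranked.sort((a, b) -> {
            int byTotal = Long.compare(totals.get(b), totals.get(a));
            return byTotal != 0 ? byTotal : a.compareTo(b);
        });

        Set<String> kept = new LinkedHashSet<>(ranked.subList(0, Math.min(limit, ranked.size())));
        boolean needOther = ranked.size() > kept.size();

        Map<String, Map<String, Long>> out = new TreeMap<>();
        for (Map.Entry<String, Map<String, Long>> bucket : wide.entrySet()) {
            Map<String, Long> row = new LinkedHashMap<>();
            for (String c : kept) {
                row.put(c, bucket.getValue().getOrDefault(c, 0L));
            }
            if (needOther) {
                long other = 0;
                for (Map.Entry<String, Long> e : bucket.getValue().entrySet()) {
                    if (!kept.contains(e.getKey())) {
                        other += e.getValue();
                    }
                }
                row.put("OTHER", other);
            }
            out.put(bucket.getKey(), row);
        }
        return out;
    }
}
```

With limit set to 2 and a single bucket holding a=10, b=1, c=5, this keeps a and c and reports OTHER=1.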

How to reproduce

Run the query above.

Software version

PTH-03: 5.4.0 PTH-10: 4.18.0-4-ga632493a

51-code commented 5 months ago

The new columns can be created with the dataset's pivot function. However, pivot doesn't work with streaming datasets, so timechart has to be changed to be a "sequential only" command (not streaming).

This surfaced an issue where, in sequential mode, timechart doesn't correctly join the dataset with the time buckets (e.g. one row of data per day); instead it more or less overwrites the data. A fix for this is in progress.

51-code commented 4 months ago

Committed code to the dev branch where the columns are now created correctly. However, it's still not perfect, as there is an ongoing issue with BatchCollect doing its own sorting even when no sort column is given to it, leading to a descending instead of ascending order for the _time column. See https://github.com/teragrep/dpf_02/pull/13

Beyond that, the limit of 10 columns and the OTHER column still need to be done, and I have to make sure that the columns are created even if there is no data at all in the index or in the timeframe given in earliest/latest.

51-code commented 4 months ago

Empty datasets are now taken into account in the implementation. On further review, the limit and the OTHER column are better handled together with implementing the limit parameter, so that work continues in #213.

51-code commented 4 months ago

There were still minor issues when timechart was used together with the predict command, but those are fixed now. Starting to test in QA.

51-code commented 4 months ago

QA testing showed that batches append the results instead of joining them together, i.e. the forEachBatch was in Append OutputMode.

Refactored timechartStep into three steps. This allows timechart to run the aggregates first in parallel mode and only switch to forEachBatch if a BY clause is given in the command, which is the only part that doesn't work in parallel mode. Now the results are joined together correctly.
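
As a rough illustration of the difference between appending and joining batch results, the plain-Java sketch below treats each batch as a map from time bucket to a partial count. It is not the pth_10 implementation: BatchMergeSketch and both method names are hypothetical, and summing partial results is only valid for aggregates such as count or sum, not for averages.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only; names are hypothetical, not pth_10 code.
final class BatchMergeSketch {

    // Append behaviour: rows from each batch are concatenated,
    // so the same time bucket can show up once per batch.
    static List<Map.Entry<String, Long>> appendBatches(List<Map<String, Long>> batches) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (Map<String, Long> batch : batches) {
            for (Map.Entry<String, Long> e : batch.entrySet()) {
                out.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
            }
        }
        return out;
    }

    // Join behaviour: partial counts for the same time bucket are summed,
    // so each bucket appears exactly once, in ascending order.
    static Map<String, Long> mergeBatches(List<Map<String, Long>> batches) {
        Map<String, Long> out = new TreeMap<>();
        for (Map<String, Long> batch : batches) {
            for (Map.Entry<String, Long> e : batch.entrySet()) {
                out.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return out;
    }
}
```

For two batches that both contain the bucket 2021-10-22, appendBatches returns that bucket twice, while mergeBatches returns it once with the summed count.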

51-code commented 2 months ago

The limit parameter is almost ready. With just one aggregate in timechart, it already works correctly. However, timechart can also have multiple aggregations, which means most of the columns have to be renamed and repositioned after pivoting the dataset, and this is still a work in progress.

The useother parameter will be ready almost immediately after limit. The two have been developed nearly in parallel, as they are strongly connected.