Open 51-code opened 4 months ago
Other possible solutions that came to mind for PTH-10:
Does this scale in O(n), i.e. does 10x the input result in 10x the processing time? Our internal datasets are relatively small, so it might be a good idea to verify whether there's more design work to do.
I tested with the same query and with varying amounts of data. The original query (100% in this case) was 539071 records.
In summary: auto sorting increases the query time, and the relative increase grows with the dataset size, but it seems to level off at around 90-100%, so applying the automatic datatype check appears to be O(n).
Below are the results.
Results with 185% of the dataset (query time increase of 91%):
Results with 100% of the dataset (query time increase of 75%):
Results with 53% of the dataset (query time increase of 98%):
Results with 38% of the dataset (query time increase of 53%):
Results with 22% of the dataset (query time increase of 27%):
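The reasoning behind the O(n) conclusion can be written out as a short bound. This is only a sketch: the symbols $T_d$, $T_a$, $r$ and $r_{\max}$ are introduced here for illustration, and the cap $r_{\max} \approx 1$ is an assumption read off the five measurements above, not a proven limit.

```latex
% T_d(n): query time with default sorting on n records
% T_a(n): query time with auto sorting on n records
% r(n):   measured relative increase in query time
\[
  r(n) = \frac{T_a(n) - T_d(n)}{T_d(n)}
  \quad\Longrightarrow\quad
  T_a(n) = \bigl(1 + r(n)\bigr)\, T_d(n)
\]
% If the increase stays below a constant cap r_max (about 1, i.e. 100%):
\[
  r(n) \le r_{\max} \approx 1
  \quad\Longrightarrow\quad
  T_a(n) \le (1 + r_{\max})\, T_d(n) \approx 2\, T_d(n)
\]
```

If the cap holds, auto sorting stays within a constant factor of the default query, so it grows no faster than the default query does. The datatype check is strictly O(n) only to the extent that the default query itself scales linearly.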
Describe the bug
Using the automatic sorting type in the sort command results in a significant increase in query time. The culprit seems to be the numericalStringCheck() function. The function should be reimplemented with performance in mind.
Expected behavior
The automatic sorting shouldn't increase the query time too much.
How to reproduce
Run sort first with default sorting:
The query took 4 min 22 sec for me.
Then run sort with the auto sorting:
The query took 7 min 39 sec for me, an increase of about 75%.
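For reference, the relative increase between the two runs can be computed from the reported timings. The helper name `to_seconds` and the variable names are illustrative only.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'min sec' timing into whole seconds."""
    return minutes * 60 + seconds


# Timings reported above for the same query.
default_sort_s = to_seconds(4, 22)  # default sorting
auto_sort_s = to_seconds(7, 39)     # auto sorting

# Relative increase of the auto-sorted run over the default run, in percent.
increase_pct = (auto_sort_s - default_sort_s) / default_sort_s * 100
print(f"{increase_pct:.1f}%")  # prints "75.2%"
```

This matches the 75% figure reported for 100% of the dataset.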
sort can also take multiple columns to sort by. With two columns, auto sorting would again increase the query time, to close to 11 minutes.
Screenshots
Software version
DPF-02 version 3.0.0
PTH-10 version 5.3.0-7-ge44d00e9
Desktop (please complete the following information if relevant):
Additional context
Auto sorting is a very useful tool in many cases, because in PTH-10 some commands change the datatype of columns to String, as they use Spark's User Defined Functions, which can only return a single datatype. The downside is that this breaks sorting of numerical values, which is the problem auto sorting deals with.
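The effect of a column being turned into String can be illustrated with a small, generic example. It uses plain Python lists rather than PTH-10 or Spark, and the sample values are made up.

```python
# Numeric values that have been converted to strings.
values = ["10", "9", "100", "2"]

# Sorting as strings compares character by character, so "100" sorts before "2".
as_strings = sorted(values)

# Sorting by numeric value restores the expected order.
as_numbers = sorted(values, key=float)

print(as_strings)  # ['10', '100', '2', '9']
print(as_numbers)  # ['2', '9', '10', '100']
```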
For example, in PTH-10 issue #256 default sorting for chart and stats is being implemented, but it suffers from the same performance issue if auto sorting is used to fix the problem of running e.g. spath before the commands. (spath uses UDFs and changes everything in the dataset to String.)
Matching numbers with a regex in numericalStringCheck() was already tried, but it didn't improve performance.
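One alternative to a regex is a single pass over the characters with an early exit on the first non-numeric character. The sketch below only illustrates the logic: `looks_numeric` is a hypothetical name, it is not how numericalStringCheck() is implemented, and timings in Python say nothing about how a similar approach would perform in PTH-10. It accepts only optionally signed decimal numbers, so exponent notation such as "1e5" is rejected.

```python
def looks_numeric(s: str) -> bool:
    """Return True for optionally signed decimals such as '42', '-3.14' or '.5'."""
    n = len(s)
    if n == 0:
        return False
    # Skip a single leading sign.
    i = 1 if s[0] in "+-" else 0
    digits = 0
    seen_dot = False
    while i < n:
        c = s[i]
        if "0" <= c <= "9":
            digits += 1
        elif c == "." and not seen_dot:
            seen_dot = True
        else:
            # Bail out at the first character that cannot be part of a number.
            return False
        i += 1
    # Require at least one digit, so "", "-" and "." are rejected.
    return digits > 0
```

The early exit means non-numeric strings are usually rejected after one or two characters, so most of the cost falls on values that really are numeric.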