stats dc on a column with no data produces NullPointerException

kortemik commented 2 months ago

Describe the bug

The following query

| makeresults | eval raw="kissa@1"| rex4j field=raw "koira@(?<koira>\d)"

produces a column named koira without data

_time                          raw      koira
2024-08-29T11:12:22.000+03:00  kissa@1  No data provided

Running a distinct count on that column as follows

| makeresults | eval raw="kissa@1"| rex4j field=raw "koira@(?<koira>\d)" | stats dc(koira)

produces a NullPointerException:

java.lang.NullPointerException
    at org.apache.zeppelin.interpreter.InterpreterOutput.write(InterpreterOutput.java:334)
    at org.apache.zeppelin.interpreter.InterpreterResult.add(InterpreterResult.java:90)
    at org.apache.zeppelin.interpreter.InterpreterResult.<init>(InterpreterResult.java:75)
    at com.teragrep.pth_07.DPLExecutor.interpret(DPLExecutor.java:237)
    at com.teragrep.pth_07.DPLInterpreter.internalInterpret(DPLInterpreter.java:165)
    at org.apache.zeppelin.interpreter.AbstractInterpreter.interpret(AbstractInterpreter.java:47)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:110)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:860)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:752)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
    at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
    at org.apache.zeppelin.scheduler.FIFOScheduler.lambda$runJobInScheduler$0(FIFOScheduler.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Please investigate the underlying cause, as the exception is raised outside pth_10. Perhaps the resulting dataframe is null?

Expected behavior

The query should return a count of 0 for the column koira.

How to reproduce

Run the stats dc(koira) query shown above.

Software version

pth_10 version: 6.0.0
pth_06 version: 3.1.2
dpf_02 version: 3.0.0
jpr_01 version: 3.1.1
pth_03 version: 6.1.4
jue_01 version: 0.4.3
dpf_03 version: 10.0.1

eemhu commented 2 months ago

Running the query in a pth_10 unit test results in:

java.lang.NullPointerException
    at com.teragrep.pth10.ast.commands.aggregate.UDAFs.DistinctCountAggregator.reduce(DistinctCountAggregator.java:130)

eemhu commented 2 months ago

DistinctCountAggregator does not handle null values.
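
For illustration, a minimal Java sketch of the kind of null guard that would give the expected count of 0. This is not the pth_10 code: the class NullSafeDistinctCount and its methods reduce, merge, finish and distinctCount are hypothetical stand-ins that only mirror the reduce/merge/finish shape of an aggregator, and the sketch assumes that a column with no data reaches reduce as null.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a distinct-count aggregator; not the pth_10 class.
class NullSafeDistinctCount {

    // Adds one input value to the buffer. A null value is assumed to mean
    // "no data" and is skipped instead of being dereferenced.
    static Set<String> reduce(Set<String> buffer, String value) {
        if (value != null) {
            buffer.add(value);
        }
        return buffer;
    }

    // Combines two partial buffers, as a distributed aggregation would.
    static Set<String> merge(Set<String> left, Set<String> right) {
        left.addAll(right);
        return left;
    }

    // Final result: the number of distinct non-null values seen.
    static int finish(Set<String> buffer) {
        return buffer.size();
    }

    // Convenience helper that runs the whole aggregation over one column.
    static int distinctCount(List<String> column) {
        Set<String> buffer = new HashSet<>();
        for (String value : column) {
            buffer = reduce(buffer, value);
        }
        return finish(buffer);
    }

    public static void main(String[] args) {
        // Column with no data, like koira in the query above: prints 0
        System.out.println(distinctCount(Arrays.asList((String) null)));
        // Nulls are ignored and duplicates are counted once: prints 2
        System.out.println(distinctCount(Arrays.asList("1", null, "1", "2")));
    }
}
```

Skipping nulls in reduce, rather than filtering rows before the aggregation, keeps the result a single row with count 0 even when every input value is null, which matches the expected behavior described in the report.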

eemhu commented 2 months ago

Internal PR submitted with fixes.

teragrep / pth_10

stats dc on a column with no data produces NullPointerException #291