teragrep / pth_10

Data Processing Language (DPL) translator for Apache Spark
GNU Affero General Public License v3.0

Reading from hdfs not working #225

Closed ronja-ui closed 5 months ago

ronja-ui commented 6 months ago

Describe the bug

Reading from HDFS either throws an exception or returns incorrect results.

Expected behavior

When stats count is used after an HDFS load, the following error appears:

org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;

In addition, when sourcetype is used as a search keyword, the results are not filtered as expected.

How to reproduce

%dpl
| teragrep exec hdfs load [dataset-name] 
| dedup _raw
| stats count

This query throws the AnalysisException. The same query without stats works as expected.

In addition, the following query does not return correct results:

%dpl
| teragrep exec hdfs load [dataset-name] 
| search sourcetype="[keyword]" 
| search NOT static
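
For reference, the filtering this query is expected to perform can be sketched in plain Python. This is only an illustration, not pth_10 code: it assumes Splunk-like semantics where sourcetype="..." is a case-insensitive equality test and NOT static drops events whose _raw contains the term (simplified here to a case-insensitive substring match). The function name, the sample events, and the value example:type are hypothetical stand-ins for the elided [keyword].

```python
def matches(event, sourcetype, excluded_term):
    """Return True if the event should survive both search stages.

    Assumption: sourcetype comparison is case-insensitive equality and
    the NOT term is a case-insensitive substring test on _raw.
    """
    same_type = event["sourcetype"].lower() == sourcetype.lower()
    has_term = excluded_term.lower() in event["_raw"].lower()
    return same_type and not has_term


# Hypothetical sample events; the real dataset is not part of this report.
events = [
    {"sourcetype": "example:type", "_raw": "dynamic value"},
    {"sourcetype": "example:type", "_raw": "static value"},
    {"sourcetype": "other:type", "_raw": "dynamic value"},
]

# Only the first event has the right sourcetype and lacks the term.
kept = [e for e in events if matches(e, "example:type", "static")]
```

Under these assumptions only events with the requested sourcetype and without the excluded term should remain, which is the behavior the query above does not deliver.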

Software version

pth_03: 5.2.0
pth_06: 2.3.0
pth_07: 5.17.0
pth_10: 4.17.0

eemhu commented 6 months ago

Related issues: #232, #205

HDFS load itself is not the issue here. The problem is using an aggregate command (stats) after dedup, which is a sequential-only command.
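
To make the failing pattern concrete, here is a minimal Python sketch that flags a query in which an aggregate command follows a sequential-only command. It is not how pth_10 detects or handles this. The command sets contain only the two commands named in this issue, the helper names are hypothetical, and the naive split on "|" ignores pipes inside quoted strings.

```python
# Assumption: only the commands named in this issue are classified here.
SEQUENTIAL_ONLY = {"dedup"}
AGGREGATES = {"stats"}


def command_names(query):
    """Return the first word of each pipe-separated stage."""
    stages = [stage.strip() for stage in query.split("|")]
    return [stage.split()[0] for stage in stages if stage]


def aggregate_after_sequential(query):
    """True if an aggregate command appears after a sequential-only one."""
    seen_sequential = False
    for name in command_names(query):
        if name in SEQUENTIAL_ONLY:
            seen_sequential = True
        elif name in AGGREGATES and seen_sequential:
            return True
    return False


failing = "| teragrep exec hdfs load [dataset-name] | dedup _raw | stats count"
working = "| teragrep exec hdfs load [dataset-name] | dedup _raw"

print(aggregate_after_sequential(failing))
print(aggregate_after_sequential(working))
```

The first query from the report matches the pattern, while the same query without stats does not, which lines up with the observation that the query works once stats is removed.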

eemhu commented 5 months ago

Internal PR 595 merged, closing