teragrep / pth_10

Data Processing Language (DPL) translator for Apache Spark
GNU Affero General Public License v3.0

Unify null fields to a certain specification of null #199

Closed eemhu closed 5 months ago

eemhu commented 7 months ago

Describe the bug: Currently, different commands may return different representations of null: the string "null", the empty string "", or Spark's NULL field.

Expected behavior: All nulls should use the same representation to avoid confusion and processing issues.

How to reproduce: For example, the "dedup" command uses the string "null" for null values, but "spath" uses an empty string.
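The mismatch can be illustrated without Spark. The following is a minimal Python sketch, not pth_10 code: `None` stands in for Spark's NULL, and the `representations` mapping and the `is_null` helper are hypothetical names made up for this illustration. It shows that a single strict null check answers differently for each of the three representations named above.

```python
# Hypothetical illustration only: None stands in for Spark's NULL.
# The three values are the representations named in this issue.
representations = {
    "string 'null'": "null",
    "empty string": "",
    "Spark NULL": None,
}


def is_null(value):
    """Strict null check: only a real null (None here) counts."""
    return value is None


# Apply the same check to every representation.
results = {name: is_null(value) for name, value in representations.items()}

for name, answer in results.items():
    print(f"{name}: is_null={answer}")
```

Only the real null passes the check, so downstream logic that relies on a null test would treat output from different commands differently.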

Software version

pth_10 4.16.0

Additional context: https://sparkbyexamples.com/spark/spark-replace-empty-value-with-null-on-dataframe/ To be fixed in the Spark 3 version, as the null specification changes from 2.4 to 3.x.

eemhu commented 7 months ago

Note: Check commands like isnull and isnotnull after this change.

eemhu commented 5 months ago

Based on some research, I think the following is the way to go:

  1. Any command that produces an empty row or no result should return Spark's null.
  2. An empty string "" is NOT the same as null.
  3. stats count ignores only nulls, not empty strings, e.g. null() is skipped, but "" is not.

Basically, we need to go through all the commands that can produce an "empty row" and make sure the result follows the Spark null spec.
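The three rules can be sketched in plain Python. This is only an illustration under stated assumptions, not the pth_10 implementation: `None` stands in for Spark's null, and `make_empty_result` and `stats_count` are hypothetical helpers that model the proposed behavior.

```python
def make_empty_result():
    """Hypothetical command output for 'no result': a real null (rule 1)."""
    return None


def stats_count(values):
    """Count every value except real nulls (rule 3).

    Empty strings are ordinary values (rule 2), so they are counted.
    """
    return sum(1 for v in values if v is not None)


# A column mixing ordinary values, an empty string, and real nulls.
column = ["a", "", make_empty_result(), "null", None]

print(stats_count(column))  # prints 3: "a", "" and the string "null"
```

Under this model the string "null" is also just a string, which is why commands that currently emit it would need to switch to a real null for the count to skip their empty results.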