intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

DIEN preprocessing error on latest whl #112

Closed hkvision closed 3 years ago

hkvision commented 3 years ago

If I use az 0.11.0, I can successfully run dien_preprocessing.py on Spark local. If I use the latest nightly build (0.12.0.dev0 on our nexus) or some other recent versions (e.g. 0.12.0b20210830), I get the following error:

2021-09-06 19:20:37 ERROR Utils:91 - Aborting task
org.apache.spark.SparkException: Failed to execute user defined function(anonfun$padMatrix$1: (int, array<array<int>>) => array<array<int>>)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage23.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:244)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
2021-09-06 19:20:37 ERROR FileFormatWriter:70 - Job job_20210906191757_0070 aborted.
2021-09-06 19:20:37 ERROR FileFormatWriter:70 - Job job_20210906191757_0070 aborted.
2021-09-06 19:20:37 ERROR Utils:91 - Aborting task

Reproducible on clx006 using the dien conda environment.

songhappy commented 3 years ago

Compared to az 0.11.0, the new versions of gen_string_idx and encode_string generate more data and lose some data, and then nulls appear. Here is sample code; the data will be sent by email.

from zoo.friesian.feature import FeatureTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
input = "/Users/guoqiong/intelWork/git/analytics-zoo/pyzoo/preprocessed_5/cat_str"
item_df = spark.read.parquet(input)
item_tbl = FeatureTable(item_df)
category_index = item_tbl.gen_string_idx(["category"], 1)
encoded = item_tbl.encode_string(["category"], category_index)
print(item_tbl.df.select("category").distinct().count())
print(category_index[0].df.select("category").count())
print(encoded.df.select("category").count())
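
As a sanity check on that sample (assuming the intended invariants: one index row per distinct value, and encoding neither adds nor drops rows), the three counts can be tightened into assertions:

assert category_index[0].df.count() == item_tbl.df.select("category").distinct().count()  # one index row per distinct category
assert encoded.df.count() == item_tbl.df.count()  # encoding should preserve the row count
assert encoded.df.filter(encoded.df["category"].isNull()).count() == 0  # encoding should introduce no nulls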

hkvision commented 3 years ago

@jenniew Please check your modifications to gen_string_idx and encode_string.

cyita commented 3 years ago

@songhappy In category_index = item_tbl.gen_string_idx(["category"], 1), the 1 should be removed: jiao added a new parameter to gen_string_idx, so this 1 is now passed as doSplit rather than freq_limit.
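
To make the mis-binding concrete, here is a minimal sketch; the stub signatures below are illustrative only, not the actual Friesian code:

# Illustrative stubs; the real gen_string_idx does actual work.
def gen_string_idx_old(columns, freq_limit=None):
    pass  # old signature: the second positional argument binds to freq_limit

def gen_string_idx_new(columns, doSplit=False, freq_limit=None):
    pass  # new signature: doSplit now sits in the second positional slot

gen_string_idx_new(["category"], 1)             # 1 binds to doSplit -- the bug
gen_string_idx_new(["category"], freq_limit=1)  # keyword form keeps the intended meaning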

jenniew commented 3 years ago

I think it would be better to pass kwargs when calling gen_string_idx and encode_string. I'll also update these APIs to keep the same parameter order as the old ones.
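
For instance, the call in the sample above would then read (assuming freq_limit keeps its current name):

category_index = item_tbl.gen_string_idx(["category"], freq_limit=1)
encoded = item_tbl.encode_string(["category"], category_index)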

hkvision commented 3 years ago

@yizerozhuang Please confirm this issue has been fixed on the latest code.