Closed hkvision closed 3 years ago
Compared to az 0.11.0, the new version of gen_string_idx and encode_string generates extra data and loses some data, which results in nulls. Here is sample code; the data will be sent by email.
from zoo.friesian.feature import FeatureTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
input = "/Users/guoqiong/intelWork/git/analytics-zoo/pyzoo/preprocessed_5/cat_str"
item_df = spark.read.parquet(input)
item_tbl = FeatureTable(item_df)
category_index = item_tbl.gen_string_idx(["category"], 1)
encoded = item_tbl.encode_string(["category"], category_index)
print(item_tbl.df.select("category").distinct().count())
print(category_index[0].df.select("category").count())
print(encoded.df.select("category").count())
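For anyone without access to the dataset, the expectation behind the three counts can be sketched in plain Python. This is only a stand-in under stated assumptions, not the Friesian implementation: gen_index and encode are hypothetical helpers, and the sample values are made up. The point is that the index should hold one entry per distinct category, and encoding should keep the row count and produce no nulls.

```python
# Hypothetical pure-Python stand-ins for string indexing and encoding.
# They are NOT the Friesian implementation; they only illustrate the invariants.

def gen_index(values):
    """Assign an integer id (starting at 1) to each distinct non-null value."""
    distinct = sorted({v for v in values if v is not None})
    return {value: idx for idx, value in enumerate(distinct, start=1)}

def encode(values, index):
    """Replace each value by its id; values missing from the index become None."""
    return [index.get(v) for v in values]

# Made-up sample data, for illustration only.
values = ["books", "toys", "books", "games"]
index = gen_index(values)
encoded = encode(values, index)

print(len(set(values)))   # number of distinct categories
print(len(index))         # size of the generated index
print(len(encoded))       # number of encoded rows
print(None in encoded)    # whether encoding introduced nulls
```

If the index is larger or smaller than the number of distinct categories, or if encoding yields None for values that were present when the index was built, that matches the symptoms reported here.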
@jenniew Please check your modifications to gen_string_idx and encode_string.
@songhappy In category_index = item_tbl.gen_string_idx(["category"], 1), the 1 should be removed, since jiao added a new param to gen_string_idx and this 1 is now passed as doSplit, not freq_limit.
I think it would be better to pass keyword arguments when calling gen_string_idx and encode_string. I'll also update these APIs to keep the same parameter order as the old ones.
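To make the positional-argument problem concrete, here is a minimal sketch. The two functions are hypothetical stand-ins with assumed signatures, not the real Friesian API; do_split stands in for the doSplit parameter mentioned above. It shows how a positional 1 silently binds to a newly inserted parameter, and how a keyword argument avoids that.

```python
# Hypothetical signatures, for illustration only; not the real Friesian API.

def gen_string_idx_old(columns, freq_limit=None):
    """Assumed old parameter order: freq_limit is the second parameter."""
    return {"columns": columns, "freq_limit": freq_limit}

def gen_string_idx_new(columns, do_split=False, freq_limit=None):
    """Assumed new parameter order: do_split was inserted before freq_limit."""
    return {"columns": columns, "do_split": do_split, "freq_limit": freq_limit}

# With the old order, the positional 1 is interpreted as freq_limit.
old_call = gen_string_idx_old(["category"], 1)

# With the new order, the same positional 1 binds to do_split instead.
new_positional = gen_string_idx_new(["category"], 1)

# A keyword argument binds to the intended parameter regardless of order.
new_keyword = gen_string_idx_new(["category"], freq_limit=1)

print(old_call)
print(new_positional)
print(new_keyword)
```

Calling with freq_limit=1 keeps working whichever order the parameters end up in, which is why keyword arguments are the safer choice in callers.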
@yizerozhuang Please confirm this issue has been fixed on the latest code.
If I use az 0.11.0 I can successfully run dien_preprocessing.py on spark local. If I use the latest nightly build (0.12.0.dev0 on our nexus) or some other recent versions (0.12.0b20210830), I will get the following error:
Reproducible on clx006 using the dien conda environment.