reczoo / FuxiCTR

A configurable, tunable, and reproducible library for CTR prediction https://fuxictr.github.io
Apache License 2.0
905 stars 153 forks source link

Maybe a bug found in feature_preprocess.py #103

Closed ywangwxd closed 2 months ago

ywangwxd commented 2 months ago

I am testing the latest commit of this repo with the criteo dataset. There is a self-defined preprocessor "convert_to_bucket"。 This preprocessor need a column name as its argument. I got error in this piece of code:


1.     def preprocess(self, ddf):
2.         logging.info("Preprocess feature columns...")
3.         all_cols = self.label_cols + self.feature_cols[::-1]
4.         print(all_cols)
5.         for col in all_cols:
6.             name = col["name"]
7.             #if name in ddf.columns:
8.             if name in  ddf.collect_schema().names():
9.                 print(name)
10.                 fill_na = "" if col["dtype"] in ["str", str] else 0
11.                 fill_na = col.get("fill_na", fill_na)
12.                 ddf = ddf.with_columns(pl.col(name).fill_null(fill_na))
13.             if col.get("preprocess"):
14.                 preprocess_args = re.split(r"\(|\)", col["preprocess"])
15.                 preprocess_fn = getattr(self, preprocess_args[0])
16.                 print(preprocess_args)
17.                 ddf = ddf.with_columns(
18.                     #preprocess_fn(*preprocess_args[1:-1])
19.                     preprocess_fn(name, *preprocess_args[1:-1])
20.                     .alias(name)
21.                     .cast(self.dtype_dict[name])
22.                 )
23.         active_cols = [col["name"] for col in all_cols if col.get("active") != False]
24.         ddf = ddf.select(active_cols)
25.         return ddf

The error happened at line 18, where if no arguments are provided, _ddf.with_columns(preprocessfn().alias(name).cast(xxx) will throw exception. At least the column name should be given to this function. I have tested my change, it seems right. So does this mean, we should add the column name into the function arguments at least?

zhujiem commented 2 months ago

In your case, the configuration should be:

preprocess: convert_to_bucket(col_name)
zhujiem commented 2 months ago

To use the latest version for Criteo_x4, it is expected to update the configuration file accordingly.