reczoo / FuxiCTR

A configurable, tunable, and reproducible library for CTR prediction https://fuxictr.github.io
Apache License 2.0

[Suggestion] Update the logic of preprocessing for efficiency #60

Closed Kimyungi closed 1 year ago

Kimyungi commented 1 year ago

I suggest updating the preprocessing logic for efficiency.

In many cases, a user's behavior sequence is the same across all of that user's training samples. Likewise, the features of a given user or item are often identical in every sample where that user or item appears.

However, the current version of FuxiCTR receives the training dataset as a single DataFrame, so these features (e.g., a user's behavior sequence, the features of a user or an item) have to be stored redundantly in that DataFrame, which consumes too much memory, especially on large-scale datasets. In addition, the fit/transform of feature_preprocessor is performed on the redundant behavior sequences and features, which takes too long, again especially on large-scale datasets.

To make preprocessing more efficient, I hope these redundancies can be removed. To this end, I suggest changing the preprocessing logic so that it receives user_df and item_df together with each dataset and runs fit/transform only on the unique features (i.e., on user_df and item_df).
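A minimal sketch of this idea with plain pandas, not FuxiCTR's actual API. The table contents, the column names (`city`, `click_seq`, `category`), the `^` sequence separator, and the `fit_vocab` helper are all assumptions made for illustration. Features are encoded once per unique user and item, then joined onto the interaction rows.

```python
import pandas as pd

# Hypothetical toy tables; all column names and values are illustrative only.
user_df = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "city": ["Seoul", "Busan"],
    "click_seq": ["i1^i2", "i3"],  # one static behavior sequence per user
})
item_df = pd.DataFrame({
    "item_id": ["i1", "i2", "i3"],
    "category": ["book", "toy", "book"],
})
interactions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u1"],
    "item_id": ["i1", "i3", "i2", "i3", "i2"],
    "label": [1, 0, 0, 1, 1],
})


def fit_vocab(values):
    """Map each distinct value to an integer index; 0 is reserved for unknowns."""
    return {v: i + 1 for i, v in enumerate(sorted(set(values)))}


# Fit on the unique rows only: 2 users and 3 items instead of 5 samples.
city_vocab = fit_vocab(user_df["city"])
category_vocab = fit_vocab(item_df["category"])
item_vocab = fit_vocab(item_df["item_id"])

# Transform each unique user and item exactly once.
user_feats = user_df.assign(
    city=user_df["city"].map(city_vocab),
    click_seq=user_df["click_seq"].map(
        lambda s: [item_vocab.get(tok, 0) for tok in s.split("^")]
    ),
)
item_feats = item_df.assign(category=item_df["category"].map(category_vocab))

# Join the already-encoded features onto the interaction rows.
train = (
    interactions
    .merge(user_feats, on="user_id", how="left")
    .merge(item_feats, on="item_id", how="left")
)
```

In this sketch the encoding cost scales with the number of unique users and items rather than with the number of training samples. It only applies when a user's sequence really is static across that user's samples.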

zhujiem commented 1 year ago

Thanks for the suggestion. If we decouple the dataset into user_df and item_df, the dataset can no longer handle some cross features and real-time sequences. In some of the datasets we have tested, a single user has different sequences across samples, because each sequence is computed according to the sample's timestamp. This follows common practice in industry. That said, preprocessing could be accelerated by partitioning the dataset into chunks and removing some of the redundancy.
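One way the chunk-plus-deduplication idea could look, as a rough sketch rather than anything FuxiCTR implements. The row layout, the `^` separator, and the helpers `iter_chunks`, `encode_sequence`, and `preprocess_in_chunks` are hypothetical. Sequences stay attached to each row, so timestamp-dependent sequences are preserved, but each distinct raw sequence is encoded only once through a cache. For brevity the vocabulary is built on the fly instead of in a separate fit pass.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def encode_sequence(raw, vocab):
    """Encode a '^'-separated sequence, assigning new indices as tokens appear."""
    return tuple(vocab.setdefault(tok, len(vocab) + 1) for tok in raw.split("^"))


def preprocess_in_chunks(rows, chunk_size, vocab, cache):
    """Encode rows chunk by chunk, reusing results for repeated raw sequences."""
    for chunk in iter_chunks(rows, chunk_size):
        encoded = []
        for row in chunk:
            raw = row["click_seq"]
            if raw not in cache:  # encode each distinct sequence only once
                cache[raw] = encode_sequence(raw, vocab)
            encoded.append({**row, "click_seq": cache[raw]})
        yield encoded


# Hypothetical samples: the same user has different sequences at different timestamps.
rows = [
    {"user_id": "u1", "ts": 1, "click_seq": "i1"},
    {"user_id": "u1", "ts": 2, "click_seq": "i1^i2"},
    {"user_id": "u1", "ts": 2, "click_seq": "i1^i2"},
    {"user_id": "u2", "ts": 3, "click_seq": "i3"},
    {"user_id": "u2", "ts": 4, "click_seq": "i3"},
]

vocab, cache = {}, {}
chunks = list(preprocess_in_chunks(rows, chunk_size=2, vocab=vocab, cache=cache))
```

Here five samples trigger only three encoding calls, one per distinct sequence, and only one chunk of raw rows is processed at a time.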