[Suggestion] Update the logic of preprocessing for efficiency

I suggest updating the preprocessing logic to make it more efficient.

In many cases, a user's behavior sequence is the same across all training samples for that user. Likewise, the features of a given user or item are often identical in every training sample where that user or item appears.

However, the current version of FuxiCTR receives the training dataset as a single DataFrame, so these features (e.g., a user's behavior sequence, or the features of a user or an item) have to be stored redundantly in that DataFrame. This consumes too much memory, especially on large-scale datasets. In addition, fit/transform of feature_preprocessor has to run over the redundant behavior sequences and features, which takes too long, again especially on large-scale datasets.

To make preprocessing more efficient, I hope these redundancies can be removed. To this end, I suggest changing the preprocessing logic so that it receives user_df and item_df alongside each dataset and runs fit/transform only on the unique features (i.e., on user_df and item_df).
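To illustrate the idea, here is a minimal pandas sketch. It is not FuxiCTR's API: the column names (`behavior_seq`, `category`), the `^` separator, and the helpers `fit_vocab` and `encode_sequence` are all hypothetical and only stand in for whatever feature_preprocessor actually does. The point is that encoding runs once per unique user and item, and the results are joined onto the interaction table afterwards.

```python
import pandas as pd

# Hypothetical toy data; the schema is illustrative, not FuxiCTR's.
interactions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u1"],
    "item_id": ["i1", "i2", "i1", "i3"],
    "label": [1, 0, 1, 0],
})
user_df = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "behavior_seq": ["i9^i8^i7", "i5^i4"],  # stored once per user
})
item_df = pd.DataFrame({
    "item_id": ["i1", "i2", "i3"],
    "category": ["books", "toys", "books"],  # stored once per item
})


def fit_vocab(values):
    """Map each distinct value to an integer id; 0 is reserved for unknowns."""
    return {v: i + 1 for i, v in enumerate(sorted(set(values)))}


def encode_sequence(seq, vocab, sep="^"):
    """Encode a separator-joined sequence string as a list of integer ids."""
    return [vocab.get(tok, 0) for tok in seq.split(sep)]


# Fit and transform once per unique user (2 rows), not once per sample (4 rows).
seq_tokens = [tok for seq in user_df["behavior_seq"] for tok in seq.split("^")]
seq_vocab = fit_vocab(seq_tokens)
user_df["behavior_seq_enc"] = user_df["behavior_seq"].map(
    lambda s: encode_sequence(s, seq_vocab)
)

# Same for items: fit and transform on the unique item table only.
cat_vocab = fit_vocab(item_df["category"])
item_df["category_enc"] = item_df["category"].map(cat_vocab)

# Attach the encoded features to the samples by key.
train = (
    interactions
    .merge(user_df[["user_id", "behavior_seq_enc"]], on="user_id", how="left")
    .merge(item_df[["item_id", "category_enc"]], on="item_id", how="left")
)
print(train)
```

In this sketch the join is materialized up front for clarity; to actually save memory, the lookup by `user_id` and `item_id` could instead happen per batch at data-loading time, so the expanded table never has to exist in full.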

reczoo / FuxiCTR

[Suggestion] Update the logic of preprocessing for efficiency #60