motefly / DeepGBM

SIGKDD'2019: DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks

func trans_cate_data in data_helpers.py for fast version cateNN #17

Closed · hanfu closed this 4 years ago

hanfu commented 4 years ago

Hi, can you elaborate on the "fast version cateNN" approach? How does it work?

motefly commented 4 years ago

In short, it converts the per-field feature IDs into accumulated (offset) ones, so the model can look up the embeddings for all fields in a single pass.

hanfu commented 4 years ago

Thanks for your reply! If I understand correctly, it combines multiple feature fields into one big field by brute-force accumulating the field values, e.g. two two-column fields [(1,2),(1,1)] become [1,2,3,3]. But how does that make the embedding lookup faster? Is there a paper on this? Thanks again for your help!
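
For concreteness, here is a small sketch of what I think the accumulation does (a hypothetical helper, not the actual `trans_cate_data`):

```python
import numpy as np

# Hypothetical helper (not the repo's trans_cate_data): each field keeps its own
# IDs, and we add the running sum of the previous fields' cardinalities so every
# (field, value) pair gets a unique global ID.
def accumulate_ids(field_values, field_cardinalities):
    offsets = np.concatenate(([0], np.cumsum(field_cardinalities)[:-1]))
    return field_values + offsets  # broadcasts the per-field offset over the rows

# Rows are samples, columns are fields; field 0 and field 1 each have 2 values.
x = np.array([[1, 1],
              [2, 1]])
print(accumulate_ids(x, [2, 2]))
# [[1 3]
#  [2 3]]  -> column 0 stays [1, 2], column 1 becomes [3, 3]
```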

motefly commented 4 years ago

Yes. In the codebase I referred to, the embedding lookups ran in a for loop over the fields, so I rewrote it to look up all the embeddings in one united call. It's just a small optimization in the code implementation.
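
Roughly, the difference is something like this (a minimal sketch with made-up sizes, not the exact code in this repo):

```python
import torch
import torch.nn as nn

num_fields, field_size, emb_dim, batch_size = 3, 10, 4, 8
batch = torch.randint(0, field_size, (batch_size, num_fields))  # raw per-field IDs

# Per-field version: one small table per field, looked up in a Python loop.
per_field = nn.ModuleList(nn.Embedding(field_size, emb_dim) for _ in range(num_fields))
loop_out = torch.cat([per_field[i](batch[:, i]) for i in range(num_fields)], dim=1)

# United version: shift each field's IDs by an offset so they index one big
# table, then a single embedding call covers all fields at once.
offsets = torch.arange(num_fields) * field_size          # tensor([0, 10, 20])
big_table = nn.Embedding(num_fields * field_size, emb_dim)
united_out = big_table(batch + offsets).reshape(batch_size, -1)

print(loop_out.shape, united_out.shape)  # both torch.Size([8, 12])
```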

hanfu commented 4 years ago

Hi, thanks again for your reply. What is the codebase you refer to? Also, my question is now about ordinal encoding + embedding. As far as I understand, embedding uses one-hot encoding to look up the embedded vectors for features. How does ordinal encoding work with embedding? Thanks in advance for your patience, really appreciated.

motefly commented 4 years ago

For the original codebase, you can refer to https://github.com/nzc/dnn_ctr/blob/master/model/DeepFM.py#L202 . In PyTorch, the forward pass of nn.Embedding takes ordinal (integer) indices; see the examples at https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding . Conceptually it behaves like a one-hot encoding multiplied by the embedding matrix.
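
A minimal example of the ordinal-index vs. one-hot equivalence (illustration only, not code from this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(num_embeddings=5, embedding_dim=3)
idx = torch.tensor([0, 2, 4])             # ordinal (integer) IDs

lookup = emb(idx)                          # what nn.Embedding does: index rows directly
one_hot = F.one_hot(idx, num_classes=5).float() @ emb.weight  # one-hot times weight matrix

print(torch.allclose(lookup, one_hot))     # True: the two views give the same vectors
```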

hanfu commented 4 years ago

Ahhh, everything makes sense now. Thank you very much!