linkedin / detext

DeText: A Deep Neural Text Understanding Framework for Ranking and Classification Tasks

how to generate wide sparse features #73

Open kiminh opened 2 years ago

kiminh commented 2 years ago

Hi, I'm confused about how to generate the wide sparse features. Here is my understanding: combine the multi-field categorical features together to form a multi-hot sparse feature, then generate the indices by hashing or in a similar way, like label encoding?

kiminh commented 2 years ago

I mean that each categorical feature field has its own vocabulary, so multiple fields have multiple vocabularies. The vocabulary of the multi-hot sparse feature would then be the union of those vocabularies, which is used to index the multi-field categorical features. Alternatively, one could just hash a string like "field_name:categorical_feature_value" to index the feature; that way may have some collisions, but it avoids maintaining the whole vocabulary.
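
For concreteness, here is a minimal sketch of the two indexing strategies I have in mind, assuming a made-up set of fields and a fixed hash bucket size (none of the names below are DeText APIs):

```python
# Sketch only: field names, values, and NUM_BUCKETS are illustrative.
import zlib

# Strategy 1: union vocabulary (label-encoding style).
# Build one global index space by offsetting each field's local vocab.
field_vocabs = {
    "country": ["us", "cn", "de"],
    "device": ["ios", "android"],
}
global_vocab = {}
offset = 0
for field, values in field_vocabs.items():
    for v in values:
        global_vocab[f"{field}:{v}"] = offset
        offset += 1
# offset is now the total vocab size of the multi-hot vector.

# Strategy 2: hash "field:value" into a fixed number of buckets.
# No vocabulary to maintain, at the cost of possible collisions.
NUM_BUCKETS = 10_000
def hashed_index(field, value, num_buckets=NUM_BUCKETS):
    return zlib.crc32(f"{field}:{value}".encode()) % num_buckets

example = {"country": "us", "device": "ios"}
union_indices = [global_vocab[f"{f}:{v}"] for f, v in example.items()]
hashed_indices = [hashed_index(f, v) for f, v in example.items()]
print(union_indices, hashed_indices)
```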

StarWang commented 2 years ago

Hi @kiminh, I assume your question is about DeText-TF2. In DeText TF2, each sparse feature field (wide part) is a multi-hot vector. This vector should be generated by the user beforehand (e.g. by hashing). The vocab size can be passed to DeText through nums_sparse_ftrs.

The vocab for each field is independent of the others; there is no correlation between them.
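
As a minimal sketch of what "generated by the user beforehand" could look like, assuming a simple CRC32 hash and a hypothetical field name; only nums_sparse_ftrs is an actual DeText argument, the rest is illustrative:

```python
# Build a multi-hot vector for one sparse feature field before feeding DeText.
import numpy as np
import zlib

NUM_BUCKETS = 1000  # this value would be what you pass as nums_sparse_ftrs

def to_multi_hot(values, field_name, num_buckets=NUM_BUCKETS):
    """Hash each categorical value of one field into a multi-hot vector."""
    vec = np.zeros(num_buckets, dtype=np.float32)
    for v in values:
        idx = zlib.crc32(f"{field_name}:{v}".encode()) % num_buckets
        vec[idx] = 1.0
    return vec

# Example with two active categories in a hypothetical "skills" field.
sparse_ftr = to_multi_hot(["python", "tensorflow"], field_name="skills")
print(sparse_ftr.sum())  # 2.0, barring a hash collision
```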