[FEATURE] libffmconverter on a list of multiple features within one field

recommenders-team / recommenders

Best Practices on Recommendation Systems

https://recommenders-team.github.io/recommenders/intro.html

MIT License


[FEATURE] libffmconverter on a list of multiple features within one field #843

Open yexing99 opened 5 years ago

yexing99 commented 5 years ago

Description

For genre types in MovieLens, one movie can have multiple genres. When MovieLens data is loaded and the genres are split by "|", some records in the field 'genres_string' contain multiple features as a list, for example [Comedy, Drama] or [Action, Comedy, Drama]. Our current libffm converter can't handle this.

Expected behavior with the suggested feature

Get the distinct feature list from the feature combinations, and then do the conversion.

Other Comments

yueguoguo commented 5 years ago

Currently, the converter does not support this. It wouldn't be hard to first melt the dataframe and then apply the conversion. Alternatively, the original genres can be encoded with, for example, a one-hot encoding scheme before conversion.
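As a rough illustration of the "melt first" idea, here is a minimal plain-Python sketch that unpivots a list-valued field into one row per value. The record layout, the field names, and the `unpivot` helper are assumptions made up for this example and are not part of the converter; with pandas, `DataFrame.explode` does a similar job on a list column.

```python
# Toy records; the field names are assumptions for illustration only.
records = [
    {"userID": 1, "rating": 4, "genres": ["Comedy", "Drama"]},
    {"userID": 2, "rating": 5, "genres": ["Action", "Comedy", "Drama"]},
]


def unpivot(rows, list_field):
    """Yield one copy of each row per value in its list-valued field."""
    for row in rows:
        for value in row[list_field]:
            flat = dict(row)          # shallow copy so the input is not mutated
            flat[list_field] = value  # replace the list with a single value
            yield flat


long_rows = list(unpivot(records, "genres"))
for row in long_rows:
    print(row)
```

After this step every row holds a single genre, so the rows for one record would still need to be regrouped before or after conversion.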

yexing99 commented 5 years ago

Yes, unpivoting followed by some processing would work as a workaround, but one-hot encoding may not work nicely. For example, with [0 0 0 0 1 1 1 0 0 0 0] as input to the converter, the output will be something like 5:1:0 5:2:0 5:3:0 5:4:0 5:5:1 etc. Those zeros should be removed, as the original inputs are actually categorical variables.
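To make the point about zeros concrete, the sketch below turns the one-hot vector from the example into `field:feature:value` tokens and keeps only the non-zero entries. The helper `one_hot_to_libffm`, the field number 5, and the 1-based feature numbering are assumptions chosen to match the example above; this is not how the current converter behaves.

```python
def one_hot_to_libffm(field, first_feature, one_hot):
    """Build field:feature:value tokens, skipping entries whose value is zero."""
    return [
        f"{field}:{first_feature + offset}:{value}"
        for offset, value in enumerate(one_hot)
        if value != 0
    ]


row = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
tokens = one_hot_to_libffm(field=5, first_feature=1, one_hot=row)
print(" ".join(tokens))  # 5:5:1 5:6:1 5:7:1
```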

gramhagen commented 5 years ago

@yexing99 in the example you provide, would the input be the same if the feature had a zero value or if it wasn't provided at all? That's how I thought the non-existent values were interpreted in LibFFM sparse format, but maybe I'm missing something here.

yueguoguo commented 5 years ago

@yexing99 Yea fair point - an alternative is to directly encode the feature (i.e., each genre) within a row of the genre features to avoid zeros produced from a pre-encoding step.

@gramhagen usually, missing values would be interpreted as zeros, and in the libffm format we don't need to explicitly represent those zeros.
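One possible reading of "directly encode each genre within a row" is sketched below: collect the distinct genres, assign each an integer index, and emit one token per genre actually present, so no zeros are ever produced. The helper names, the field number 5, and the first-seen index order are hypothetical choices for illustration, not the converter's API.

```python
def build_feature_index(rows, start=1):
    """Map each distinct genre to an integer index, in first-seen order."""
    index = {}
    for genres in rows:
        for genre in genres:
            if genre not in index:
                index[genre] = start + len(index)
    return index


def encode_row(genres, field, feature_index):
    """Emit one field:feature:1 token per genre present in the row."""
    return " ".join(f"{field}:{feature_index[g]}:1" for g in genres)


rows = [["Comedy", "Drama"], ["Action", "Comedy", "Drama"]]
feature_index = build_feature_index(rows)
lines = [encode_row(g, field=5, feature_index=feature_index) for g in rows]
for line in lines:
    print(line)
```

Here the indices are local to the genre field. A full conversion would likely need them offset so that they do not collide with features from other fields.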

gramhagen commented 5 years ago

It seems like there's a reasonable workaround. But this feature could also be included. How do you want to handle it @yueguoguo ?