Open yexing99 opened 5 years ago
Currently, the converter does not support this. It wouldn't be hard to first melt the dataframe and then apply the conversion. Alternatively, the original genres can be encoded, for example with a one-hot encoding scheme, before the conversion.
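A minimal sketch of the "melt first" workaround, assuming the genres are held as a list column in a pandas dataframe. The column names (`rating`, `userID`, `genres`) are hypothetical, and `DataFrame.explode` stands in for the melt step because the multi-valued data sits in one column rather than across several:

```python
import pandas as pd

# Hypothetical toy data: one row per rating, with a list-valued genre column.
df = pd.DataFrame(
    {
        "rating": [4, 5],
        "userID": [1, 2],
        "genres": [["Comedy", "Drama"], ["Action"]],
    }
)

# Unpivot the list column: one row per (rating, userID, genre) combination.
long_df = df.explode("genres").reset_index(drop=True)
print(long_df)
```

Note that the exploded frame repeats the same rating once per genre, so the number of rows no longer matches the number of original samples. Some further processing would be needed before the result is used as training data.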
Yes, unpivoting followed by some processing would work as a workaround, but one-hot encoding may not work nicely. For example, with [0 0 0 0 1 1 1 0 0 0 0] as input to the converter, the output will be something like 5:1:0 5:2:0 5:3:0 5:4:0 5:5:1, etc. Those zeros should be removed, as the original inputs are actually categorical variables.
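To make the difference concrete, here is a small sketch that formats a one-hot vector as `field:feature:value` tokens, with and without the zero entries. The helper `onehot_to_libffm` is hypothetical and is not part of the converter. It also numbers features locally from 1, whereas a real converter may assign feature indices differently:

```python
def onehot_to_libffm(field, vector, keep_zeros=False):
    """Format a one-hot vector as space-separated field:feature:value tokens."""
    tokens = []
    for idx, value in enumerate(vector, start=1):
        # Zero entries carry no information in a sparse format, so skip them
        # unless the caller explicitly asks to keep them.
        if value == 0 and not keep_zeros:
            continue
        tokens.append(f"{field}:{idx}:{value}")
    return " ".join(tokens)


vector = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

dense = onehot_to_libffm(5, vector, keep_zeros=True)
sparse = onehot_to_libffm(5, vector)

print(dense)   # starts with 5:1:0 5:2:0 5:3:0 5:4:0 5:5:1
print(sparse)  # 5:5:1 5:6:1 5:7:1
```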
@yexing99 in the example you provide, would the input be the same if the feature had a zero value or if it wasn't provided at all? That's how I thought the non-existent values were interpreted in LibFFM sparse format, but maybe I'm missing something here.
@yexing99 Yeah, fair point. An alternative is to directly encode each feature (i.e., each genre) within a row of the genre features, which avoids the zeros produced by a pre-encoding step.
@gramhagen usually, missing values would be interpreted as zeros, and in the libffm format we don't need to explicitly represent those zeros.
It seems like there's a reasonable workaround, but this feature could also be included. How do you want to handle it, @yueguoguo?
Description
For genre types on MovieLens, one movie can have multiple genres. When MovieLens data is loaded and split by "|", some records in the 'genres_string' field contain multiple features as a list, for example [Comedy, Drama] or [Action, Comedy, Drama]. Our current libffm converter can't handle this.
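For reference, a small sketch of how such list-valued records arise, using hand-written sample strings rather than the actual MovieLens loader. The output column name `genres` is an assumption made for illustration:

```python
import pandas as pd

# Hypothetical sample rows in the pipe-separated form described above.
df = pd.DataFrame(
    {"genres_string": ["Comedy|Drama", "Action|Comedy|Drama", "Thriller"]}
)

# Splitting on "|" turns each record into a list of one or more genres.
df["genres"] = df["genres_string"].str.split("|")
print(df["genres"].tolist())
```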
Expected behavior with the suggested feature
Get the distinct feature list from the feature combinations and then do the conversion.
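One way the expected behavior could look, as a rough sketch only: collect the distinct genres, assign each an index, and emit one `field:feature:1` token per genre present in a row. The function `genres_to_libffm` and its parameters are hypothetical, not the converter's API, and the indexing scheme (alphabetical, starting at `start_index`) is an assumption:

```python
import pandas as pd


def genres_to_libffm(df, col, field_index, start_index=1):
    """Encode a list-valued column as libffm-style tokens, one per genre present."""
    # Distinct feature list across all rows, sorted for a stable index assignment.
    distinct = sorted({genre for genres in df[col] for genre in genres})
    feature_index = {g: i for i, g in enumerate(distinct, start=start_index)}

    # Only genres present in a row are emitted, so no zero-valued tokens appear.
    encoded = df[col].apply(
        lambda genres: " ".join(
            f"{field_index}:{feature_index[g]}:1" for g in genres
        )
    )
    return encoded, feature_index


df = pd.DataFrame({"genres": [["Comedy", "Drama"], ["Action", "Comedy", "Drama"]]})
encoded, feature_index = genres_to_libffm(df, "genres", field_index=3)

print(feature_index)     # {'Action': 1, 'Comedy': 2, 'Drama': 3}
print(encoded.tolist())  # ['3:2:1 3:3:1', '3:1:1 3:2:1 3:3:1']
```

Because absent genres are simply not emitted, this matches the sparse interpretation discussed in the thread, where missing features are treated as zeros.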
Other Comments