ChandraLingam closed this issue 6 years ago
You can find the data format in Section 2 of the libFM 1.4.2 manual.
Thank you. Yes, I did review the manual and was attempting to use the Perl script for CSV-to-libFM conversion.
I created a small CSV file using 16 rows from the MovieLens ratings dataset, and the script produced ratings_small.csv.libfm. The output does not seem to match the input (or at least I am not able to interpret what the script did).
triple_format_to_libfm.pl -in ratings_small.csv -target 2 -delete_column 3 -separator ","
transforming file ratings_small.csv to ratings_small.csv.libfm...
userId,movieId,rating,timestamp
1,31,2.5,1260759144
2,10,4.0,835355493
2,17,5.0,835355681
2,39,5.0,835355604
2,47,4.0,835355552
2,50,4.0,835355586
2,52,3.0,835356031
2,62,3.0,835355749
2,110,4.0,835355532
2,144,3.0,835356016
2,150,5.0,835355395
3,60,3.0,1298861675
3,110,4.0,1298922049
3,247,3.5,1298861637
3,267,3.0,1298861761
3,7153,2.5,1298921787
rating 0:1 1:1
2.5 2:1 3:1
4.0 4:1 5:1
5.0 4:1 6:1
5.0 4:1 7:1
4.0 4:1 8:1
4.0 4:1 9:1
3.0 4:1 10:1
3.0 4:1 11:1
4.0 4:1 12:1
3.0 4:1 13:1
5.0 4:1 14:1
3.0 15:1 16:1
4.0 15:1 12:1
3.5 15:1 17:1
3.0 15:1 18:1
2.5 15:1 19:1
Please remove the first line (the header row) from ratings_small.csv and run the same command. You will get:
2.5 0:1 1:1
4.0 2:1 3:1
5.0 2:1 4:1
5.0 2:1 5:1
4.0 2:1 6:1
4.0 2:1 7:1
3.0 2:1 8:1
3.0 2:1 9:1
4.0 2:1 10:1
3.0 2:1 11:1
5.0 2:1 12:1
3.0 13:1 14:1
4.0 13:1 10:1
3.5 13:1 15:1
3.0 13:1 16:1
2.5 13:1 17:1
In this case, the feature index 0 represents userId 1, the feature index 1 represents movieId 31, the feature index 2 represents userId 2, the feature index 3 represents movieId 10, and so on.
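To make the indexing above easier to follow, here is a minimal Python sketch that reproduces the same numbering on the first three data rows. It is not the Perl script itself: the function name `triples_to_libfm` and its parameters are hypothetical, and it assumes that each distinct (column, value) pair receives the next free index in order of first appearance, which is what the output above suggests.

```python
import csv
import io

def triples_to_libfm(rows, target_col, drop_cols=()):
    """One-hot encode every non-target column (hypothetical helper).

    Assumption: each distinct (column, value) pair gets the next free
    index, assigned in order of first appearance across the file.
    """
    index = {}   # (column, value) -> feature index
    lines = []
    for row in rows:
        feats = []
        for col, value in enumerate(row):
            if col == target_col or col in drop_cols:
                continue
            key = (col, value)
            if key not in index:
                index[key] = len(index)
            feats.append(f"{index[key]}:1")
        # The target value comes first, followed by the index:1 pairs.
        lines.append(" ".join([row[target_col]] + feats))
    return lines, index

# First three data rows of ratings_small.csv (header already removed).
SAMPLE = """1,31,2.5,1260759144
2,10,4.0,835355493
2,17,5.0,835355681
"""

rows = list(csv.reader(io.StringIO(SAMPLE)))
lines, index = triples_to_libfm(rows, target_col=2, drop_cols={3})
for line in lines:
    print(line)
```

Here `index` is the lookup table from (column, value) to feature index, so `index[(0, "2")]` gives the feature index for userId 2 and `index[(1, "10")]` the one for movieId 10.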
Thank you very much. One more follow-up question: does this script also handle real-valued features? I added another feature at the end with random values. It appears that the script is one-hot encoding this column as well. Is there a way to preserve the real-valued features as-is?
1,31,2.5,1260759144,0.074345836
2,31,4,835355493,0.428518244
2,10,4,835355493,0.144215787
2,17,5,835355681,0.018740053
2,39,5,835355604,0.793609723
2,47,4,835355552,0.62908026
2,50,4,835355586,0.923838115
2,52,3,835356031,0.920521599
2,62,3,835355749,0.549236466
2,110,4,835355532,0.648895353
2,144,3,835356016,0.697152954
2,150,5,835355395,0.752723242
3,60,3,1298861675,0.803889224
3,110,4,1298922049,0.815850633
3,150,4,835355493,0.08505613
3,247,3.5,1298861637,0.268696775
3,267,3,1298861761,0.235652997
3,7153,2.5,1298921787,0.433312402
Output
2.5 0:1 1:1 2:1
4 3:1 1:1 4:1
4 3:1 5:1 6:1
5 3:1 7:1 8:1
5 3:1 9:1 10:1
4 3:1 11:1 12:1
4 3:1 13:1 14:1
3 3:1 15:1 16:1
3 3:1 17:1 18:1
4 3:1 19:1 20:1
3 3:1 21:1 22:1
5 3:1 23:1 24:1
3 25:1 26:1 27:1
4 25:1 19:1 28:1
4 25:1 23:1 29:1
3.5 25:1 30:1 31:1
3 25:1 32:1 33:1
2.5 25:1 34:1 35:1
I believe the script does not support real-valued features, so it would be better to write your own transformation tool.
If you are unsure how to do that, you could try this Python code: https://github.com/chihming/DataTransformer and the instructions on how to convert the data to your required format: https://github.com/chihming/DataTransformer/wiki/data2sparse Note that this project has been abandoned, but it can still meet your requirement.
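As a starting point for your own tool, here is a rough Python sketch. It assumes the libFM text format accepts arbitrary real values after the colon (`index:value`), as in the libSVM-style format described in the manual. The function name `rows_to_libfm` and its parameters are hypothetical and not part of libFM or DataTransformer. Categorical columns are one-hot encoded, while each real-valued column gets a single fixed index and keeps its value.

```python
import csv
import io

def rows_to_libfm(rows, target_col, categorical_cols, numeric_cols):
    """Convert CSV rows to libFM-style lines (hypothetical helper).

    Categorical columns: one index per distinct (column, value), value 1.
    Numeric columns: one index per column, original value preserved.
    """
    index = {}

    def feature_id(key):
        # Assign the next free index the first time a key is seen.
        if key not in index:
            index[key] = len(index)
        return index[key]

    lines = []
    for row in rows:
        feats = [f"{feature_id((col, row[col]))}:1" for col in categorical_cols]
        feats += [f"{feature_id((col, None))}:{row[col]}" for col in numeric_cols]
        lines.append(" ".join([row[target_col]] + feats))
    return lines

# First three rows of the file with the extra real-valued column.
SAMPLE = """1,31,2.5,1260759144,0.074345836
2,31,4,835355493,0.428518244
2,10,4,835355493,0.144215787
"""

rows = list(csv.reader(io.StringIO(SAMPLE)))
# Column 3 (timestamp) is left out on purpose, like -delete_column 3.
lines = rows_to_libfm(rows, target_col=2, categorical_cols=[0, 1], numeric_cols=[4])
for line in lines:
    print(line)
```

The feature indices within a line are not sorted in this sketch; if a downstream tool expects ascending indices, sort the pairs before joining them.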
Thank you for the prompt response and clarification. I appreciate it. I will close this issue for now.
I have a very basic question: are factorization machines designed to work only with binary fields? Do we need to one-hot encode all features? How are real-valued features handled?
Thank you!