ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Multi-label classification #1265

Open lrocholl opened 2 years ago

lrocholl commented 2 years ago

I'm running some experiments on multi-label classification of movies into one or more genres based on their plot.

My model definition is the following:

    model_definition = {
        'input_features': [
            {'name': 'plot', 'type': 'text', 'level': 'word', 'encoder': 'parallel_cnn'}
        ],
        'output_features': [
            {'name': 'genre_new', 'type': 'set'}
        ]
    }

My training dataset looks like this: image

However, my predictions are not looking really good: image

Is there anything I am missing here? I understand that the format of the set column might be influencing the results but not sure if this is the right approach.

I would appreciate your comments. Thanks.

w4nderlust commented 2 years ago

@lrocholl thank you for asking this question. I believe the issue stems from the way you are providing the multiple labels. When you specify the type to be set in Ludwig, the current expectation is that each row contains a string with the set of classes expressed as a whitespace-separated list. So, instead of ['Short Film', 'Documentary'] it should look like 'Short_Film Documentary'. Try it this way and let me know if it works. This also suggests that we may want to introduce some flexibility in the way sets are provided: maybe we should accept lists and sets of strings in addition to whitespace-separated strings.
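
For reference, here is a minimal sketch of that conversion in plain Python. The helper name `to_ludwig_set` is hypothetical (it is not part of Ludwig), and the sketch assumes each cell holds either an actual list of labels or the string form of one, such as "['Short Film', 'Documentary']". Spaces inside a label are replaced with underscores so that every label survives whitespace splitting as a single token.

```python
import ast


def to_ludwig_set(labels):
    """Turn a list of labels into one whitespace-separated string.

    Hypothetical helper, not part of Ludwig. Accepts a list/tuple/set of
    strings, or the string representation of a list (an assumption about
    how the column may have been stored in a CSV).
    """
    if isinstance(labels, str):
        # Parse "['Short Film', 'Documentary']" into a real Python list.
        labels = ast.literal_eval(labels)
    # Collapse whitespace inside each label into underscores, then join
    # the labels with single spaces.
    return " ".join("_".join(label.split()) for label in labels)


print(to_ludwig_set(["Short Film", "Documentary"]))
```

If the data lives in a pandas DataFrame, something like `df['genre_new'] = df['genre_new'].apply(to_ludwig_set)` before training should produce the expected column, assuming the column is named `genre_new` as in the model definition above.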