Hi @angrymeir! First of all, thanks for being interested in this project. Yeah, I agree that the file structure in the topic categorization tutorial is not well suited for multilabel classification; it follows the classic single-label dataset structure. I don't have a lot of previous experience working with multilabel classification, which is one of the main reasons I didn't implement full support for it in the first place. Fortunately, now the time has come to implement full support for multilabel classification.
What do you think about having two separate classes for loading datasets from disk? One for "standard" single-label datasets (`Dataset`) and another for multilabel (`MultiLabelDataset`). For instance, for loading a dataset, we could use `MultiLabelDataset.load_from_files`.
Do you think we should provide support for another format/structure too?
For instance, having a file holding document name and category label pairs, like so:
```
doc1 label1
doc1 label2
doc2 label2
doc2 label3
doc3 label1
...
```
And a folder containing the actual documents. That being the case, we should let the user specify the file where these pairs are stored (and also the separator/delimiter used: tab? comma? etc.) and the path to the folder where the actual documents are.
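Just to make the format concrete, here is a minimal sketch of the parsing this structure would imply (the helper name `read_label_pairs` and its defaults are hypothetical, not actual project code):

```python
from collections import defaultdict

def read_label_pairs(path, sep="\t"):
    """Parse a file of "document label" pairs into {document: [labels]}."""
    doc_labels = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            doc, label = line.split(sep, 1)
            doc_labels[doc].append(label)
    return doc_labels

# e.g. read_label_pairs("labels.txt", sep=" ")
# -> {"doc1": ["label1", "label2"], "doc2": ["label2", "label3"], "doc3": ["label1"]}
```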
The same should apply to your approach: the user should be able to provide the separator for the labels in the labels.txt file, which in your case is a semicolon (;).
What do you think the `load_from_files` arguments should be? What do you think about this approach:

```python
x_train, y_train = MultiLabelDataset.load_from_files(docs="a file or folder", labels="a file", sep=";")
```
If `docs` is a folder, then the labels file should have the format I described above; if it is a file, it should have your structure. The `sep` argument is `"\s"` by default if `docs` is a folder and `";"` if it is a file (or should it be a comma, like in a CSV?).
Can you recommend any particular dataset to work with while implementing full multilabel support? This dataset will also be the one used for the tutorial introducing multilabel support, similar to the ones that are already available. I'm currently using a Kaggle dataset for toxic comment classification.
I just realized we would need two `sep` arguments, to let the user specify the separator used for labels and also the one used for documents. Since documents containing newlines would otherwise be considered separate documents, it is better to let the user specify which separator/delimiter indicates where each document begins/ends (although it could be `'\n'` by default). Something like:
```python
x_train, y_train = MultiLabelDataset.load_from_files(
    docs="the file or folder where the documents are",
    labels="the file containing the labels",
    sep_label="the separator used for labels e.g. ;",
    sep_doc="the separator used for documents e.g. \n"
)
```
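For concreteness, with those example separators the two files could look like this (hypothetical contents, just to illustrate how the lines of the two files align):

```
# docs file (sep_doc="\n": one document per line)
text of document 1
text of document 2

# labels file (sep_label=";": the i-th line holds the labels of the i-th document)
label1;label2
label3
```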
What do you think about that?
Hey @sergioburdisso,
**`MultiLabelDataset.load_from_files` vs `Dataset.load_multilabel_from_file`**

I think, for consistency reasons, the decision whether to use a different class (`MultiLabelDataset`) or an additional method (e.g. `load_multilabel_from_file`) in the class `Dataset` depends on how multilabel data should be treated in general in this project.
Would you also create a different class for multilabel evaluation, or rather add the functionality to the existing class?
**Format/Structure**
Assuming that `catA` corresponds to a combination of labels like `toxic=-1, severe_toxic=0, obscene=-1, threat=1, insult=-1, identity_hate=1`, this would imply that there are 3^6 = 729 possible categories in the toxic comment dataset (each of the six labels can take one of three values), which just doesn't seem feasible to annotate...
Would a combination of both approaches make sense? Meaning, having one file containing either the text or the links to the documents, and another file that contains the labels, as described in my initial suggestion?
Giving the user the option to specify both delimiters absolutely makes sense! I also agree about the default parameters.
**Dataset**
We're currently working with a parsed version of SemEval 2016 Task 5; I can provide you with the dataset if you'd like. The challenge with this dataset is that the number of labels for a given text is in the range [0..8].
:blush: Following your suggestion, I've added a method called `load_from_files_multilabel` to carry out this task, supporting both dataset structures/formats. I've decided to put "multilabel" at the end so that, as with `classify` and `classify_multilabel`, any multilabel version of a method will have "_multilabel" as a suffix; this way it will be easier for users to remember (and more consistent).
By `catA` I meant the label for category A; I'll edit my message to clarify this point (and to match my example with yours).
Now, following your example, you should be able to load your dataset simply by:

```python
x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt",
    "path/to/labels.txt"
)
```
In case you need a different separator for labels, for instance commas, you could use the `sep_label` argument as follows:

```python
x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt", "path/to/labels.txt",
    sep_label=","
)
```
And finally, in case you need to use a document separator other than `'\n'`, for instance `"\n---\n"`, you can use the `sep_doc` argument as follows:

```python
x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt", "path/to/labels.txt",
    sep_doc="\n---\n"
)
```
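In that case, the text file would be expected to look something like this (illustrative contents):

```
text of document 1, which can
now span multiple lines...
---
text of document 2...
---
text of document 3...
```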
More details are given in the API documentation. :+1:
**Dataset**
SemEval 2016 Task 5 sounds cool, feel free to send me the dataset; it'll probably be much better for a tutorial and a Live Demo than the one I'm using now (toxic comments :poop:).
Hey @sergioburdisso,
for multilabel classification, the file structure described in the topic categorization tutorial is not efficient, since text related to multiple labels has to be stored in multiple files. My current approach is to write the text to one file line-wise and the respective labels to another file, also line-wise.
The result is the following:
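For example (with a semicolon as the label separator):

```
# text.txt (one document per line)
text of document 1
text of document 2

# labels.txt (labels for the corresponding line, separated by ";")
label1;label2
label3
```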
It would be great if `util.Dataset.load_from_files` could be adjusted to also support this! But I'm also open to other suggestions on how to tackle this problem :)
Thanks for your hard work!