sergioburdisso / pyss3

A Python package implementing a new interpretable machine learning model for text classification (with visualization tools for Explainable AI :octocat:)
https://pyss3.readthedocs.io
MIT License

Multilabel Classification Dataset Loading #6

Closed angrymeir closed 4 years ago

angrymeir commented 4 years ago

Hey @sergioburdisso,

for multilabel classification, the file structure described in the topic categorization tutorial is not efficient, since a text with multiple labels has to be stored in multiple files. My current approach is to write the texts to one file, one per line, and the respective labels to another file, also one per line.

# Writing Data
dataset = {"Text 1": ["label1", "label2"],
           "Text 2": ["label2", "label3"],
           "Text 3": ["label1"]}

# Open both files once in append mode; line i of labels.txt
# holds the ';'-separated labels for line i of text.txt.
with open('text.txt', 'a') as text_file, open('labels.txt', 'a') as label_file:
    for text, labels in dataset.items():
        text_file.write(text + '\n')
        label_file.write(';'.join(labels) + '\n')

The result is the following:

# cat text.txt
Text 1
Text 2
Text 3

# cat labels.txt
label1;label2
label2;label3
label1

It would be great if util.Dataset.load_from_files could be adjusted to also support this! But I'm also open to other suggestions on how to tackle this problem :)
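For illustration, reading the two line-aligned files back could look roughly like the sketch below. The function name read_linewise_dataset and its parameters are hypothetical and not part of PySS3; the sketch also assumes that an empty line in the labels file means "no labels".

```python
def read_linewise_dataset(text_path, labels_path, sep_label=";"):
    """Read documents and labels from two line-aligned files (hypothetical helper)."""
    with open(text_path, encoding="utf-8") as text_file:
        docs = [line.rstrip("\n") for line in text_file]

    with open(labels_path, encoding="utf-8") as label_file:
        lines = [line.rstrip("\n") for line in label_file]

    # Assumption: an empty line means the document has no labels at all.
    labels = [line.split(sep_label) if line else [] for line in lines]

    if len(docs) != len(labels):
        raise ValueError("text and label files must have the same number of lines")
    return docs, labels
```

Line i of the returned labels list then belongs to line i of the returned documents list.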

Thanks for your hard work!

sergioburdisso commented 4 years ago

Hi @angrymeir! First of all, thanks for your interest in this project. Yes, I agree that the file structure in the topic categorization tutorial is not well suited to multilabel classification, since it follows the classic single-label dataset structure. I don't have much previous experience with multilabel classification, which is one of the main reasons I haven't implemented full support for it so far. Fortunately, the time has now come to implement it.

What do you think about having two separate classes for loading datasets from disk? One for "standard" single-label datasets (Dataset) and another for multilabel ones (MultiLabelDataset). For instance, for loading a multilabel dataset, we could use MultiLabelDataset.load_from_files.

Do you think we should provide support for another format/structure too?

For instance, having a file holding document name and category label pairs, like so:

doc1 label1
doc1 label2
doc2 label2
doc2 label3
doc3 label1
...

And a folder containing the actual documents. In that case, we should let the user specify the file where these pairs are stored (and the separator/delimiter used: tab? comma? etc.) as well as the path to the folder containing the actual documents.
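A rough sketch of how such a pairs file plus a document folder could be loaded is shown below. The helper read_pairs_dataset and its arguments are hypothetical, not an existing PySS3 function, and the sketch assumes each document name in the pairs file is a file name inside the folder.

```python
import os
from collections import defaultdict


def read_pairs_dataset(docs_dir, pairs_path, sep=None):
    """Load documents from docs_dir using a 'doc<sep>label' pairs file (hypothetical helper)."""
    labels_by_doc = defaultdict(list)
    with open(pairs_path, encoding="utf-8") as pairs_file:
        for line in pairs_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # sep=None splits on any run of whitespace (spaces or tabs).
            doc_name, label = line.split(sep, 1)
            labels_by_doc[doc_name].append(label.strip())

    docs, labels = [], []
    for doc_name in sorted(labels_by_doc):
        with open(os.path.join(docs_dir, doc_name), encoding="utf-8") as doc_file:
            docs.append(doc_file.read())
        labels.append(labels_by_doc[doc_name])
    return docs, labels
```

One consequence of this layout is that a document without any label never appears in the pairs file, so it would be skipped unless the folder is scanned separately.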

The same should apply to your approach. The user should be able to provide the separator for the labels in labels.txt file, which in your case is a semicolon (;).

What do you think the load_from_files arguments should be? What do you think about this approach:

x_train, y_train = MultiLabelDataset.load_from_files(docs="a file or folder", labels="a file", sep=";")

If docs is a folder, then the labels file should have a format like the one I described above; if it is a file, it should have your structure. The sep argument defaults to "\s" (whitespace) if docs is a folder and to ";" if it is a file (or should it be a comma, like in a CSV?).
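The default-separator rule just described could be sketched as follows. The helper default_label_sep is purely illustrative and not part of PySS3; it uses None to stand in for the "\s" default, since str.split(None) splits on any run of whitespace.

```python
import os


def default_label_sep(docs_path):
    """Choose a default label separator from the kind of path given (hypothetical helper)."""
    if os.path.isdir(docs_path):
        # Folder of documents: the labels file holds "doc label" pairs,
        # so split on whitespace (None makes str.split do exactly that).
        return None
    # Single file of documents: labels are listed per line, separated by ';'.
    return ";"
```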

Can you recommend any particular dataset to work with while implementing full multilabel support? This dataset would also be the one used for the tutorial introducing multilabel support, similar to the ones that are already available. I'm currently using a Kaggle dataset for toxic comment classification.

sergioburdisso commented 4 years ago

I just realized we would need two sep arguments to let the user specify the separator used for labels and also the one used for documents. Otherwise, a document containing newlines would be split into several separate documents, so it is better to let the user specify which separator/delimiter indicates where each document begins/ends (although it could be '\n' by default). Something like:

x_train, y_train = MultiLabelDataset.load_from_files(
    docs="the file or folder where the documents are",
    labels="the file containing the labels",
    sep_label="the separator used for labels e.g. ;",
    sep_doc="the separator used for documents e.g. \n"
)

What do you think about that?
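A small, self-contained illustration of why a separate document separator matters; the sample text and the "\n---\n" separator are made up for this example:

```python
# Two documents stored in one string; the first one spans two lines.
raw = "First line of doc 1\nsecond line of doc 1\n---\nOnly line of doc 2"

# Splitting on '\n' breaks the two-line document apart
# and also yields the '---' line as if it were a document.
by_newline = raw.split("\n")

# Splitting on a dedicated separator keeps each document whole.
by_separator = raw.split("\n---\n")

print(len(by_newline), len(by_separator))
```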

angrymeir commented 4 years ago

Hey @sergioburdisso,

MultiLabelDataset.load_from_files vs Dataset.load_multilabel_from_file: I think, for consistency reasons, the decision whether to use a different class (MultiLabelDataset) or an additional method (e.g. load_multilabel_from_file) in the class Dataset depends on how multilabel data should be treated in this project in general.
Would you also create a different class for multilabel evaluation, or rather add the functionality to the existing class?

Format/Structure: Assuming that catA corresponds to a combination of labels like:

toxic=-1, severe_toxic=0, obscene=-1, threat=1, insult=-1, identity_hate=1

This would imply that there are 3^6 possible categories (in the toxic comment dataset), which just doesn't seem feasible to annotate...
Would a combination of both approaches make sense?
Meaning, having one file containing either the texts or links to the documents, and another file containing the labels as described in my initial suggestion?
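To make the 3^6 figure concrete, here is a short count of the combinations, assuming each of the six labels can take one of the three values -1, 0 or 1 as in the example above:

```python
from itertools import product

n_labels = 6         # labels in the toxic comment dataset
states = (-1, 0, 1)  # assumed possible values per label

# Every distinct combination would have to be its own category.
combinations = list(product(states, repeat=n_labels))
print(len(combinations))  # 729, i.e. 3 ** 6
```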

Giving the user the option to specify both delimiters absolutely makes sense! I also agree with the default parameters.

Dataset: We're currently working with a parsed version of SemEval 2016 Task 5; I can provide you with the dataset if you would like. The challenge with this dataset is that the number of labels for a given text ranges from 0 to 8.

sergioburdisso commented 4 years ago

:blush: Following your suggestion, I've added a method called "load_from_files_multilabel" to carry out this task, supporting both dataset structures/formats. I've decided to put "multilabel" at the end so that, as with classify and classify_multilabel, any method related to multilabel classification has "_multilabel" as a suffix. This way it will be easier for users to remember (and more consistent).

By catA I meant the label for category A, I'll edit my message to clarify this point (and to match my example with yours).

Now, following your example, you should be able to load your dataset simply by:

x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt",
    "path/to/labels.txt"
)

In case you need a different separator for labels, for instance, using commas, you could use the sep_label argument as follows:

x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt", "path/to/labels.txt",
    sep_label=","
)

And, finally, in case you need to use a document separator other than '\n', for instance "\n---\n", you can use the sep_doc argument as follows:

x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt", "path/to/labels.txt",
    sep_doc="\n---\n"
)

More details are given in the API documentation. :+1:
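For completeness, a dataset using such custom separators could be written to disk along these lines. This is only a sketch: write_multilabel_files is a hypothetical helper, not part of PySS3, and it assumes the labels file keeps one line per document regardless of the document separator.

```python
def write_multilabel_files(dataset, text_path, labels_path,
                           sep_label=";", sep_doc="\n"):
    """Write documents to one file and their labels to another (hypothetical helper)."""
    with open(text_path, "w", encoding="utf-8") as text_file:
        # Documents are joined with sep_doc, so they may contain newlines
        # as long as they never contain sep_doc itself.
        text_file.write(sep_doc.join(dataset.keys()))

    with open(labels_path, "w", encoding="utf-8") as label_file:
        # Assumption: one line of labels per document, in the same order.
        label_file.write("\n".join(sep_label.join(labels)
                                   for labels in dataset.values()))
```

Files written with, say, sep_label="," and sep_doc="\n---\n" would then be loaded by passing the same two values to load_from_files_multilabel.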

Dataset: SemEval 2016 Task 5 sounds cool. Feel free to send me the dataset; it will probably be much better for a tutorial and a Live Demo than the one I'm using now (toxic comments :poop:).