ireneisdoomed / stopReasons

Analysis on stop reasons
Apache License 2.0

Multi label full fine tuned model is not accurate #1

Closed ireneisdoomed closed 1 year ago

ireneisdoomed commented 1 year ago

After training the multi-label model on the whole dataset for 7 epochs, the classifier seems to be off most of the time, returning insignificant results.

Initially I thought this behaviour was because the testing version used a smaller dataset, but this is not the case.

Since I first implemented this in September, a couple of things have changed:

It is striking that the simple September version performs much better. Tomorrow I will rerun this pipeline while keeping the changed external dependencies (metadata and dataset), to make sure that the problem is indeed in the model.

ireneisdoomed commented 1 year ago

Update: old model has been retrained now—same parameters and evaluation metric as September.

Probabilities are not exactly the same, but very similar and reasonable given the amount of data and number of epochs. I think this confirms that the problem lies in the migration to a multi-label setup. Testing this will be my next step, particularly the evaluation metric and the loss function.
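For reference, the core difference in the loss function between the two setups can be sketched in a few lines of PyTorch (toy logits, made up for illustration): single-label training uses `CrossEntropyLoss` with an integer class index per example, while multi-label training uses `BCEWithLogitsLoss` with a multi-hot float target, which is what HF Transformers applies under `problem_type="multi_label_classification"`.

```python
import torch
from torch import nn

logits = torch.tensor([[2.0, -1.0, 0.5]])  # one example, three classes (toy values)

# Single-label: CrossEntropyLoss expects an integer class index per example.
single_label_loss = nn.CrossEntropyLoss()(logits, torch.tensor([0]))

# Multi-label: BCEWithLogitsLoss expects a multi-hot float vector per example.
multi_hot = torch.tensor([[1.0, 0.0, 1.0]])  # classes 0 and 2 both apply
multi_label_loss = nn.BCEWithLogitsLoss()(logits, multi_hot)
```

If the migrated pipeline still feeds integer labels into a multi-label head (or vice versa), the loss optimises the wrong objective, which would explain nonsensical probabilities.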

Metrics on the whole gold standard*:

{'accuracy': 0.5404376784015223, 'total_time_in_seconds': 814.9569559809752, 'samples_per_second': 5.158554656349409, 'latency_in_seconds': 0.1938527488061311}

*I wanted to calculate F1, but couldn't get HF's evaluate to work with the multi-class setting: ValueError: Target is multiclass but average='binary'. Please choose another average setting. I'll tackle this later.
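That ValueError comes from the default average="binary", which only makes sense for two classes. A minimal sketch with scikit-learn's f1_score (HF's evaluate f1 metric exposes the same average argument; labels here are toy values):

```python
from sklearn.metrics import f1_score

refs = [0, 2, 1, 2, 0]   # toy multi-class references
preds = [0, 1, 1, 2, 0]  # toy predictions

# average="binary" (the default) raises the ValueError for >2 classes;
# pick an explicit averaging strategy instead:
macro = f1_score(refs, preds, average="macro")  # unweighted mean of per-class F1
micro = f1_score(refs, preds, average="micro")  # global TP/FP/FN counts
```

For the multi-class case, "macro" treats every class equally regardless of support, while "micro" is dominated by the frequent classes (and equals accuracy for single-label multi-class data).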

These have been the metrics throughout the training:

(image: metrics logged during training)

Some examples:

Text: due to covid crisis the study had to be halted
Top 3 classes:
... Class: Covid19
... Probability: 0.21742285788059235
--------------------
... Class: Business_Administrative
... Probability: 0.12428395450115204
--------------------
... Class: Study_Design
... Probability: 0.09903866797685623

Text: 2011 Thailand flooding led to loss of GMP pharmacy, project delays, and further regulatory challenges.
Top 3 classes:
... Class: Business_Administrative
... Probability: 0.3845737874507904
--------------------
... Class: Negative
... Probability: 0.06874296069145203
--------------------
... Class: Logistics_Resources
... Probability: 0.0674576535820961

I had to do a workaround in order to use our current dataset version, which collects all the labels into an array, adapting the data to the single-label-only model.

This is the script; the data is subsetted (train=500, test=50) and trained for 3 epochs.

By looking at the Datasets docs, I think I could have applied the logic to explode the data with their map function. But exporting to pandas and loading into a Dataset again seemed more straightforward.

from datasets import Dataset, DatasetDict


def explode_label_columns(dataset, label2id):
    """Reproduce the Dataset format used before moving to a multi-label task."""
    ds = DatasetDict()
    for split in ["train", "test"]:
        pdf = (
            dataset[split]
            .to_pandas()
            .explode("label_descriptions")
            .rename({"label_descriptions": "label_description"}, axis=1)
            .reset_index()
        )
        pdf["label"] = pdf["label_description"].map(label2id)
        pdf = pdf[["text", "label", "label_description"]]
        ds[split] = Dataset.from_pandas(pdf, preserve_index=False)
    return ds
ireneisdoomed commented 1 year ago

I have produced a new model, multi-label and trained with PyTorch using the Trainer, mimicking the setup mentioned above (500 samples, 3 epochs).

It is uploaded here: https://huggingface.co/ireneisdoomed/stop_reasons_classificator_multilabel_pt_500n_3epochs

However, unfortunately, I still find it inaccurate... I'll need to think tomorrow about what I can do about it. On the bright side, it has been extremely fast to train (maybe too fast?). In ~10 minutes I had trained, evaluated, and pushed it.
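One thing worth double-checking at inference time (a pure-PyTorch sketch with made-up logits): a multi-label head should be read through a per-class sigmoid rather than a softmax, so each class gets an independent probability and the scores need not sum to 1.

```python
import torch

logits = torch.tensor([[2.0, -1.0, 0.5]])  # hypothetical logits for 3 classes

probs = torch.sigmoid(logits)    # independent per-class probabilities
predicted = (probs > 0.5).int()  # threshold each class separately
```

Reading multi-label logits through a softmax instead would force the classes to compete, which produces exactly the kind of flat, low top-probabilities seen in the examples above.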

ireneisdoomed commented 1 year ago

I have good results now!!!! The problem was that the small subset and number of epochs were simply not enough to get good results. On the full set, I'm getting sensible predictions. API available at https://huggingface.co/ireneisdoomed/stop_reasons_classificator_multilabel_pt?text=pharmacokinetics+was+not+satisfactory