modelscope / AdaSeq

AdaSeq: An All-in-One Library for Developing State-of-the-Art Sequence Understanding Models
Apache License 2.0

[Question] AdaSeq for Sequence Labeling, How are the metrics calculated? What are the available models? #8

Closed iR00i closed 1 year ago

iR00i commented 1 year ago

What is your question?

Hello, I am working on the SemEval2023 MultiCoNER-II task.

First of all, thank you for sharing this amazing repo; it's saving me a lot of time and effort.

With regard to the metrics: I was training an xlm-roberta-large model on the English dataset and noticed that the F1 score was low while the accuracy was high (F1 = 0.38, accuracy ≈ 0.8). If you look at the English dataset for MultiCoNER-II, you'll see that the Other tag (aka 'O') is more frequent than any other tag by a large margin. Hence it's possible that the model(s) may overfit and just start predicting the tag O for most tokens in a sequence.

My question is: when calculating the metrics, do you take the O tags into account? In other words, do you mask the tokens in the target sequence whose gold tag is O when calculating the loss/accuracy/F1?

My next question has to do with the configurations we can control. Which transformer models can we use?

What have you tried?

multiconer2-en-exp#1.zip — the attached file contains the configuration I used. The model was early-stopped at epoch 7.

Code (if necessary)

No response

What's your environment?


huangshenno1 commented 1 year ago

Hello @iR00i, we are very happy that you are using AdaSeq.

For your 1st question: for sequence labeling tasks we simply use seqeval to calculate the metrics: accuracy, precision, recall, micro-F1, and macro-F1. Its evaluation method is the same as conlleval: accuracy is computed at the token level, so O is taken into account, while precision/recall/F1 are computed at the entity level, so O is not.
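
To make this concrete, here is a minimal sketch of how seqeval behaves on a toy prediction (the tag sequences below are made up for illustration):

import seqeval  # pip install seqeval
from seqeval.metrics import accuracy_score, f1_score

# One toy sentence: gold vs. predicted BIO tags.
y_true = [["O", "B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["O", "B-PER", "O", "O", "B-LOC"]]

# Accuracy is token-level, so the O tokens count: 4 of 5 tags match.
print(accuracy_score(y_true, y_pred))  # 0.8

# F1 is entity-level; O tokens never form entities, so they are ignored.
# Gold entities: PER(1,2) and LOC(4,4); predicted: PER(1,1) and LOC(4,4).
# Only LOC matches exactly, so precision = recall = F1 = 0.5.
print(f1_score(y_true, y_pred))  # 0.5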

For your 2nd question: you can use basically all transformers from the Hugging Face model hub. Just copy the model_id into model.embedder.model_name_or_path in the configuration file, e.g.:

model:
  type: sequence-labeling-model
  embedder:
    model_name_or_path: dslim/bert-base-NER
  dropout: 0.1
  use_crf: true
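
Training is then launched the same way as the other examples in the repo (the config filename here is just a placeholder for wherever you save the file above):

python scripts/train.py -c my_multiconer2_config.yaml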

Finally, some suggestions for getting a higher F1: we found that, without further information, it's hard to train a stable model on the original MultiCoNER-II datasets, as the contexts are short and the entity classes are fine-grained; you need to tune very, very carefully. You can try the retrieval-augmented dataset we provide, which can produce a much better result.

t2413 commented 1 year ago

I have a similar question: when simply running 'python scripts/train.py -c examples/bert_crf/configs/conllpp.yaml', the test results showed that F1 was really low while accuracy was high. I have tried a few YAMLs and it's the same problem. Why is this? Really appreciate it, thanks!

huangshenno1 commented 1 year ago

Hello, @t2413.

I just ran the same training command on conllpp with the master branch (python=3.7, torch=1.12.1, modelscope=1.3.0). The results seemed reasonable:

test: {
  "precision": 0.9450856942987058,
  "recall": 0.947737635917222,
  "f1": 0.9464098073555166,
  "accuracy": 0.9869494993000969
}

I have no idea what the reason for your bad performance is. Maybe you can check whether your requirements are installed correctly?
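
For example, a quick sanity check of the installed versions against the ones I used above (assuming you run this in the same Python environment; both torch and modelscope expose __version__):

import torch
import modelscope

# Compare against the versions that produced the numbers above.
print(torch.__version__)       # expected: 1.12.1
print(modelscope.__version__)  # expected: 1.3.0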