jpWang / LiLT

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

Need help with reproducing CORD results #12

Closed · martinkozle closed this issue 1 year ago

martinkozle commented 2 years ago

I tried to reproduce the CORD results given in the paper, but I only managed to get an F1 score of ~0.62 on the test dataset. Is there any special pre-processing that is done to the CORD dataset for it to work with LiLT, or am I making a mistake?

Currently what I am doing is changing the labels in /LiLTfinetune/data/datasets/xfun.py to the labels of the CORD dataset, as well as changing the _generate_examples method to load from the CORD files.
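
For illustration, a minimal sketch of that kind of change (the features schema and the CORD label names here are assumptions for the example, not the repo's exact code; CORD defines roughly 30 entity classes, only a few of which are shown):

import datasets

# Replace the XFUN label list with CORD's. Only a few CORD-style labels are
# shown; the full dataset defines ~30 classes (menu.*, sub_total.*, total.*, ...).
_CORD_LABELS = [
    "O",
    "B-MENU.NM", "I-MENU.NM",
    "B-MENU.PRICE", "I-MENU.PRICE",
    "B-TOTAL.TOTAL_PRICE", "I-TOTAL.TOTAL_PRICE",
    # ... remaining CORD classes ...
]

# An assumed features schema along these lines, with _generate_examples
# yielding one record per receipt read from the CORD annotation files.
features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "tokens": datasets.Sequence(datasets.Value("string")),
        "bboxes": datasets.Sequence(datasets.Sequence(datasets.Value("int64"))),
        "labels": datasets.Sequence(datasets.features.ClassLabel(names=_CORD_LABELS)),
    }
)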

The config that I used:

{
    "model_name_or_path": "models/lilt-infoxlm-base",
    "tokenizer_name": "roberta-base",
    "output_dir": "output/xfun_ser",
    "do_train": "true",
    "do_eval": "true",
    "do_predict": "true",
    "lang": "en",
    "num_train_epochs": 10,
    "max_steps" : 2000,
    "per_device_train_batch_size": 1,
    "warmup_ratio": 0.1,
    "pad_to_max_length": "true",
    "return_entity_level_metrics": "true"
}

Is there another step that needs to be done for LiLT to work with a different dataset? How many epochs/steps were used to achieve the results in the paper?

Update: with 20,000 steps I managed to reach an overall F1 score of ~0.79, still far from the expected result. With 30,000 steps the score stays at ~0.79, so it is no longer increasing with the number of steps.

jpWang commented 2 years ago

Hi, in your config it seems that you use the roberta-base tokenizer with the lilt-infoxlm-base model, which is inconsistent; you need to use the xlm-roberta-base tokenizer. Also, per_device_train_batch_size set to 1 in your config is too small.
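
A quick way to see the mismatch (a minimal check with transformers; the model itself isn't needed, only the two tokenizers):

from transformers import AutoTokenizer

# roberta-base uses an English BPE vocabulary; lilt-infoxlm-base's text stream
# comes from InfoXLM, which shares xlm-roberta-base's SentencePiece vocabulary.
tok_wrong = AutoTokenizer.from_pretrained("roberta-base")
tok_right = AutoTokenizer.from_pretrained("xlm-roberta-base")

s = "TOTAL 9,500"
print(tok_wrong.tokenize(s))  # pieces/ids from a vocabulary the model never saw
print(tok_right.tokenize(s))  # pieces/ids matching the pretrained embeddings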

martinkozle commented 2 years ago

Hi, we did more runs, this time with xlm-roberta-base as the tokenizer and per_device_train_batch_size increased to 12 (as much as we could fit in 24 GB of VRAM). Here is the exact config we tried this time:

{
    "model_name_or_path": "models/lilt-infoxlm-base",
    "tokenizer_name": "xlm-roberta-base",
    "output_dir": "output/xfun_ser",
    "do_train": "true",
    "do_eval": "true",
    "do_predict": "true",
    "lang": "en",
    "num_train_epochs": 10,
    "max_steps" : 20000,
    "per_device_train_batch_size": 12,
    "warmup_ratio": 0.1,
    "pad_to_max_length": "true",
    "return_entity_level_metrics": "true",
    "save_total_limit": 1
}

With this setup we got an F1 score of ~0.828, which is still lower than what you reported (0.9616), and lower than all of the other architectures as well. Do you have the exact config that you used for CORD? And did you do any special preprocessing on the dataset?

martinkozle commented 1 year ago

We managed to find the issue and improve the F1 score to ~0.94.

subake commented 1 year ago

Hello! @martinkozle, I faced the same problem. Can you give some tips or share your config? It would be very much appreciated.

martinkozle commented 1 year ago

> Hello! @martinkozle, I faced the same problem. Can you give some tips or share your config? It would be very much appreciated.

Sorry for the late response. I doubt that you are having the same issue we had. Our problem was with words that get split into multiple tokens: we were only feeding the first sub-token of each word to the model, instead of all of them.
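
For anyone hitting the same thing, here is a hedged sketch of the alignment (using a fast tokenizer's word_ids(); the scheme shown, B- on the first sub-token and I- on continuations, is one common convention, not necessarily our exact code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

words = ["Chocolate", "Milkshake", "9,500"]               # words from one receipt
word_labels = ["B-MENU.NM", "I-MENU.NM", "B-MENU.PRICE"]  # illustrative CORD-style tags

enc = tokenizer(words, is_split_into_words=True)

labels, prev = [], None
for word_id in enc.word_ids():
    if word_id is None:                  # special tokens (<s>, </s>)
        labels.append(-100)              # ignored by the loss
    elif word_id != prev:                # first sub-token of a word keeps its tag
        labels.append(word_labels[word_id])
    else:                                # continuation sub-tokens get the I- tag
        labels.append(word_labels[word_id].replace("B-", "I-", 1))
    prev = word_id

# The word's bounding box is likewise repeated for every one of its sub-tokens.
print(list(zip(enc.tokens(), labels)))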