microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

How to reproduce the zero-shot transfer results of LayoutXLM on XFUN? #453

Closed jpWang closed 3 years ago

jpWang commented 3 years ago

Thanks for your excellent work on LayoutXLM. I would like to ask how to reproduce the zero-shot transfer results of LayoutXLM on XFUN.

I have converted FUNSD into XFUN format and trained LayoutXLM-base following https://github.com/microsoft/unilm/tree/master/layoutxlm#fine-tuning-for-semantic-entity-recognition.

Then I tested the fine-tuned model on Chinese, but the result is:

[screenshot of evaluation results]

Thanks again for your time and patience.

jpWang commented 3 years ago

A further question: how can I reproduce the multitask fine-tuning results on XFUN (fine-tuning on all 8 languages, testing on language X)? I trained LayoutXLM-base for 8000 steps, and the result on Chinese is: [screenshot of evaluation results]

DRRV commented 3 years ago

Hi, I have the same issues for both tasks, especially when training with all languages: see #440. I have tried various numbers of steps and learning rates, but I still fail to reproduce the results.

DRRV commented 3 years ago

@jpWang : did you succeed in reproducing the results?

wolfshow commented 3 years ago

@DRRV We have been busy with other things recently; we will follow up next week.

DRRV commented 3 years ago

No hurry on my side. Thanks!

jpWang commented 3 years ago

@jpWang : did you succeed in reproducing the results?

I have reproduced the results on the SER task, but not on the RE task so far, so I will try more optimization strategies on RE and close this issue.

The reason I couldn't reproduce it before is that there were some bugs in the dataset construction part of my code. In the end, I changed the following lines in xfun.py:

                tokenized_inputs = self.tokenizer(
                    line["text"],
                    add_special_tokens=False,
                    return_offsets_mapping=True,
                    return_attention_mask=False,
                )

into:

                if '/en' in filepath[0]:
                    # English data converted from FUNSD: tokenize the concatenation of the
                    # word-level texts so the offsets line up with line['words'].
                    tokenized_inputs = self.tokenizer(
                        ' '.join([q['text'] for q in line['words']]),
                        add_special_tokens=False,
                        return_offsets_mapping=True,
                        return_attention_mask=False,
                    )
                else:
                    # Original XFUN languages: keep tokenizing the line-level text.
                    tokenized_inputs = self.tokenizer(
                        line["text"],
                        add_special_tokens=False,
                        return_offsets_mapping=True,
                        return_attention_mask=False,
                    )

after converting FUNSD into XFUN format, since some words are missing from line['words'] compared with line["text"]. I just followed the official optimization strategy for the SER task to reproduce the zero-shot transfer results, and changed the max steps to 8000 to reproduce the multitask fine-tuning results.
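
For anyone checking the same thing, a quick way to spot that mismatch in a converted file is a short script like the one below. This is only a sketch: the file path and the top-level "documents" / per-document "document" keys are assumptions based on the released XFUN JSON layout, so adjust them to your own conversion.

    import json

    # Hypothetical path to the FUNSD data converted into XFUN format;
    # point this at wherever your converted file actually lives.
    path = "en.train.json"

    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    mismatches = 0
    for doc in data["documents"]:
        for line in doc["document"]:
            joined = " ".join(w["text"] for w in line["words"])
            if joined != line["text"]:
                mismatches += 1
                print(repr(line["text"]), "vs", repr(joined))

    print(mismatches, "mismatched lines")

Any line it reports is one where tokenizing line["text"] yields offsets that no longer line up with line['words'], which is exactly what the '/en' branch above works around.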

congphu2511995 commented 2 years ago

@jpWang How did you convert FUNSD into XFUN format? I tried to convert it but didn't succeed.

jpWang commented 2 years ago

@jpWang How did you convert FUNSD into XFUN format? I tried to convert it but didn't succeed.

You can access the data from https://github.com/jpWang/LiLT#datasets.
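
If you would rather do the conversion yourself instead of downloading it, a rough sketch of the direction is below. It assumes the standard FUNSD annotation layout (a top-level "form" list whose entries carry "id", "text", "box", "words", "label" and "linking") and writes out the fields that xfun.py reads; the directory names, the ".png" extension, the output file name, and the "uid" / "img" / "version" / "split" fields are my assumptions here, not the exact script used for the results above.

    import glob
    import json
    import os

    from PIL import Image

    def funsd_to_xfun(ann_dir, img_dir, out_path, lang="en", split="train"):
        """Convert FUNSD-style annotation JSONs into one XFUN-style JSON file."""
        documents = []
        for i, ann_file in enumerate(sorted(glob.glob(os.path.join(ann_dir, "*.json")))):
            with open(ann_file, encoding="utf-8") as f:
                form = json.load(f)["form"]
            stem = os.path.splitext(os.path.basename(ann_file))[0]
            img_name = stem + ".png"  # FUNSD page images are PNG files
            width, height = Image.open(os.path.join(img_dir, img_name)).size
            lines = [
                {
                    "id": item["id"],
                    "text": item["text"],
                    "box": item["box"],
                    "words": item["words"],    # list of {"text": ..., "box": ...}
                    "label": item["label"],    # question / answer / header / other
                    "linking": item["linking"],
                }
                for item in form
            ]
            documents.append({
                "id": f"{lang}_{split}_{i}",
                "uid": stem,  # placeholder unique id
                "img": {"fname": img_name, "width": width, "height": height},
                "document": lines,
            })
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(
                {"lang": lang, "version": "0.1", "split": split, "documents": documents},
                f,
                ensure_ascii=False,
            )

    # Example call (paths are placeholders):
    # funsd_to_xfun("funsd/training_data/annotations", "funsd/training_data/images", "en.train.json")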

mengxj08 commented 2 years ago

@jpWang Same here. I still can't reproduce the RE task so far (I managed it on SER). Did you find any further optimization strategies for RE later? Thanks in advance.