google-research / lasertagger

Apache License 2.0

How can I use this model on Chinese dataset? #8

Open f617452296 opened 4 years ago

f617452296 commented 4 years ago

Also, can this model be helpful on a Chinese dataset?

ekQ commented 4 years ago

We haven't looked into this, but you could try it using BERT-Base, Chinese to initialize the model.

qiuhuiGithub commented 4 years ago

I tested the model on a Chinese GEC task and it works fine.

f617452296 commented 4 years ago

May I have your email address to ask some questions?


qiuhuiGithub commented 4 years ago

May I have your email address to ask some questions?

You can use any Chinese BERT model by simply replacing the BERT path, and it works fine.

ekQ commented 4 years ago

I tested the model on a Chinese GEC task and it works fine.

Good to know this!

varepsilon commented 4 years ago

I tested the model on a Chinese GEC task and it works fine.

Is it a public dataset? If so, could you share a link?

qiuhuiGithub commented 4 years ago

Is it a public dataset? If so, could you share a link?

http://tcci.ccf.org.cn/conference/2018/taskdata.php — the second task is the GEC task.

f617452296 commented 4 years ago

http://tcci.ccf.org.cn/conference/2018/taskdata.php — the second task is the GEC task.

Could you please tell me how to run this model on the GEC task, e.g. step 1 (Phrase Vocabulary Optimization) and step 2 (Converting Target Texts to Tags)?

qiuhuiGithub commented 4 years ago

Could you please tell me how to run this model on the GEC task, e.g. step 1 (Phrase Vocabulary Optimization) and step 2 (Converting Target Texts to Tags)?

First, I suggest you read run_wikisplit_experiment.sh in the project. You can run LaserTagger by adapting that script. Here is an example:

  • Convert your data into the WikiSplit format, i.e. tab-separated "source \t target" pairs such as "I like you \t I love you".
  • Change all the paths in the script to your own.
  • Change vocab_size in configs/lasertagger_config.json, because the vocabulary size of Chinese BERT is different.
  • Run the script step by step.

Best wishes.
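run_wikisplit_experiment.sh consumes tab-separated source/target pairs. A minimal sketch of that conversion step (the function names and file handling here are my own illustration, not part of the LaserTagger repo):

```python
def to_wikisplit_line(source: str, target: str) -> str:
    """Join one source/target pair into a WikiSplit-style TSV line."""
    return f"{source.strip()}\t{target.strip()}"

def write_wikisplit(pairs, path: str) -> None:
    """Write an iterable of (source, target) pairs to a TSV file,
    one "source<TAB>target" line per pair."""
    with open(path, "w", encoding="utf-8") as f:
        for source, target in pairs:
            f.write(to_wikisplit_line(source, target) + "\n")
```

For example, `write_wikisplit([("I like you", "I love you")], "train.tsv")` produces one line, `I like you<TAB>I love you`, matching the format the preprocessing scripts expect.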

f617452296 commented 4 years ago

First, I suggest you read run_wikisplit_experiment.sh in the project. You can run LaserTagger by adapting that script. Here is an example:

  • Convert your data into the WikiSplit format, i.e. tab-separated "source \t target" pairs such as "I like you \t I love you".
  • Change all the paths in the script to your own.
  • Change vocab_size in configs/lasertagger_config.json, because the vocabulary size of Chinese BERT is different.
  • Run the script step by step.

Best wishes.

It helps a lot! Thank you!
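On the vocab_size point from the steps above: BERT-Base, Chinese ships a 21128-entry vocabulary (versus 30522 for the English uncased model), and the right value can be read straight from the checkpoint's vocab.txt (one token per line). A small sketch, with the paths as assumptions:

```python
import json

def patch_vocab_size(config_path: str, vocab_path: str) -> int:
    """Set vocab_size in a LaserTagger/BERT-style JSON config to the
    line count of the checkpoint's vocab.txt, and return that count."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab_size = sum(1 for _ in f)
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    config["vocab_size"] = vocab_size
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return vocab_size

# Example call (paths are hypothetical):
# patch_vocab_size("configs/lasertagger_config.json",
#                  "chinese_L-12_H-768_A-12/vocab.txt")
```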

f617452296 commented 4 years ago

http://tcci.ccf.org.cn/conference/2018/taskdata.php — the second task is the GEC task.

By the way, I'd like to know whether the training data from http://tcci.ccf.org.cn/conference/2018/taskdata.php needs to be segmented into words, or whether I can feed whole sentences into the model? Thanks!

qiuhuiGithub commented 4 years ago

By the way, I'd like to know whether the training data from http://tcci.ccf.org.cn/conference/2018/taskdata.php needs to be segmented into words, or whether I can feed whole sentences into the model? Thanks!

Eh, the input to Chinese BERT is separate words, so you should cut the sentence into separate words.
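Whether you segment into words (e.g. with jieba, mentioned later in this thread) or into single characters, the preprocessing scripts expect whitespace between tokens. As one possible sketch, here is a stdlib-only character-level splitter that keeps non-CJK runs (Latin words, digits) intact; a word segmenter would be a drop-in alternative:

```python
def pretokenize(sentence: str) -> str:
    """Insert spaces so every CJK character becomes its own token,
    while runs of non-CJK characters stay together."""
    tokens, buf = [], []
    for ch in sentence:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
            if buf:
                tokens.append("".join(buf))
                buf = []
            tokens.append(ch)
        elif ch.isspace():
            if buf:
                tokens.append("".join(buf))
                buf = []
        else:
            buf.append(ch)
    if buf:
        tokens.append("".join(buf))
    return " ".join(tokens)

print(pretokenize("我喜欢BERT模型"))  # 我 喜 欢 BERT 模 型
```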

Ivy-C-85 commented 4 years ago

Eh, the input to Chinese BERT is separate words, so you should cut the sentence into separate words.

Hi, I also tested the GEC task, but my model didn't work well: it didn't actually "correct" anything, it just deleted every difference, and even some identical parts, between the source and target texts. I used JIEBA to cut my sentences and thought everything was set up just fine, but the results were pretty bad. Could you please tell me whether you ran into the same problem, and which score you used?