Is the fine-tuning data in the official format? Could you provide it directly? It is hard for an individual to obtain otherwise.

sinovation / ZEN

A BERT-based Chinese Text Encoder Enhanced by N-gram Representations

Apache License 2.0

Is the fine-tuning data in the official format? Could you provide it directly? It is hard for an individual to obtain otherwise. #8

Status: Closed (597477803 closed this issue 4 years ago)

597477803 commented 5 years ago

python run_token_level_classification.py \
    --task_name cwsmsra \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir data/msra_ner \
    --bert_model data/ZEN_pretrain_base_v0.1.0 \
    --max_seq_length 256 \
    --do_train \
    --do_eval \
    --train_batch_size 96 \
    --num_train_epochs 30 \
    --warmup_proportion 0.1

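For example, I would like to run the fine-tuning command above. For this task, cwsmsra, what format should the training data be in, and where can I conveniently get it?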

GuiminChen commented 5 years ago

The MSRA dataset for CWS is available from the official SIGHAN Bakeoff 2005 website (http://sighan.cs.uchicago.edu/bakeoff2005/). Due to copyright, we cannot provide it to you ourselves. For your reference, the data format looks like this:

扬 B
帆 E
远 B
东 E
做 S
与 S
中 B
国 E
合 B
作 E
的 S
先 B
行 E
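The tagging in the sample above can be produced from already-segmented text. The following is a minimal sketch, not code from the ZEN repository. It assumes the Bakeoff training file holds one sentence per line with words separated by whitespace, that interior characters of words longer than two characters get an `I` tag (the sample only shows `B`, `E`, and `S`), and that sentences are separated by a blank line in the output. The function names `tag_word`, `sentence_to_pairs`, and `convert` are hypothetical.

```python
def tag_word(word):
    """Return (character, tag) pairs for one segmented word."""
    if len(word) == 1:
        return [(word, "S")]  # single-character word
    # Assumption: interior characters are tagged "I" (not shown in the sample).
    tags = ["B"] + ["I"] * (len(word) - 2) + ["E"]
    return list(zip(word, tags))


def sentence_to_pairs(line):
    """Tag every character of a whitespace-segmented sentence."""
    pairs = []
    for word in line.split():  # split() also handles double spaces
        pairs.extend(tag_word(word))
    return pairs


def convert(lines):
    """Render sentences as 'char tag' lines, one blank line between sentences."""
    blocks = []
    for line in lines:
        pairs = sentence_to_pairs(line)
        if pairs:  # skip empty input lines
            blocks.append("\n".join(f"{ch} {tag}" for ch, tag in pairs))
    return "\n\n".join(blocks) + "\n" if blocks else ""


# Demo on the sentence from the sample, written in segmented form.
print(convert(["扬帆  远东  做  与  中国  合作  的  先行"]), end="")
```

To convert a whole file, read its lines with `open(path, encoding="utf-8")` and write the string returned by `convert` to the training file expected under `--data_dir`. The exact file names the script looks for are not stated in this thread, so check the repository before relying on any.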

597477803 commented 4 years ago

http://sighan.cs.uchicago.edu/bakeoff2005/

Thanks.