shibing624 / pycorrector

pycorrector is a toolkit for text error correction. It applies Kenlm, T5, MacBERT, ChatGLM3, Qwen2.5, and other models to error-correction scenarios, ready to use out of the box.
https://www.mulanai.com/product/corrector/
Apache License 2.0

MacBERT pretraining problem #323

Closed banbsyip closed 10 months ago

banbsyip commented 2 years ago

Context

- OS: Linux
- CPU: not specified
- pycorrector version (import pycorrector; pycorrector.__version__): 0.4.6

I followed the MacBERT readme and ran train.py, but the trained model performs poorly. Earlier issues say this happens when the pretrained model is not loaded, so how do I download the pretrained model? The MacBERT readme gives no hint about this and offers no explanation.

Training results:
Sentence Level: acc:0.591818, precision:0.674074, recall:0.335175, f1:0.447724

/macbert/output/macbert4csc/config.json:

{
  "_name_or_path": "hfl/chinese-macbert-base",
  "architectures": ["BertForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.21.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 21128
}

Run log:

2022-09-19 16:54:50,036 - mainModule - INFO - Namespace(config_file='train_macbert4csc.yml', opts=[])
2022-09-19 16:54:50,037 - mainModule - INFO - Loaded configuration file train_macbert4csc.yml
2022-09-19 16:54:50,037 - mainModule - INFO - MODEL:
  BERT_CKPT: "hfl/chinese-macbert-base"
  DEVICE: "cuda"
  NAME: "macbert4csc"
  GPU_IDS: [0]
  # [loss_coefficient]
  HYPER_PARAMS: [0.3]
  # WEIGHTS: "output/macbert4csc/epoch=6-val_loss=0.07.ckpt"
  WEIGHTS: ""

DATASETS:
  TRAIN: "output/train.json"
  VALID: "output/dev.json"
  TEST: "output/test.json"

SOLVER:
  BASE_LR: 5e-5
  WEIGHT_DECAY: 0.01
  BATCH_SIZE: 32
  MAX_EPOCHS: 5
  ACCUMULATE_GRAD_BATCHES: 4

OUTPUT_DIR: "output/macbert4csc"
MODE: ["train", "test"]

2022-09-19 16:54:50,037 - mainModule - INFO - Running with config:
DATALOADER:
  NUM_WORKERS: 4
DATASETS:
  TEST: output/test.json
  TRAIN: output/train.json
  VALID: output/dev.json
INPUT:
  MAX_LEN: 512
MODE: ['train', 'test']
MODEL:
  BERT_CKPT: hfl/chinese-macbert-base
  DEVICE: cuda
  GPU_IDS: [0]
  HYPER_PARAMS: [0.3]
  NAME: macbert4csc
  NUM_CLASSES: 10
  WEIGHTS:
OUTPUT_DIR: output/macbert4csc
SOLVER:
  ACCUMULATE_GRAD_BATCHES: 4
  BASE_LR: 5e-05
  BATCH_SIZE: 32
  BIAS_LR_FACTOR: 2
  CHECKPOINT_PERIOD: 10
  DELAY_ITERS: 0
  ETA_MIN_LR: 3e-07
  GAMMA: 0.9999
  INTERVAL: step
  LOG_PERIOD: 100
  MAX_EPOCHS: 5
  MAX_ITER: 10
  MOMENTUM: 0.9
  OPTIMIZER_NAME: AdamW
  SCHED: WarmupExponentialLR
  STEPS: (10,)
  WARMUP_EPOCHS: 1024
  WARMUP_FACTOR: 0.01
  WARMUP_ITERS: 2
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.01
  WEIGHT_DECAY_BIAS: 0
TASK:
  NAME: CSC
TEST:
  BATCH_SIZE: 8
  CKPT_FN:
2022-09-19 16:54:50,037 - mainModule - INFO - load model, model arch: macbert4csc
2022-09-19 16:55:01,481 - mainModule - INFO - train model ...
2022-09-19 17:29:59,164 - mainModule - INFO - train model done.

shibing624 commented 2 years ago

I don't see what the problem is. Why not just use shibing624/macbert4csc-base-chinese?
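
For reference, a minimal usage sketch of that released model with the Hugging Face transformers API, following the pattern shown on its model card (argmax over the masked-LM logits, then truncate to the input length). This is an illustration, not code from this repository:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

name = "shibing624/macbert4csc-base-chinese"
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForMaskedLM.from_pretrained(name)

texts = ["今天新情很好", "少先队员因该为老人让坐"]
inputs = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

for text, ids in zip(texts, logits.argmax(dim=-1)):
    # Decode the per-position predictions, drop special tokens and spaces,
    # then truncate to the original length to ignore padding positions.
    corrected = tokenizer.decode(ids, skip_special_tokens=True).replace(" ", "")
    print(text, "->", corrected[: len(text)])
```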

banbsyip commented 2 years ago

I don't see what the problem is. Why not just use shibing624/macbert4csc-base-chinese?

I wanted to check whether a model trained on the SIGHAN2015 dataset can reproduce the results reported on GitHub, but after training my results are much worse. I am also not sure how to continue training from an existing model. And where can I download the [shibing624/macbert4csc-base-chinese] model? I just looked at the link you posted; the 409M model file there is that model, right?

banbsyip commented 2 years ago

I don't see what the problem is. Why not just use shibing624/macbert4csc-base-chinese?

One more question: macbert4csc-base-chinese does not contain a .ckpt file. Won't that affect prediction when running infer.py?

shibing624 commented 2 years ago

It won't.
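
To illustrate why, a sketch under the assumption that the directory contains the exported Hugging Face files shown above (config.json, vocab.txt, pytorch_model.bin): from_pretrained only reads those files, so no Lightning .ckpt is needed at inference time.

```python
from transformers import BertTokenizer, BertForMaskedLM

# Either the hub id or a local directory with config.json / vocab.txt / pytorch_model.bin;
# a .ckpt file is only a training checkpoint and is not read here.
model_dir = "shibing624/macbert4csc-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForMaskedLM.from_pretrained(model_dir)
print(type(model).__name__)  # BertForMaskedLM
```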

banbsyip commented 2 years ago

It won't.

I trained on my own ASR speech transcripts, and both recall and accuracy came out at 4.1%. I don't understand why they are so low, and I haven't been able to find the cause yet.

shibing624 commented 2 years ago

Look at the bad cases to find the cause.

1. Self-collected data is usually small but of high quality; you generally need to supplement it with the data from https://github.com/shibing624/pycorrector#Dataset so that the sample size is large enough for the model to fit well.
2. Analyze the bad cases to find the cause, and check whether the dataset itself contains errors; if so, fix them (a rough check is sketched below).
3. If you only have a few training samples, a rule-based approach is more practical.
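
As a starting point for item 2, a rough data sanity check, assuming the train.json format used by this repo's MacBERT example (records with original_text, correct_text, and wrong_ids; adjust the field names if your files differ):

```python
import json

with open("output/train.json", encoding="utf-8") as f:
    samples = json.load(f)

bad = []
for s in samples:
    src, tgt = s["original_text"], s["correct_text"]
    wrong_ids = s.get("wrong_ids", [])
    if len(src) != len(tgt):
        bad.append(("length mismatch", src, tgt))
        continue
    diff = [i for i, (a, b) in enumerate(zip(src, tgt)) if a != b]
    if sorted(wrong_ids) != diff:
        bad.append(("wrong_ids disagree with the actual edits", src, tgt))

print(f"{len(bad)} suspicious samples out of {len(samples)}")
for reason, src, tgt in bad[:20]:
    print(reason, "|", src, "=>", tgt)
```
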
banbsyip commented 2 years ago

Look at the bad cases to find the cause. 1. Self-collected data is usually small but of high quality; you generally need to supplement it with the data from https://github.com/shibing624/pycorrector#Dataset so that the sample size is large enough for the model to fit well. 2. Analyze the bad cases to find the cause, and check whether the dataset itself contains errors; if so, fix them. 3. If you only have a few training samples, a rule-based approach is more practical.

My use case is correcting customer-service ASR transcripts, so the data volume is fairly large, but I only used 100k samples for training. I use Alibaba's ASR output and our internal ASR output, with Alibaba's ASR as the label. Previously I only used the inconsistent (error-containing) pairs as the train/test/valid data; now I have added the consistent pairs to all three sets as well, for 1.7 million samples in total, with a ratio of erroneous to correct sentences of 7:10.
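
If it helps, a hedged sketch of how such ASR pairs could be turned into that training format (field names assumed to match the repo's train.json example; only equal-length pairs are kept, since the model predicts one output character per input character):

```python
import json

def build_records(pairs):
    """pairs: iterable of (internal_asr_text, ali_asr_label) tuples."""
    records = []
    for idx, (src, tgt) in enumerate(pairs):
        if len(src) != len(tgt):
            continue  # skip insert/delete-style mismatches this model cannot express
        wrong_ids = [i for i, (a, b) in enumerate(zip(src, tgt)) if a != b]
        records.append({
            "id": f"asr-{idx}",
            "original_text": src,
            "wrong_ids": wrong_ids,
            "correct_text": tgt,
        })
    return records

pairs = [("今天新情很好", "今天心情很好")]  # toy example; real pairs come from the two ASR systems
with open("output/train.json", "w", encoding="utf-8") as f:
    json.dump(build_records(pairs), f, ensure_ascii=False, indent=2)
```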

suchunxie commented 2 years ago

@shibing624 Hello, I would like to train a MacBert4csc model for a foreign language, but I have little NLP experience, so I would like to ask you a few questions.

1. Confirming the training steps:
   1-a. Collect a dataset and process it into the right format, apply LTP word segmentation, and for the MLM task replace [MASK] with synonyms and n-grams.
   1-b. Use Google's official pretraining_data.py to generate the pretraining data and train, producing a MacBERT model.
   1-c. With the MacBERT produced above, collect a correction dataset for the foreign language and train following the instructions in the repository's README.
   If my understanding of step 1 is wrong, I would appreciate your corrections.

2. Is the script for step 1-a publicly available? I would like to know where I can find it for reference.

I am sorry to take up your time, and I am very grateful.

shibing624 commented 2 years ago

Steps 1-a and 1-b can be skipped; just use hfl/chinese-macbert-base.

For the 1-a script, you can refer to https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py; for the MacBERT-specific model structure, see https://github.com/ymcui/MacBERT and adapt it.
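
For orientation, a condensed sketch of the plain MLM recipe that run_mlm.py implements, using transformers and datasets; the MacBERT-style synonym and n-gram masking from ymcui/MacBERT is not included here and would have to be added on top (file paths and hyperparameters are placeholders):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "hfl/chinese-macbert-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# corpus.txt: plain text, one sentence or paragraph per line (hypothetical path)
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

# Standard random 15% masking; MacBERT would replace masks with synonyms/n-grams instead.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="output/mlm", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```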

suchunxie commented 2 years ago

Thank you for the reply! One more question: what I want to train is a Japanese version. I tried hfl/chinese-macbert-base and it does not seem able to make predictions for Japanese. Can hfl/chinese-macbert-base still be used in this case, or do steps 1-a and 1-b need to be redone with Japanese data?

shibing624 commented 2 years ago

Use a Japanese model, or the multilingual BERT model: bert-base-multilingual-cased.
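
A quick, hedged way to compare coverage before committing to a checkpoint is to look at how many [UNK] tokens each tokenizer produces on a Japanese sample (the sentence below is arbitrary):

```python
from transformers import AutoTokenizer

for name in ["bert-base-multilingual-cased", "hfl/chinese-macbert-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize("日本語の誤り訂正モデルを学習したいです。")
    unk = pieces.count(tok.unk_token)
    # Many UNK tokens means the vocabulary covers the language poorly.
    print(name, pieces, f"({unk} UNK tokens)")
```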

suchunxie commented 2 years ago

Thank you! So I can just use an existing Japanese model, then collect a correction dataset for that language and train it following the method on the project README? In that case the "[MASK] replaced with synonyms and n-grams" step is skipped; is that step not required?

shibing624 commented 2 years ago

1. Yes, that works. 2. It is not required; the loss in performance is small.

suchunxie commented 2 years ago

Thank you for patiently answering despite being so busy! I'll try it this way and report back!

wnntju commented 1 year ago

Look at the bad cases to find the cause. 1. Self-collected data is usually small but of high quality; you generally need to supplement it with the data from https://github.com/shibing624/pycorrector#Dataset so that the sample size is large enough for the model to fit well. 2. Analyze the bad cases to find the cause, and check whether the dataset itself contains errors; if so, fix them. 3. If you only have a few training samples, a rule-based approach is more practical.

My use case is correcting customer-service ASR transcripts, so the data volume is fairly large, but I only used 100k samples for training. I use Alibaba's ASR output and our internal ASR output, with Alibaba's ASR as the label. Previously I only used the inconsistent (error-containing) pairs as the train/test/valid data; now I have added the consistent pairs to all three sets as well, for 1.7 million samples in total, with a ratio of erroneous to correct sentences of 7:10.

I am also working on ASR text correction. Has your CER dropped after correcting with MacBERT? Could we connect on WeChat to discuss? If that works for you, please contact me by email.

lrs01 commented 1 year ago

Hello, I followed the steps in the readme for pretraining and ran only 10 epochs, but in the end the macbert4csc folder under output is empty; the model does not seem to have been saved. Could you please advise? Thank you.

shibing624 commented 1 year ago

Is the dataset too small? Normally the model weights are saved after every epoch.

lrs01 commented 1 year ago

Thanks for the reply. The dataset is indeed small, probably only a thousand or so samples, prepared in the train.json format. But after running python train.py it only runs 10 epochs and then stops, and nothing is saved under the output/macbert folders.

shibing624 commented 1 year ago

Run it with the test data first; once that works, switch to your own dataset. num_epochs can be changed.

lrs02 commented 1 year ago

Run it with the test data first; once that works, switch to your own dataset. num_epochs can be changed.

OK, thank you.

lrs02 commented 1 year ago

Hello, after fine-tuning finished, I ran prediction (python infer.py) and got the following error: ValueError: Calling BertTokenizerFast.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.

shibing624 commented 1 year ago

Replace BertTokenizerFast with BertTokenizer.
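
Sketched out, the workaround looks like this (the directory path is the fine-tuned output folder from this thread; any directory containing vocab.txt and config.json should work):

```python
from transformers import BertTokenizer  # instead of BertTokenizerFast

# The slow tokenizer can be built from vocab.txt in the fine-tuned output directory.
tokenizer = BertTokenizer.from_pretrained("output/macbert4csc")
print(tokenizer.tokenize("今天新情很好"))
```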

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. (This issue was closed automatically by the bot due to prolonged inactivity; feel free to ask again if needed.)