Closed: MiuNa-Yang closed this issue 5 months ago.
I suspect `is_split_into_words=True` is the problem; it should be `False`, since `src` is not pre-tokenized.
----- Original message ----- From: "MiuNa-Yang" Sent: Thursday, 2024-05-23, 11:04 AM Subject: [gingasan/lemon] Abnormal output when reproducing inference (Issue #8)
The reproduction code is as follows:
```python
from model.ReLM.autocsc import AutoCSCReLM
import torch
from transformers import AutoTokenizer

bert_base_path = '/path/to/bert-base-chinese'
relm_path = '/path/to/relm-m0.3.bin'
tokenizer = AutoTokenizer.from_pretrained(bert_base_path,
                                          use_fast=True,
                                          add_prefix_space=True)
model = AutoCSCReLM.from_pretrained(pretrained_model_name_or_path=bert_base_path,
                                    state_dict=torch.load(relm_path),
                                    cache_dir="cache")

src = ['发动机故障切纪盲目拆检']
trg = ['发动机故障切忌盲目拆检']
max_seq_length = 50
src_ids = tokenizer(src,
                    max_length=max_seq_length // 2 - 2,
                    truncation=True,
                    is_split_into_words=True,
                    add_special_tokens=False).input_ids
trg_ids = tokenizer(trg,
                    max_length=max_seq_length // 2 - 2,
                    truncation=True,
                    is_split_into_words=True,
                    add_special_tokens=False).input_ids

input_ids = [tokenizer.cls_token_id] + src_ids + [tokenizer.sep_token_id] + [tokenizer.mask_token_id for _ in trg_ids] + [tokenizer.sep_token_id]
label_ids = [tokenizer.cls_token_id] + src_ids + [tokenizer.sep_token_id] + trg_ids + [tokenizer.sep_token_id]
attention_mask = [1] * len(input_ids)
ref_ids = [tokenizer.cls_token_id] + trg_ids + [tokenizer.sep_token_id] + trg_ids + [tokenizer.sep_token_id]

input_ids = torch.tensor([input_ids])
label_ids = torch.tensor([label_ids])
attention_mask = torch.tensor([attention_mask])
print(f'{input_ids=}')
res = model(input_ids, attention_mask, label_ids)
print(f'{res=}')
```
The final output is:

```
{'loss': tensor(nan, grad_fn=<...>), 'predict_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
```

I am not sure which step went wrong.
This was written following the way the data is processed in the run script; with `is_split_into_words` set to `False` it is the same situation (in that case `src` and `trg` are passed in as plain strings, which is equivalent). Both variants produce identical results.
The specific changes are as follows:
```python
src = '发动机故障切纪盲目拆检'
trg = '发动机故障切忌盲目拆检'
max_seq_length = 50
src_ids = tokenizer(src,
                    max_length=max_seq_length // 2 - 2,
                    truncation=True,
                    # is_split_into_words=True,
                    add_special_tokens=False).input_ids
trg_ids = tokenizer(trg,
                    max_length=max_seq_length // 2 - 2,
                    truncation=True,
                    # is_split_into_words=True,
                    add_special_tokens=False).input_ids

res = model(input_ids, label_ids, attention_mask)
```
Thanks for the pointer, I reproduced it successfully. The result is:

```
{'loss': tensor(0.0036, grad_fn=<...>), ...}
```

Decoded output:

```
。 发 动 机 故 障 切 忌 盲 目 拆 检 检 发 动 机 故 障 切 忌 盲 目 拆 检 检
```

So some post-processing is needed to slice out the corrected segment, right? And does this approach only work when the corrected output has the same length as the input?
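The slicing mentioned above can be sketched in plain Python. This is a minimal illustration, not the repository's code: `extract_correction` is a hypothetical helper, the token ids are toy values, and the layout `[CLS] src [SEP] masked-target [SEP]` is taken from the reproduction code above.

```python
# Hypothetical post-processing sketch: recover the corrected span from
# predictions laid out as [CLS] src [SEP] <masked target> [SEP].
def extract_correction(predict_ids, src_len):
    start = 1 + src_len + 1                    # skip [CLS], the src tokens, and the first [SEP]
    return predict_ids[start:start + src_len]  # target span has the same length as src

# Toy ids standing in for real vocabulary ids.
CLS, SEP = 101, 102
src_ids = [10, 11, 12]                                  # e.g. the corrupted characters
pred = [CLS] + src_ids + [SEP] + [10, 99, 12] + [SEP]   # model output, one id corrected
print(extract_correction(pred, len(src_ids)))           # [10, 99, 12]
```

After this slice, `tokenizer.decode` on the extracted ids would give the corrected sentence without the echoed prefix.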
Looking at the model code, it seems the loss is not computed over the special tokens. I think that if the special tokens were included in training, the model could handle inputs and outputs of unequal length; otherwise there is no way to deal with that at inference time.
Yes, we did try letting the model learn the padding; it has some effect, but not a good one. The length restriction is indeed a limitation of ReLM, and improvements are welcome.
Got it. Thanks a lot for the explanation, and best of luck with your research!