Closed: MiuNa-Yang closed this issue 5 months ago.
I suspect `is_split_into_words=True` is the problem; it should be `False`, since `src` is not pre-tokenized.
----- Original message ----- From: "MiuNa-Yang" Sent: Thursday, 2024-05-23, 11:04 AM Subject: [gingasan/lemon] Abnormal output when reproducing inference (Issue #8)
The reproduction code is as follows:
```python
from model.ReLM.autocsc import AutoCSCReLM
import torch
from transformers import AutoTokenizer

bert_base_path = '/path/to/bert-base-chinese'
relm_path = '/path/to/relm-m0.3.bin'
tokenizer = AutoTokenizer.from_pretrained(bert_base_path,
                                          use_fast=True,
                                          add_prefix_space=True)
model = AutoCSCReLM.from_pretrained(pretrained_model_name_or_path=bert_base_path,
                                    state_dict=torch.load(relm_path),
                                    cache_dir="cache")

src = ['发动机故障切纪盲目拆检']
trg = ['发动机故障切忌盲目拆检']
max_seq_length = 50
src_ids = tokenizer(src,
                    max_length=max_seq_length // 2 - 2,
                    truncation=True,
                    is_split_into_words=True,
                    add_special_tokens=False).input_ids
trg_ids = tokenizer(trg,
                    max_length=max_seq_length // 2 - 2,
                    truncation=True,
                    is_split_into_words=True,
                    add_special_tokens=False).input_ids

input_ids = [tokenizer.cls_token_id] + src_ids + [tokenizer.sep_token_id] + [tokenizer.mask_token_id for _ in trg_ids] + [tokenizer.sep_token_id]
label_ids = [tokenizer.cls_token_id] + src_ids + [tokenizer.sep_token_id] + trg_ids + [tokenizer.sep_token_id]
attention_mask = [1] * len(input_ids)
ref_ids = [tokenizer.cls_token_id] + trg_ids + [tokenizer.sep_token_id] + trg_ids + [tokenizer.sep_token_id]

input_ids = torch.tensor([input_ids])
label_ids = torch.tensor([label_ids])
attention_mask = torch.tensor([attention_mask])
print(f'{input_ids=}')
res = model(input_ids, attention_mask, label_ids)
print(f'{res=}')
```
The final output is:

```
{'loss': tensor(nan, grad_fn=<...>), 'predict_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
```

I am not sure which step went wrong.
This was written following the way the data is processed in the run script; with `is_split_into_words` set to `False` it is the same situation (in that case `src` and `trg` are passed in as plain strings, which is equivalent). Both variants produce identical results.
The specific changes are as follows:
```python
src = '发动机故障切纪盲目拆检'
trg = '发动机故障切忌盲目拆检'
max_seq_length = 50
src_ids = tokenizer(src,
                    max_length=max_seq_length // 2 - 2,
                    truncation=True,
                    # is_split_into_words=True,
                    add_special_tokens=False).input_ids
trg_ids = tokenizer(trg,
                    max_length=max_seq_length // 2 - 2,
                    truncation=True,
                    # is_split_into_words=True,
                    add_special_tokens=False).input_ids

res = model(input_ids, label_ids, attention_mask)
```
Thanks for the pointer, I reproduced it successfully. The result is:

```
{'loss': tensor(0.0036, grad_fn=<...>), ...}
```

Decoded output:

```
。 发 动 机 故 障 切 忌 盲 目 拆 检 检 发 动 机 故 障 切 忌 盲 目 拆 检 检
```

So some post-processing is needed to slice out the corrected segment, right? And does this approach only work when the corrected output has the same length as the input?
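The slicing mentioned above can be sketched in plain Python. This is a minimal illustration, not the repository's code: `extract_correction` is a hypothetical helper, the token ids are toy values, and the layout `[CLS] src [SEP] masked-target [SEP]` is taken from the reproduction code above.

```python
# Hypothetical post-processing sketch: recover the corrected span from
# predictions laid out as [CLS] src [SEP] <masked target> [SEP].
def extract_correction(predict_ids, src_len):
    start = 1 + src_len + 1                    # skip [CLS], the src tokens, and the first [SEP]
    return predict_ids[start:start + src_len]  # target span has the same length as src

# Toy ids standing in for real vocabulary ids.
CLS, SEP = 101, 102
src_ids = [10, 11, 12]                                  # e.g. the corrupted characters
pred = [CLS] + src_ids + [SEP] + [10, 99, 12] + [SEP]   # model output, one id corrected
print(extract_correction(pred, len(src_ids)))           # [10, 99, 12]
```

After this slice, `tokenizer.decode` on the extracted ids would give the corrected sentence without the echoed prefix.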
Looking at the model code, it seems the loss is not computed over the special tokens. I think that if the special tokens were included in training, the model could handle inputs and outputs of unequal length; otherwise there is no way to deal with that at inference time.
Yes, we did try letting the model learn the padding; it has some effect, but not a good one. The length restriction is indeed a limitation of ReLM, and improvements are welcome.
Got it. Thanks a lot for the explanation, and best of luck with your research!