Question about using ALBERT with huggingface/transformers

renjunxiang commented 4 years ago

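Thank you very much for open-sourcing this project. I ran into some problems while using it and would like to ask for advice. In my earlier projects I always froze BERT and fed its output to an LSTM for the downstream task, so with huggingface/transformers I simply used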

bert_path = 'D:/pretrain/pytorch/albert_base/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = BertModel.from_pretrained(bert_path)
...
with torch.no_grad():
    last_hidden_states = BERT(input_ids)[0]

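to get the semantic representation that BERT outputs for the text. With ALBERT, however, I found that last_hidden_states changes every time the model is loaded, as if it were random numbers.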

bert_path = 'D:/pretrain/pytorch/albert_base/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = AlbertModel.from_pretrained(bert_path)
...
with torch.no_grad():
    last_hidden_states = BERT(input_ids)[0]

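Loaded this way, the output is no longer random and is the same on every run. So my question is: to get ALBERT's representation of a text through huggingface, does the model have to be loaded with AlbertModel?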

Having noticed this, I also tried RoBERTa.

bert_path = 'D:/pretrain/pytorch/chinese_roberta_wwm_ext/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = BertModel.from_pretrained(bert_path)
...
with torch.no_grad():
    last_hidden_states = BERT(input_ids)[0]

This is how I have always used it, without any problems.

bert_path = 'D:/pretrain/pytorch/chinese_roberta_wwm_ext/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = RobertaModel.from_pretrained(bert_path)
...
with torch.no_grad():
    last_hidden_states = BERT(input_ids)[0]

Switching to this makes the output random again: it differs every time the model is loaded.

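Do you know what causes this? Or is it that the ALBERT and BERT configs in huggingface are not interchangeable, because ALBERT adds a 128 -> hidden_dim projection between the embeddings and the attention layers, so BertModel cannot be used to load it?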

Thank you for your help!

lonePatient commented 4 years ago

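@renjunxiang Hello. Do you mean that the results produced after loading with AlbertModel.from_pretrained(bert_path) are random? Are you using my modeling_albert model file? My implementation currently differs from huggingface's, so try loading with a model file and a pretrained checkpoint that match each other. If the problem persists, join QQ group 836811304 (Chinese pretrained models), where it is easier to discuss.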

renjunxiang commented 4 years ago

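@lonePatient Hello! I am using Google's official model files, converted to PyTorch with convert_albert_original_tf_checkpoint_to_pytorch.py from the latest version of huggingface. I read an issue opened by someone else, where you replied that this can be done by adapting the huggingface method.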

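Loading with AlbertModel.from_pretrained(bert_path) works: the results are fixed. Loading with BertModel.from_pretrained(bert_path) is the problem: the results are random. I previously used Google's bert-base-chinese, HIT's wwm, wwm-ext and roberta, and Sinovation Ventures' ZEN, all loaded through BertModel. That is why I am unsure which class ALBERT should be loaded with, since neither AlbertModel nor BertModel raises an error.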

Because loading albert_base with BertModel gave random results, I tried RoBERTa. Loading chinese_roberta_wwm_ext with huggingface's RobertaModel.from_pretrained(bert_path) also makes the output random.

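So I am a bit confused: do BERT-family models need BertModel to load the pretrained weights, and ALBERT-family models need AlbertModel? ALBERT has an extra layer, embedding_hidden_mapping_in. Could the randomness with BertModel come from BertModel having no layer of that name, so that the layer's weights are never loaded?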
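The hypothesis in the paragraph above can be illustrated with a simplified model of name-based weight loading. This is only a sketch, not the actual loading code in transformers: `match_keys` is a hypothetical helper, the key names are examples rather than a dump of a real checkpoint, and tensor shapes are ignored. The idea is that a parameter whose name is absent from the checkpoint keeps its random initialization, which would explain outputs that change on every load.

```python
# Illustrative only: a simplified model of name-based weight matching.
# The key names below are examples, not a dump of any real checkpoint.

def match_keys(checkpoint_keys, model_keys):
    """Return (loaded, missing, unexpected) key lists, each sorted."""
    ckpt, model = set(checkpoint_keys), set(model_keys)
    loaded = sorted(ckpt & model)       # names found on both sides
    missing = sorted(model - ckpt)      # model parameters with no stored value
    unexpected = sorted(ckpt - model)   # stored values with no matching parameter
    return loaded, missing, unexpected

albert_checkpoint = [
    "embeddings.word_embeddings.weight",
    "encoder.embedding_hidden_mapping_in.weight",
    "encoder.albert_layer_groups.0.albert_layers.0.attention.query.weight",
]
bert_model = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.self.query.weight",
]

loaded, missing, unexpected = match_keys(albert_checkpoint, bert_model)
print("loaded:", loaded)
print("missing (would stay randomly initialized):", missing)
print("unexpected (would be ignored):", unexpected)
```

In this toy case the attention weight of the BERT-style model has no counterpart in the ALBERT-style checkpoint, so it would never receive a pretrained value.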

RobertaModel's network structure is essentially BERT. Does that mean RobertaModel is not suited to extracting last_hidden_states and is meant for direct fine-tuning instead?

One more question: AlbertTokenizer.from_pretrained(bert_path) fails with the error We assumed 'D:/pretrain/pytorch/albert_base/' was a path or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url. Should text-to-id conversion always go through BertTokenizer.from_pretrained(bert_path)? And what is spiece.model?

Thank you for your answers.

The full code is below:

import torch
from transformers import BertTokenizer, BertModel
from transformers import RobertaTokenizer,RobertaModel
from transformers import AlbertTokenizer,AlbertModel

device = 'cuda:0'
bert_path = 'D:/pretrain/pytorch/albert_base/'
# tokenizer = AlbertTokenizer.from_pretrained(bert_path) # raises an error
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = AlbertModel.from_pretrained(bert_path) # output is fixed
# BERT = BertModel.from_pretrained(bert_path) # output is random
BERT.eval()
BERT = BERT.to(device)

text_seqs = []
segments_ids = []
text_seq = tokenizer.convert_tokens_to_ids(['[CLS]', '我', '爱', '北', '京', '[SEP]','[PAD]'])
text_seqs.append(text_seq)
segments_ids.append([0] * 7)
text_seqs = torch.LongTensor(text_seqs).to(device)
segments_ids = torch.LongTensor(segments_ids).to(device)

mask_bert = torch.where(text_seqs == 0,
                        torch.zeros_like(text_seqs),
                        torch.ones_like(text_seqs))
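with torch.no_grad():
    # Index the result instead of unpacking it, so this works whether the
    # model returns a plain tuple or an output object.
    outputs = BERT(text_seqs, token_type_ids=segments_ids, attention_mask=mask_bert)
# outputs[0] is last_hidden_states; [-1] picks the last (here the only) sequence in the batch
sentence_features = outputs[0][-1]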

for i in sentence_features:
    print(i[:4])
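The mask in the code above assumes that the [PAD] token has id 0. A minimal sketch of building the mask from an explicit padding id instead is shown here with plain lists. `build_attention_mask` is a hypothetical helper and the ids are made up; with tensors, `(text_seqs != pad_id).long()` gives the same result, and the padding id can be read from `tokenizer.pad_token_id`.

```python
def build_attention_mask(batch_ids, pad_id):
    """Return 1 for real tokens and 0 for padding, one list per sequence."""
    return [[0 if tok == pad_id else 1 for tok in seq] for seq in batch_ids]

# Made-up ids for illustration; 0 plays the role of [PAD] here.
ids = [[101, 2769, 4263, 102, 0]]
print(build_attention_mask(ids, pad_id=0))  # [[1, 1, 1, 1, 0]]
```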

lonePatient commented 4 years ago

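@renjunxiang

1. An ALBERT model cannot be loaded with BertModel, because the model structure differs in a few small ways.
2. If you are using Google's Chinese release, you should be able to use huggingface's convert script and modeling file. If you are using the bright release, you can probably only use the model files from this GitHub repository.
3. spiece.model is what Google's ALBERT uses for English. For Chinese they use wordpiece, and bright uses wordpiece too. You only need to specify the spm model file when loading an English model; otherwise specifying the vocab file is enough.
4. The Chinese RoBERTa, bert-wwm and similar models were all essentially trained with BertModel, so BertModel can load them. ALBERT was not trained with BertModel.
5. The tokenizer is the same for the Chinese releases, so BertTokenizer can be used.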
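One way to check whether a given class really picked up the checkpoint weights is to ask `from_pretrained` for its loading report via `output_loading_info=True`, which returns the model together with a dict of key lists. The sketch below wraps that call. `report_loading` is a hypothetical helper, and it only assumes that the class passed in behaves like a transformers model class.

```python
def report_loading(model_cls, path):
    """Load `path` with `model_cls` and summarize which weights did not match.

    `model_cls.from_pretrained(path, output_loading_info=True)` is expected to
    return a (model, info) pair, where `info` is a dict of key lists.
    """
    model, info = model_cls.from_pretrained(path, output_loading_info=True)
    missing = list(info.get("missing_keys", []))        # not found in the checkpoint
    unexpected = list(info.get("unexpected_keys", []))  # in the checkpoint but unused
    report = {
        "missing": missing,
        "unexpected": unexpected,
        "clean": not missing and not unexpected,
    }
    return model, report

# Usage sketch (requires transformers; the path is the one used in this thread):
# from transformers import AlbertModel, BertModel
# _, report = report_loading(BertModel, 'D:/pretrain/pytorch/albert_base/')
# print(report["clean"], report["missing"][:5])
```

A long list of missing keys suggests that most of the model was left at its random initialization, which would match the changing outputs reported earlier in the thread.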

lonePatient commented 4 years ago

My own implementation of modeling_albert is not quite the same as huggingface's. Swapping in the matching modeling_albert model file should be enough to make it work.

renjunxiang commented 4 years ago

Thank you very much for your patient answers!

Caleb66666 commented 4 years ago

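I learned a lot from this discussion, especially from @lonePatient's answers. If you are using an English model with transformers, you can rename 30-k-clean.model to spiece.model and albert_config.json to config.json.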

lonePatient / albert_pytorch

Question about using ALBERT with huggingface/transformers #36