How can I use this pretrained model's encoder to extract embeddings for arbitrary triples or sentences?

txsun1997 / CoLAKE

COLING'2020: CoLAKE: Contextualized Language and Knowledge Embedding

https://aclanthology.org/2020.coling-main.327/

MIT License

114 stars 17 forks

How can I use this pretrained model's encoder to extract embeddings for arbitrary triples or sentences? #12

Closed: Einstone-rose closed this issue 3 years ago

Einstone-rose commented 3 years ago

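For example, suppose I have the sentence "The weather is rainy and we need umbrella." Can I use the encoder of the pretrained model you provide to get embeddings for each word and for the whole sentence? Could you give some concrete guidance? Thanks a lot!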

txsun1997 commented 3 years ago

Yes. Usage is the same as with Hugging Face transformers; you just need to replace the pretrained weights with CoLAKE's.

Einstone-rose commented 3 years ago

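> Yes. Usage is the same as with Hugging Face transformers; you just need to replace the pretrained weights with CoLAKE's.

Thanks for the quick reply. I have only just started working with BERT + KG, so I may have a lot of questions. I noticed that the code has config = RobertaConfig.from_pretrained('roberta-base', type_vocab_size=3). Here type_vocab_size=3, whereas it is 2 in BERT and 1 in RoBERTa (which drops NSP), and I don't quite understand why it is 3. My understanding is that because the model was trained on (head, rel, tail) triples from Wikipedia data, three types are used to distinguish head, rel, and tail; is that right? Also, once I have loaded model.bin, if I want to feed the sentence "The weather is rainy and we need umbrella" into the model to get its embedding, how does preprocessing convert it into the input format the model requires? I would appreciate any guidance. Thanks a lot.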

txsun1997 commented 3 years ago

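type_vocab_size=3 because there are three types of nodes: word, entity, and relation.
For how to convert a sentence into model input, see the Hugging Face docs: https://huggingface.co/transformers. An example:
```
from transformers import RobertaTokenizer, RobertaModel
import torch

# The tokenizer is the standard roberta-base one; the weights come from the CoLAKE checkpoint directory
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('./colake_model_path')

# Tokenize the sentence into PyTorch tensors and run the encoder
inputs = tokenizer("The weather is rainy and we need umbrella", return_tensors="pt")
outputs = model(**inputs)

# Per-token hidden states, shape (batch, seq_len, hidden_size)
last_hidden_states = outputs.last_hidden_state
```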
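The example above yields one vector per token. The original question also asked for a whole-sentence embedding, and the thread does not say how the authors would build one. A minimal sketch, assuming mask-aware mean pooling over the token vectors (a common choice, not something confirmed by the CoLAKE authors); `mean_pool` is a hypothetical helper, and the random tensor stands in for `outputs.last_hidden_state`:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    Hypothetical helper for illustration; not part of CoLAKE or transformers.
    """
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts

# Stand-in for outputs.last_hidden_state: 2 sentences, 5 tokens, hidden size 8
hidden = torch.randn(2, 5, 8)
# Stand-in for inputs["attention_mask"]: the second sentence has 2 padded positions
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])

sentence_emb = mean_pool(hidden, mask)  # shape (2, 8)
```

With the real model, this would be called as `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`. Taking the vector of the first token (`<s>`) is another common option; which works better for CoLAKE is not stated in this thread.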

GX77 commented 3 years ago

Hello, I have a question. Is the inputs in your example the actual input to the model? Shouldn't the input be obtained from Token Embeddings + Type Embeddings + Position Embeddings?

txsun1997 commented 3 years ago

You can provide just the token ids; the type and position ids will be filled in with their defaults.
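To make the default-filling concrete, here is a small sketch of what Hugging Face's `RobertaModel` does when only token ids are passed. It uses a tiny, randomly initialised model with made-up sizes (an assumption so that it runs without downloading anything), not the CoLAKE checkpoint. As far as I know, the defaults are all-zero `token_type_ids` and position ids that start at `pad_token_id + 1` for non-padding tokens; the sketch checks this by comparing the two calls rather than taking it on faith.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Tiny random stand-in with hypothetical sizes; NOT the CoLAKE weights.
# type_vocab_size=3 mirrors the config mentioned in this thread.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    type_vocab_size=3,
)
model = RobertaModel(config).eval()  # eval() disables dropout

# Arbitrary non-padding token ids (the default pad id for RoBERTa configs is 1)
input_ids = torch.tensor([[0, 5, 6, 7, 2]])
seq_len = input_ids.shape[1]

with torch.no_grad():
    # Only token ids: type and position ids are filled in by the model
    default_out = model(input_ids=input_ids).last_hidden_state

    # The same call with the assumed defaults spelled out
    token_type_ids = torch.zeros_like(input_ids)
    position_ids = torch.arange(seq_len).unsqueeze(0) + config.pad_token_id + 1
    explicit_out = model(
        input_ids=input_ids,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
    ).last_hidden_state

# True if the explicit ids reproduce the default behaviour
same = torch.allclose(default_out, explicit_out, atol=1e-5)
```

For a plain sentence this means every token gets type id 0. Inputs that also contain entity or relation nodes would presumably need explicit `token_type_ids`; the exact id assigned to each node type is not given in this thread, so check the CoLAKE code before relying on it.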