terryqj0107 / RiSAWOZ

Datasets and codes for the paper "RiSAWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset with Rich Semantic Annotations for Task-Oriented Dialogue Modeling". (EMNLP 2020)
MIT License

Bug in data preprocessing #5

Closed Lavender0225 closed 1 year ago

Lavender0225 commented 1 year ago

Hi, I found a bug in the data preprocessing.

[image]

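The append to `usr_intents` happens only in the `else` branch, not in the `if` branch, so some intents get dropped. The processed data does in fact show missing intents.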

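Also, one question: for the NLU data, isn't it enough to keep only the usr intents and slots? The sys ones shouldn't be needed, right? The code keeps the sys ones as well; what is the reasoning behind that?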

Lavender0225 commented 1 year ago

The code in question is in preprocess_RiSAWoz.py.

Lavender0225 commented 1 year ago

Figured it out: this is not a bug. The CrossWOZ paper states it very clearly: "For dialogue acts of inform and recommend intents such as (intent=Inform, domain=Attraction, slot=fee, value=free) whose values appear in the sentence, we perform sequential labeling using an MLP which takes word embeddings ("free") as input and outputs tags in BIO schema ("B-Inform-Attraction-fee").

For each of the other dialogue acts (e.g., (intent=Request, domain=Attraction, slot=fee)) that do not have actual values, we use another MLP to perform binary classification on the sentence representation to predict whether the sentence should be labeled with this dialogue act."

So the intent list does not contain inform and request, because inform and request are handled through BIO sequence labeling.
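To make the split described in the quoted passage concrete, here is a minimal Python sketch. It is not the code from preprocess_RiSAWoz.py: the function names `find_sublist` and `split_dialogue_acts`, the tuple layout of a dialogue act, and the whitespace tokenization of values are all assumptions made for illustration. Acts whose value appears in the sentence become BIO tags; every other act is collected as a sentence-level intent label, which is why the append sits only in the `else` branch.

```python
def find_sublist(tokens, sub):
    """Return the start index of `sub` inside `tokens`, or -1 if absent."""
    n = len(sub)
    if n == 0:
        return -1
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == sub:
            return i
    return -1


def split_dialogue_acts(tokens, acts):
    """Split dialogue acts into BIO tags and sentence-level intent labels.

    tokens: list of str (the tokenized utterance)
    acts:   list of (intent, domain, slot, value) tuples; value may be ""
    Hypothetical helper for illustration, not the repository's function.
    """
    tags = ["O"] * len(tokens)
    intents = []
    for intent, domain, slot, value in acts:
        label = f"{intent}-{domain}-{slot}"
        # Assumption: values are tokenized by whitespace, like the utterance.
        start = find_sublist(tokens, value.split()) if value else -1
        if start >= 0:
            # Value found in the sentence: label it with BIO tags.
            # Nothing is appended to `intents` in this branch.
            length = len(value.split())
            tags[start] = "B-" + label
            for i in range(start + 1, start + length):
                tags[i] = "I-" + label
        else:
            # No value to tag: keep the act as a classification label.
            intents.append(label)
    return tags, intents


tokens = ["the", "ticket", "is", "free"]
acts = [
    ("Inform", "Attraction", "fee", "free"),
    ("Request", "Attraction", "fee", ""),
]
tags, intents = split_dialogue_acts(tokens, acts)
print(tags)
print(intents)
```

With this layout, an act that ends up in `tags` is intentionally absent from `intents`, so a missing entry in the intent list does not by itself indicate lost data.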