Dataset - Githubissues

wjczf123 commented 3 years ago

Hi, I notice that this paper uses the CLINC and BANKING datasets, while your previous work (Discovering New Intents via Constrained Deep Adaptive Clustering with Cluster Refinement) uses the SNIPS, DBPedia, and StackOverflow datasets. It seems that these two studies address the same task? From your perspective, which benchmark datasets should be used in the future?

HanleiZhang commented 3 years ago

Thanks for your valuable question~ We recommend using CLINC and BANKING as benchmark datasets. Our first work, CDAC+ [1], uses SNIPS, DBPedia, and StackOverflow as evaluation datasets. However, these datasets have too few intent categories (7, 14, and 20, respectively). Moreover, their taxonomies are relatively easy to discriminate, with few semantically similar intents. Later, CLINC [2] (EMNLP 2019) and BANKING [3] (ACL 2020) were proposed as new dialogue benchmarks, as summarized in DialoGLUE [4]. We find these two datasets more appropriate because they have many more intent categories (150 and 77, respectively). They are more challenging and closer to real scenarios. Therefore, we recommend CLINC and BANKING as two of the benchmark datasets.

[1] Ting-En Lin, Hua Xu, and Hanlei Zhang. 2020. Discovering New Intents via Constrained Deep Adaptive Clustering with Cluster Refinement. In Proceedings of AAAI 2020.
[2] Stefan Larson, Anish Mahendran, Joseph J. Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, and Jason Mars. 2019. An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction. In Proceedings of EMNLP-IJCNLP 2019.
[3] Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. 2020. Efficient Intent Detection with Dual Sentence Encoders. In Proceedings of ACL 2020.
[4] Shikib Mehri, Mihail Eric, and Dilek Hakkani-Tur. 2020. DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue. arXiv.

wjczf123 commented 3 years ago

@HanleiZhang Thank you for your detailed reply. I also notice that the default value of factor_of_clusters is 1. Does this mean that the ground-truth number of classes is used for training? E.g., we set K' = 150 for the CLINC dataset when factor_of_clusters is 1.

HanleiZhang commented 3 years ago

Yes. We set the number of clusters to the ground-truth number of classes, the same as in our previous work, CDAC+. We also investigate the influence of the assigned number of clusters by varying it from the ground-truth number to four times that number. The related discussion and results are in the section "Results and Discussion -> Effect of the Number of Clusters" of the original paper: https://arxiv.org/pdf/2012.08987.pdf.
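
For illustration only, the relationship between factor_of_clusters and K' described in this exchange can be sketched as below. The helper name num_clusters and the rounding via int() are assumptions made for this sketch; the repository may compute the value differently.

```python
def num_clusters(num_ground_truth_classes: int, factor_of_clusters: float = 1.0) -> int:
    """Hypothetical helper: K' = ground-truth class count * factor_of_clusters."""
    return int(num_ground_truth_classes * factor_of_clusters)


# Ground-truth intent counts mentioned in this thread.
ground_truth = {"CLINC": 150, "BANKING": 77}

# Vary the factor from 1 (ground truth) to 4 (four times the ground truth).
for name, k in ground_truth.items():
    settings = [num_clusters(k, factor) for factor in (1, 2, 3, 4)]
    print(name, settings)
```

With the default factor of 1, K' equals the ground-truth number of classes, which matches the K' = 150 example for CLINC above.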

wjczf123 commented 3 years ago

@HanleiZhang OK. Thank you very much!

wjczf123 commented 3 years ago

@HanleiZhang I'm sorry to disturb you again. I have a question about G. I know that you want to use C_l to guide the next round of clustering, but I still don't understand the details. The Hungarian algorithm finds a one-to-one mapping (from C_l to C_c). But how do you get y_align from y_c and G^{-1}?

HanleiZhang commented 3 years ago

That's a good question. We hope this clears up the confusion.
First, each sample is assigned a pseudo-label y_c after clustering. However, y_c may not correspond to the label from the previous epoch, because the cluster assignments (indices) are permuted arbitrarily between clustering runs. Therefore, we use the cluster centroids as the alignment targets and find the projection G with the Hungarian algorithm. Note that y_c is the cluster index (e.g., y_c = 1 means the sample is assigned to the first centroid). G^{-1} maps the centroids in the current training epoch (C_c) to the centroids in the last epoch (C_l). We apply G^{-1} to each pseudo-label y_c in the current epoch to align it with the pseudo-labels of the last epoch, which gives the aligned targets y_align.
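
As a rough sketch of the alignment step described above (not necessarily the repository's implementation), the snippet below matches last-epoch centroids to current-epoch centroids with SciPy's linear_sum_assignment and then relabels the current pseudo-labels. The function name align_pseudo_labels, the Euclidean cost, and the toy centroids are assumptions made for this example; indices are 0-based here, whereas the explanation above counts clusters from 1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_pseudo_labels(last_centroids, curr_centroids, curr_labels):
    """Hypothetical helper: map current cluster indices onto last-epoch indices."""
    # Cost matrix: Euclidean distance between every last-epoch centroid (rows)
    # and every current-epoch centroid (columns). The distance is an assumption.
    cost = np.linalg.norm(
        last_centroids[:, None, :] - curr_centroids[None, :, :], axis=2
    )
    # Hungarian matching: last-epoch index row_ind[i] pairs with current index col_ind[i].
    row_ind, col_ind = linear_sum_assignment(cost)
    # Invert the matching (the role of G^{-1}): current index -> last-epoch index.
    curr_to_last = np.empty(len(col_ind), dtype=int)
    curr_to_last[col_ind] = row_ind
    # Relabel every current pseudo-label y_c to obtain y_align.
    return curr_to_last[curr_labels]


# Toy data: the current centroids are a permuted, slightly shifted copy of the last ones.
last_centroids = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
curr_centroids = np.array([[10.1, 0.2], [0.1, -0.1], [4.9, 5.2]])
curr_labels = np.array([0, 1, 2, 2, 1])  # y_c from the current clustering run

aligned = align_pseudo_labels(last_centroids, curr_centroids, curr_labels)
print(aligned.tolist())  # [2, 0, 1, 1, 0]
```

In this toy case, current cluster 0 sits next to last-epoch centroid 2, so every sample with y_c = 0 receives the aligned label 2, and likewise for the other clusters.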

wjczf123 commented 3 years ago

@HanleiZhang Thank you for your detailed reply. I understand it. Thank you very much.

thuiar / DeepAligned-Clustering

Dataset #2