Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora

0. Paper

Journal/Conference: ACL 2020
Title: Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora
Authors: Hila Gonen, Ganesh Jawahar, Djamé Seddah, Yoav Goldberg
URL: https://www.aclweb.org/anthology/2020.acl-main.51/

1. What is it?

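Proposes a simple and stable method for capturing changes in word usage across corpora. Specifically, it proposes a measure that compares the words surrounding a target word (its k nearest neighbors) in each corpus.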

2. How does it improve on prior work?

The method is simpler than existing approaches that align latent spaces, and it is more stable and more interpretable.

3. What is the key to the technique?

Instead of aligning word vectors, it measures how close the sets of k nearest neighbors are, which improves stability and interpretability.

4. How was its effectiveness validated?

5. Is there any discussion?

The method is essentially a simplification of existing work, so which aspects did the reviewers value?

6. Which paper should be read next?

A comparison of existing methods for detecting words whose usage has changed:
Dominik Schlechtweg, Anna Hätty, Marco Del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732–746, Florence, Italy. Association for Computational Linguistics.

Notes

Applying this makes it possible to see which words a term has drifted toward over time!
The related work discussed here deserves a careful read.
Code: https://github.com/gonenhila/usage_change

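Abstract

Finding words that are used differently between two corpora is a recurring problem in the digital humanities and computational social science. The common approach trains word embeddings on each corpus, aligns the vector spaces, and looks for words whose cosine distance in the aligned space is large, but this is unreliable. The paper proposes an alternative approach that considers each word's nearest neighbors instead. The method is demonstrated on 9 different dataset setups.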

1 Intro

Analyses of words whose meaning changes over time, or that are used differently by different populations, call for a robust method. The common approach, which applies a vector-space alignment algorithm and then measures cosine similarity, is unstable (Sections 3 and 7):
William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016b. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501, Berlin, Germany. Association for Computational Linguistics.

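Proposed method → Under the hypothesis that words used markedly differently across corpora appear in different contexts, take the top-k neighborhoods of a word in each corpus and measure how much vocabulary they share.
・simplicity:
・stability: unlike the alignment approach, it gives similar results even when the word embeddings differ
・interpretability: allows intuitive analysis
・locality: each word's score depends only on its neighbor words (projection-based methods depend considerably on the algorithm)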

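The method can also be applied across languages. It can serve as one useful method for detecting word change. The paper also proposes a toolkit for visualization.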

2 Task Definition

The goal is to detect changes in meaning:
Hosein Azarbonyad, Mostafa Dehghani, Kaspar Beelen, Alexandra Arkut, Maarten Marx, and Jaap Kamps. 2017. Words Are Malleable: Computing Semantic Shifts in Political and Media Discourse. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM '17, pages 1509–1518.
Marco Del Tredici, Raquel Fernández, and Gemma Boleda. 2019. Short-Term Meaning Shift: A Distributional Exploration. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2069–2075, Minneapolis, Minnesota. Association for Computational Linguistics.
More precisely, the target is usage change rather than meaning change.

3 Stability

Recent work questions the stability of word embeddings, especially for small corpora. Even the lists of nearest neighbors of a word embedding suffer from stability problems:
Laura Wendlandt, Jonathan K. Kummerfeld, and Rada Mihalcea. 2018. Factors Influencing the Surprising Instability of Word Embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2092–2102, New Orleans, Louisiana. Association for Computational Linguistics.

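The paper proposes a metric for measuring the stability of a change-detection algorithm. Given two corpora as input, such an algorithm returns a ranking of words from most likely to least likely to have changed. The intersection@k metric measures whether the top-k words of two rankings are the same (Eq. 1): it simply returns the overlap between the two lists (as k grows, intersection@k also grows).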
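A minimal sketch of intersection@k, assuming the metric is the number of words shared by the two top-k lists divided by k (my reading of Eq. 1). The function name and the toy rankings are hypothetical and not taken from the paper or its code.

```python
def intersection_at_k(ranking_a, ranking_b, k):
    """Fraction of words shared by the top-k prefixes of two rankings."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


# Two made-up rankings from the same detector on two runs,
# ordered from most to least likely to have changed.
run_1 = ["alpha", "beta", "gamma", "delta", "epsilon"]
run_2 = ["beta", "alpha", "epsilon", "zeta", "gamma"]

print(intersection_at_k(run_1, run_2, 2))  # 1.0: the top-2 sets coincide
print(intersection_at_k(run_1, run_2, 5))  # 0.8: four of five words shared
```

Order inside the top-k prefix is ignored on purpose, so two runs that agree on which words are in the top k but shuffle them still count as fully stable.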

4 The predominant approach

The best-known method for detecting usage change:
William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016b. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501, Berlin, Germany. Association for Computational Linguistics.
→ Train word embeddings on each of the two corpora, align the spaces, and rank words by cosine distance (called AlignCos: this amounts to finding a linear transformation Q that projects X onto Y). See the bottom of p. 540.
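A rough sketch of the align-then-compare idea, assuming the linear map Q is constrained to be orthogonal and obtained by solving the orthogonal Procrustes problem with an SVD. That is a common choice, but these notes only state that some linear Q projecting X onto Y is found. The toy matrices, vocabulary, and function names are hypothetical.

```python
import numpy as np


def procrustes_align(X, Y):
    """Return the orthogonal Q minimising ||XQ - Y||_F (rows are words)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def cosine_distances(A, B):
    """Row-wise cosine distance between two equally shaped matrices."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - np.sum(A * B, axis=1)


rng = np.random.default_rng(0)
vocab = ["w0", "w1", "w2", "w3", "w4", "w5"]
X = rng.normal(size=(len(vocab), 4))  # toy embeddings from corpus 1

# Corpus 2: the same space under a random rotation R, except "w2" moved.
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ R
Y[2] = rng.normal(size=4)

Q = procrustes_align(X, Y)
dist = cosine_distances(X @ Q, Y)

# Rank words from largest to smallest distance after alignment.
# "w2" is expected near the top, but toy data gives no guarantee.
ranking = [vocab[i] for i in np.argsort(-dist)]
print(ranking)
```

Note that Q is fitted on all rows, including the word that moved, so the changed word itself pulls on the alignment.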

→ Recent work builds on this method:
Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. 2018. Dynamic Word Embeddings for Evolving Semantic Discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, pages 673–681. ACM.
Maja Rudolph and David Blei. 2018. Dynamic Embeddings for Language Evolution. In Proceedings of the 2018 World Wide Web Conference, WWW '18, pages 1003–1011.

Aligning embedding spaces to find corresponding pairs is also used in other areas, for example to match words across languages:
Yova Kementchedjhieva, Sebastian Ruder, Ryan Cotterell, and Anders Søgaard. 2018. Generalizing Procrustes Analysis for Better Bilingual Dictionary Induction. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 211–220, Brussels, Belgium. Association for Computational Linguistics.

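4.1 Shortcomings of alignment approach

・Self-contradicting objective: the alignment also pulls together words whose usage has changed
・Requires non-trivial filtering to work well: vocabulary filtering is necessary (because proper nouns and the like are hard to map)
・Not stable across runs: the results are likely to vary from run to run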

5 Nearest Neighbors as a Proxy for Meaning

Compare the sets of the K nearest neighbors of a word in the two spaces and compute Eq. 2. K is set to 1000.
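A minimal sketch of the neighbor-overlap score, assuming Eq. 2 is the negated size of the intersection of the two top-k neighbor sets, and that neighbors are ranked by cosine similarity within each space separately (so no alignment is needed). The toy vocabulary, the 2-D vectors, and the helper names are hypothetical. The notes use K = 1000, while this toy example uses k = 2.

```python
import numpy as np


def top_k_neighbors(word, vocab, vectors, k):
    """Set of the k words most cosine-similar to `word`, excluding itself."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)
    neighbors = [vocab[i] for i in order if vocab[i] != word]
    return set(neighbors[:k])


def usage_change_score(word, vocab, vecs_1, vecs_2, k):
    """Negated neighbor overlap: a higher score suggests more change."""
    nn_1 = top_k_neighbors(word, vocab, vecs_1, k)
    nn_2 = top_k_neighbors(word, vocab, vecs_2, k)
    return -len(nn_1 & nn_2)


vocab = ["bank", "money", "loan", "river", "shore"]
# Toy space 1: "bank" sits with the finance words.
space_1 = np.array([[1.0, 0.1], [1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 1.0]])
# Toy space 2: "bank" sits with the water words.
space_2 = np.array([[0.1, 1.0], [1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.2, 0.9]])

scores = {w: usage_change_score(w, vocab, space_1, space_2, k=2) for w in vocab}
print(scores)  # "bank" scores 0 (no shared neighbors); the others score -1
```

Because the score is just a set intersection, the two neighbor lists themselves can be printed side by side to explain why a word was flagged.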

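limitation:
・Depends heavily on the quality of the embeddings
・Does not guarantee that usage has actually changed: it only extracts candidates
・The results ultimately have to be interpreted by hand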

6 Among existing methods, none has achieved higher accuracy than AlignCos:
Dominik Schlechtweg, Anna Hätty, Marco Del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732–746, Florence, Italy. Association for Computational Linguistics.

Table 1: corpus statistics

The analysis uses splits based on three demographic attributes (age, gender, occupation), on day of the week, and on time period.

・Author Demographics: the corpus is annotated with labels such as age, gender, and occupation
Matti Wiegmann, Benno Stein, and Martin Potthast. 2019. Celebrity Profiling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2611–2618, Florence, Italy. Association for Computational Linguistics.

・Day-of-week: 580 million tweets in English from June 2009 to February 2010
Jaewon Yang and Jure Leskovec. 2011. Patterns of Temporal Variation in Online Media. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pages 177–186, New York, NY, USA. ACM.

・French Diachronic / Hebrew Diachronic: French and Hebrew datasets covering 2014 to 2018

・English Diachronic: a diachronic English corpus, provided with the Hamilton et al. paper

6.1 Implementation details

Experiments use 300-dimensional word2vec vectors with a context window of 4 words.
Vocabulary and filtering: low-frequency words are filtered out (20% of the words are removed).

7 Results

7.1 Qualitative Evaluation: Detected Words

Table 2: the top 10 words detected for the split by age

7.2 Quantitative Evaluation: Stability

intersection@k is also significantly better than with the other method.

7.3 Quantitative Evaluation: DURel and SURel datasets

Uses the DURel and SURel datasets (which come with human annotations). Evaluation is by Spearman correlation and discounted cumulative gain (DCG). Results: Table 3.
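A small sketch of the two evaluation measures, under stated assumptions: Spearman correlation is computed with the closed form for untied ranks, and DCG uses the common log2 position discount. The exact DCG variant used in the paper is not recorded in these notes. The gold and predicted scores below are made up for illustration.

```python
import math


def ranks(values):
    """Rank positions (1 = largest value); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(xs, ys):
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def dcg(gold, predicted):
    """Sum of gold scores, visited in predicted order, discounted by log2."""
    order = sorted(range(len(gold)), key=lambda i: -predicted[i])
    return sum(gold[i] / math.log2(pos + 1) for pos, i in enumerate(order, start=1))


gold = [0.9, 0.1, 0.5, 0.3]       # hypothetical human change ratings
predicted = [0.8, 0.2, 0.3, 0.6]  # hypothetical detector scores

print(round(spearman(gold, predicted), 3))  # 0.8
print(round(dcg(gold, predicted), 3))       # 1.382
```

Spearman weighs every position equally, whereas DCG rewards getting the most-changed words near the top, which is closer to how a ranked candidate list is actually read.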

7.4 Interpretation and Visualization

Visualized with t-SNE (Fig. 2, Fig. 3).

8 Related works

Uses neighboring words to determine the stability of a word:
Hosein Azarbonyad, Mostafa Dehghani, Kaspar Beelen, Alexandra Arkut, Maarten Marx, and Jaap Kamps. 2017. Words Are Malleable: Computing Semantic Shifts in Political and Media Discourse. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM '17, pages 1509–1518.
→ Requires computation over the entire vocabulary

Captures cultural change based on changes in the similarity between a target word and its neighboring words:
William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016a. Cultural Shift or Linguistic Drift? Comparing Two Computational Measures of Semantic Change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2116–2121, Austin, Texas. Association for Computational Linguistics.

Uses embeddings obtained from BERT to capture diachronic and usage change:
Mario Giulianelli. 2019. Lexical semantic change analysis with contextualised word representations. Master's thesis, Institute for Logic, Language and Computation, University of Amsterdam, July.
Matej Martinc, Petra Kralj Novak, and Senja Pollak. 2019. Leveraging contextual embeddings for detecting diachronic semantic shift. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Uses the Oxford dictionary together with a BERT model:
Renfen Hu, Shen Li, and Shichen Liang. 2019. Diachronic sense modeling with deep contextualized word embeddings: An ecological view. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3899–3908, Florence, Italy. Association for Computational Linguistics.
