jwkanggist / SSL-narratives-NLP-1

Reading self-supervised learning in NLP in reverse

[2주차] Latent Retrieval for Weakly Supervised Open Domain Question Answering #3

SeongkukCho opened this issue 2 years ago

SeongkukCho commented 2 years ago

Keywords

Pre-training for Retriever

TL;DR

After pre-training the retriever, the retriever and reader are jointly fine-tuned end-to-end on question-answer pairs from the QA dataset.

Abstract

Recent work on open domain question answering (QA) assumes strong supervision of the supporting evidence and/or assumes a blackbox information retrieval (IR) system to retrieve evidence candidates. We argue that both are suboptimal, since gold evidence is not always available, and QA is fundamentally different from IR. We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system. In this setting, evidence retrieval from all of Wikipedia is treated as a latent variable. Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets. On datasets where the questioner already knows the answer, a traditional IR system such as BM25 is sufficient. On datasets where a user is genuinely seeking an answer, we show that learned retrieval is crucial, outperforming BM25 by up to 19 points in exact match.
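Since the paper treats evidence retrieval as a latent variable, the training objective marginalizes the reader's answer likelihood over the retrieved blocks. Below is a minimal sketch (not the ORQA implementation, which is TensorFlow) of that idea: a dual encoder scores blocks by inner product, only the top-k blocks are read, and the loss is the negative marginal log-likelihood. `reader_fn` is a hypothetical stand-in for the BERT reader.

```python
import torch
import torch.nn.functional as F

def orqa_style_loss(q_emb, block_embs, reader_fn, k=5):
    """q_emb: [d] question embedding; block_embs: [N, d] evidence block embeddings.
    reader_fn(indices) -> [k] tensor of log P(answer | question, block) for each
    retrieved block (hypothetical stand-in for the reader model)."""
    scores = q_emb @ block_embs.T                    # S_retr(b, q): inner products, [N]
    topk = torch.topk(scores, k)                     # only the top-k blocks are read
    log_p_retr = F.log_softmax(topk.values, dim=-1)  # log P(b | q), restricted to top-k
    log_p_read = reader_fn(topk.indices)             # log P(answer | q, b), [k]
    # Marginal log-likelihood: log sum_b P(b | q) P(answer | q, b); loss is its negative.
    return -torch.logsumexp(log_p_retr + log_p_read, dim=-1)
```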

Paper link

https://arxiv.org/abs/1906.00300

Presentation link

https://docs.google.com/presentation/d/1ZoOwYp_qWSZz7W8X6nLyQvA1ON7dLnSOAVdn58ZU9h0/edit#slide=id.p5

Video link

https://youtu.be/MypoV0xAn18

SeongkukCho commented 2 years ago

Issue 1. ICT masking and the MLM of BERT: in image contrastive learning, batch size is very important!! (see the sketch after the code link below)

1) Youngsoo's opinion

2) Jaewook's opinion

3) Yukyung's opinion

ICT model code: https://github.com/google-research/language/blob/master/language/orqa/models/ict_model.py
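On Issue 1: ICT pre-training is a contrastive task, so the batch size matters because the other contexts in the batch serve as negatives. A minimal sketch (assuming simple in-batch negatives, not the exact sampled-softmax setup in the linked TensorFlow code): a sentence removed from its passage is the pseudo-query, the surrounding context is its positive, and every other context in the batch is a negative, so a larger batch yields more negatives per example.

```python
import torch
import torch.nn.functional as F

def ict_loss(query_embs, context_embs):
    """query_embs:   [B, d] embeddings of sentences removed from their passages (pseudo-queries).
    context_embs: [B, d] embeddings of the surrounding contexts (pseudo-evidence).
    Each query's positive is its own context; the other B-1 contexts in the batch
    act as negatives, so increasing the batch size increases the number of negatives."""
    logits = query_embs @ context_embs.T      # [B, B] similarity matrix
    labels = torch.arange(logits.size(0))     # diagonal entries are the positives
    return F.cross_entropy(logits, labels)
```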

jwkanggist commented 2 years ago

Thank you for the notes, Seongkuk! @SeongkukCHO