Conference : NAACL 2021
Link : https://arxiv.org/abs/2103.13136
Authors' Affiliation : University of Southern California
TL;DR : There is no agreed-upon holistic method yet. Among string-based methods, what has been found to work relatively well is:

scientific notation > decimal notation
character level > subword level

Summary :

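1. Introduction

The 11 in "I woke up at 11" and the 11 in "I earned 11 dollars" are very different. Replacing 11 with 10 is fine, but replacing it with 25 is not.

Understanding quantities and counts is important for understanding the world.

Our ancestors developed numeracy independently of the development of language.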

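In NLP, however, numbers are either removed entirely during preprocessing, treated the same as words, or collapsed into UNK.

Schemes such as WordPiece do not drop them, but split them into arbitrary tokens.

Recent papers have shown these to be suboptimal number representations.

BERT performs five times worse when the answer is a number than when it is a span of text.

Simply switching from subword to char-level tokenization, or from decimal to scientific notation, improves performance.

This paper provides a taxonomy of numeracy tasks (Section 2) and of number representations (Section 3).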

2. Tasks

Tasks are categorized along two dimensions: granularity and units.

Granularity = whether the encoding of the number is exact or approximate (a bird has 2 legs vs. John is about 180 cm tall).
Units = whether the number is abstract (2+3 = 5) or grounded (2 apples + 3 apples = 5 apples). Roughly, whether it carries a unit, it seems.

Simple Arithmetic = things like 1+1=2. Easy to build synthetic datasets for.
Numeration (or Decoding) = turning the string form into its numeric value, e.g. 19 -> $19.0$. In NLP this is done by fitting a linear regressor on the representation of the string.
Magnitude Comparison = deciding which of two or more numbers is larger. Given 23 and 32, the model has to pick label 1.
Arithmetic Word Problems (AWP) = e.g. "I had 2 cookies and gave one away. How many are left?"
Exact Facts = commonsense knowledge, e.g. a die has 6 faces.
Measurement Estimation = a task used in psychology; questions like roughly how many seeds a watermelon has.
Numerical Language Modeling = a setup rather than a task, along the lines of 2+3=[mask] or "a lion weighs [mask] kg". Unlike an ordinary word LM, it is evaluated with a regression loss rather than accuracy or perplexity.
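
To make two of the task definitions above concrete, here is a minimal sketch. The function names are hypothetical, and the log-scale absolute error is just one possible regression loss chosen for illustration; the notes do not say which loss is used.

```python
import math


def magnitude_comparison_label(numbers):
    """Gold label for magnitude comparison: the index of the largest number."""
    return max(range(len(numbers)), key=lambda i: numbers[i])


def log_mae(predictions, golds):
    """Mean absolute error between log10 values (assumes positive inputs).

    Illustrative choice of regression loss, not necessarily the paper's metric.
    """
    errors = [abs(math.log10(p) - math.log10(g)) for p, g in zip(predictions, golds)]
    return sum(errors) / len(errors)


# Given 23 and 32, the larger number sits at index 1.
print(magnitude_comparison_label([23, 32]))

# A prediction that is off by two orders of magnitude costs 2.0 in log space,
# whereas accuracy would only report "wrong".
print(log_mae([10.0], [1000.0]))
```

A regression-style metric like this gives partial credit for being close, which exact-match accuracy cannot do for numeric outputs.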

3. Methods

string-based vs real-based

Real-based methods compute the representation from the number's value.

String-based methods treat a number as its surface form: it is assigned arbitrary token IDs, and the embeddings have to be looked up.

3.1 Taxonomy

3.1.1 String-based

By default, LMs treat numbers as strings (exactly like words).

Notation : the number 80 can be written in Arabic numerals, Roman numerals, scientific notation, as the English word eighty, in the base-20 French form, and in many other ways.
Tokenization : word-level tokenization is ineffective, since most numbers end up as UNK. The alternatives are subword tokenization and character-level tokenization.
Pooling : when a single number is split into several tokens, their representations can be pooled. Other work argues for using an RNN or CNN instead of pooling.
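
The three string-based choices above (notation, tokenization, pooling) can be chained together as in the sketch below. Everything here is an assumption for illustration: the function names are hypothetical, and `toy_embedding` is a deterministic stand-in for a learned embedding table, not part of any model in the paper.

```python
def to_scientific(value, mantissa_digits=2):
    """Notation: write a number in scientific notation, e.g. 80 -> '8.00e+01'."""
    return f"{value:.{mantissa_digits}e}"


def char_tokenize(text):
    """Tokenization: character level, every character is its own token."""
    return list(text)


def toy_embedding(token, dim=4):
    """Hypothetical stand-in for an embedding lookup (not a trained table)."""
    return [((ord(token) * (k + 1)) % 10) / 10.0 for k in range(dim)]


def mean_pool(vectors):
    """Pooling: average the token vectors into one vector for the whole number."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


tokens = char_tokenize(to_scientific(80))
number_vector = mean_pool([toy_embedding(t) for t in tokens])
print(tokens)
print(number_vector)
```

Mean pooling is only the simplest option; the notes mention that some work replaces this step with an RNN or CNN over the digit sequence.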

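3.1.2 Real-based

A number encoder can be written as $f: \mathbb{R} \rightarrow \mathbb{R}^d$. The decoder $g$ goes the other way, $g: \mathbb{R}^d \rightarrow \mathbb{R}$.

Direction : encoder-only and decoder-only methods also exist.
Scale : instead of a linear scale, some methods encode on a log scale, inspired by cognitive science. Other options include a stabilized log scale and a learned scale/flow.
Discretization : continuous or discrete. Learning a continuous function over a wide range of numbers is practically infeasible, so the numbers are binned first. The bins can be on a linear scale or a log scale. Lookup embeddings for the bins are then learned with regular cross entropy or dense cross entropy.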
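
As a rough sketch of the scale and discretization choices above, the snippet below bins positive numbers on a log10 scale and decodes a bin back to a representative value. The exponent range, the bin width, and the function names are all assumptions made for this example.

```python
import math


def log_bin_index(value, min_exp=-2, max_exp=6, bins_per_decade=1):
    """Map a positive number to a bin index on a log10 scale, clamped to the range."""
    position = (math.log10(value) - min_exp) * bins_per_decade
    num_bins = (max_exp - min_exp) * bins_per_decade
    return min(max(math.floor(position), 0), num_bins - 1)


def bin_center(index, min_exp=-2, bins_per_decade=1):
    """Decode a bin index to the geometric centre of that bin."""
    return 10 ** (min_exp + (index + 0.5) / bins_per_decade)


# 19 and 80 share the same order of magnitude, so they fall into the same bin.
print(log_bin_index(19), log_bin_index(80))

# Values outside the assumed range are clamped to the first or last bin.
print(log_bin_index(1e9))

# Decoding returns one representative value per bin, so precision is lost.
print(bin_center(log_bin_index(80)))
```

The bin index would then be used to look up an embedding (encoder side) or serve as a class label (decoder side); raising `bins_per_decade` trades a larger vocabulary of bins for finer precision.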

3.2 Survey of existing methods

3.2.1 String-based

GenBERT
NumBERT
DigitRNN, DigitCNN
DigitRNN-sci & Exponent (Embedding)

3.2.2 Real-based

DICE = the cosine similarity between the embeddings of scalars i and j gets smaller as the Euclidean distance between them grows
Value Embedding = ??
Log Value
Log Laplace
Flow Laplace
Multi-Class Classification
Discrete Latent Exponent (DExp)
GMM
GMM-prototype
SOM-prototype
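
The property noted for DICE in the list above can be illustrated with a simplified two-dimensional sketch: place each number on a half circle, with the angle proportional to its value. This is only a toy illustration of the property under an assumed value range `[low, high]`, not the DICE construction itself, and the function names are hypothetical.

```python
import math


def angle_embedding(x, low=0.0, high=100.0):
    """Embed x in [low, high] as a point on the unit half circle."""
    theta = (x - low) / (high - low) * math.pi
    return (math.cos(theta), math.sin(theta))


def cosine_similarity(u, v):
    """Standard cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


anchor = angle_embedding(10)
for other in (20, 50, 90):
    # The further `other` is from 10, the lower the cosine similarity.
    print(other, cosine_similarity(anchor, angle_embedding(other)))
```

Because the angle between two embeddings is proportional to |i - j|, the cosine similarity falls monotonically as the two scalars move apart within the assumed range.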

4 Results

Abstract Probes
- word embedding > random embedding baseline
- DICE, Value, and Log Value embeddings do well.
- DigitCNN does best, and character-tokenized models beat subword models.
Arithmetic
- GPT-3 does well zero-shot (when the number of digits is small).
- The limited extrapolation may be a problem with the tokenization scheme? Tokenizing at the digit/character level improves it.
MLM
- When pretrained on a dataset where numbers are written in scientific notation, BERT converges to the same loss value as when trained with MLM?
- For causal LM, GMM is the best.
- For MLM, modeling the mantissa as well may be overkill?
Measurement Estimation

5 Recommendations

Rule of thumb for string-based methods
- scientific notation > decimal notation
- This lets the model focus on the exponent more than on the mantissa.
- character level > subword level
Rule of thumb for real-based methods
- log scale > linear scale
- binning > continuous value prediction (== dense cross entropy > MAE loss) where ground-truth distributions are available
- Modeling continuous predictions over a large range is extremely hard. But such a distribution can be binned instead (by fixing a precision level).
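
As a sketch of what "dense cross entropy" could look like next to a regression loss: here it is read as cross entropy against a full ground-truth distribution over bins rather than a single one-hot label. That reading, the bin layout, and the function names are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_cross_entropy(logits, target_distribution):
    """Cross entropy between predicted bin probabilities and a soft target."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_distribution, probs) if t > 0)


# Hypothetical ground truth: the quantity falls in bin 1 or bin 2 equally often.
target = [0.0, 0.5, 0.5, 0.0]

# Logits that put mass on bins 1 and 2 are penalised less than logits that
# put mass on bins 0 and 3.
print(dense_cross_entropy([0.0, 2.0, 2.0, 0.0], target))
print(dense_cross_entropy([2.0, 0.0, 0.0, 2.0], target))
```

With a one-hot target this reduces to regular cross entropy, which is why it only helps where a ground-truth distribution is actually available.
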
Encoding vs Decoding numbers?
Can we mix-and-match multiple methods?
- Not really, so far
Which methods for which tasks?
- Seems to differ from method to method

6 Vision for Unified Numeracy in NLP

Representing Numbers in NLP: a survey and a vision #2