trifonov-vl opened this issue 5 years ago
This can improve results, especially on the NER task.
@vladosdudos I recently had a related idea: mask all tokens by character type. E.g., "Proton-18M" becomes "Aa-0A", "recently" becomes "a", and "Before" becomes "Aa". We could then train embeddings for these masks along with the word embeddings themselves.
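For reference, a minimal sketch of such a masking function (the name `char_type_mask` is my own, not from this repo): it maps each character to a type symbol and collapses repeated runs, which reproduces the examples above.

```python
import re

def char_type_mask(token: str) -> str:
    """Map each character to a type symbol and collapse repeated runs.

    'A' = uppercase, 'a' = lowercase, '0' = digit;
    any other character is kept as-is.
    """
    def char_type(c: str) -> str:
        if c.isupper():
            return "A"
        if c.islower():
            return "a"
        if c.isdigit():
            return "0"
        return c

    mask = "".join(char_type(c) for c in token)
    # Collapse runs of the same symbol: "Aaaaaa-00A" -> "Aa-0A"
    return re.sub(r"(.)\1+", r"\1", mask)

assert char_type_mask("Proton-18M") == "Aa-0A"
assert char_type_mask("recently") == "a"
assert char_type_mask("Before") == "Aa"
```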
@i-a-andrianov This information could be encoded with a char-CNN: we can add a character-type feature (capitalized, numeric, non-alphanumeric) to each char embedding, or use it on its own, without character embeddings, as a separate token encoder.
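As a rough illustration of the first option (a sketch with hypothetical module and parameter names, not code from this project), a per-character type embedding could be concatenated with the char embedding before the convolution:

```python
import torch
import torch.nn as nn

class CharCNNWithTypes(nn.Module):
    """Char-CNN token encoder whose per-character input is the
    concatenation of a character embedding and a character-type
    embedding (e.g. capitalized / lowercase / numeric / other)."""

    def __init__(self, n_chars, n_char_types=4, char_dim=30,
                 type_dim=5, n_filters=30, kernel_size=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.type_emb = nn.Embedding(n_char_types, type_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim + type_dim, n_filters,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, char_ids, type_ids):
        # char_ids, type_ids: (batch, max_token_len)
        x = torch.cat([self.char_emb(char_ids),
                       self.type_emb(type_ids)], dim=-1)
        x = self.conv(x.transpose(1, 2))          # (batch, filters, len)
        return torch.relu(x).max(dim=-1).values   # max-pool over characters
```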
@vladosdudos I agree on the char-CNN point. My point is that such masks are easy to implement and could work better than a char-CNN on small datasets, as they are less sparse and more global (they see beyond the kernel size).
I've tested the implemented features on CoNLL-03 and OntoNotes. Quality did not increase on CoNLL-03 and increased slightly on OntoNotes (with the dev set). Here are the reports: conll_experiments_report.json.txt ontonotes_experiments_report.json.txt
We can implement capitalization token features in our common context encoder. I propose the following categorical features:
- allUpper
- allLower
- upperFirst
- upperNotFirst
- numeric
- noAlphaNum
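A minimal sketch of how these categories could be assigned to a token (the function name and the exact precedence of the checks are my assumptions, not the project's code):

```python
def cap_feature(token: str) -> str:
    """Assign a token to one of the proposed capitalization categories."""
    if token.isdigit():
        return "numeric"
    if not any(c.isalnum() for c in token):
        return "noAlphaNum"
    if token.isupper():
        return "allUpper"
    if token.islower():
        return "allLower"
    if token[0].isupper():
        return "upperFirst"
    if any(c.isupper() for c in token[1:]):
        return "upperNotFirst"
    return "allLower"  # fallback for mixed tokens with no uppercase letters

assert cap_feature("NASA") == "allUpper"
assert cap_feature("recently") == "allLower"
assert cap_feature("Before") == "upperFirst"
assert cap_feature("iPhone") == "upperNotFirst"
assert cap_feature("2021") == "numeric"
assert cap_feature("--") == "noAlphaNum"
```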