ispras-texterra / derek

DEREK (Domain Entities and Relations Extraction Kit)
GNU General Public License v3.0

Employ capitalization features #23

Open trifonov-vl opened 5 years ago

trifonov-vl commented 5 years ago

We can implement capitalization token features in our common context encoder. I propose the following categorical features: allUpper, allLower, upperFirst, upperNotFirst, numeric, noAlphaNum. A rough sketch of how a token could be mapped to one of these categories is shown below.
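
A minimal sketch of the categorizer (the function name is just illustrative, not part of DEREK):

```python
def capitalization_feature(token: str) -> str:
    """Map a token to one of the proposed categorical capitalization features."""
    if not any(ch.isalnum() for ch in token):
        return "noAlphaNum"
    if token.isnumeric():
        return "numeric"
    if token.isupper():
        return "allUpper"
    if token.islower():
        return "allLower"
    if token[0].isupper():
        return "upperFirst"          # e.g. "Before"
    return "upperNotFirst"           # e.g. "iPhone"
```

The resulting category could then be embedded and concatenated to the token representation in the context encoder.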

trifonov-vl commented 5 years ago

This could improve results, especially on the NER task.

i-a-andrianov commented 5 years ago

@vladosdudos I recently had a related idea: mask all tokens by character type. E.g., "Proton-18M" becomes "Aa-0A", while "recently" becomes "a" and "Before" becomes "Aa". We could then train embeddings for these masks along with the word embeddings themselves.
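
Something along these lines, assuming runs of the same character type are collapsed (a rough sketch, the name is illustrative):

```python
import re

def char_type_mask(token: str) -> str:
    """Mask a token by character type: uppercase -> 'A', lowercase -> 'a',
    digits -> '0', everything else kept as-is; repeated mask characters
    are collapsed, so "Proton-18M" -> "Aa-0A"."""
    masked = []
    for ch in token:
        if ch.isupper():
            masked.append("A")
        elif ch.islower():
            masked.append("a")
        elif ch.isdigit():
            masked.append("0")
        else:
            masked.append(ch)
    return re.sub(r"(.)\1+", r"\1", "".join(masked))
```

Each distinct mask would get its own embedding, trained jointly with the word embeddings.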

trifonov-vl commented 5 years ago

@i-a-andrianov This information could be encoded with a char-CNN: we could add a character-type feature (capitalized, numeric, non-alphanumeric) to the char embedding, or use it without character embeddings as a separate token encoder. A sketch of the first option is below.
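
A rough sketch of such an encoder, written in PyTorch for illustration (DEREK's actual encoder and hyperparameters may differ):

```python
import torch
import torch.nn as nn

# Assumed character types: 0 = capitalized, 1 = lowercase, 2 = numeric, 3 = non-alphanumeric
N_CHAR_TYPES = 4

class CharCNNWithTypes(nn.Module):
    """Char-CNN token encoder that concatenates a character-type embedding
    with the character embedding before convolution (a sketch, not DEREK's code)."""

    def __init__(self, char_vocab_size, char_emb_size=30, type_emb_size=5,
                 n_filters=50, kernel_size=3):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_emb_size)
        self.type_emb = nn.Embedding(N_CHAR_TYPES, type_emb_size)
        self.conv = nn.Conv1d(char_emb_size + type_emb_size, n_filters,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, char_ids, char_type_ids):
        # char_ids, char_type_ids: (n_tokens, max_token_len)
        x = torch.cat([self.char_emb(char_ids), self.type_emb(char_type_ids)], dim=-1)
        x = self.conv(x.transpose(1, 2))        # (n_tokens, n_filters, max_token_len)
        return torch.max(x, dim=-1).values      # max-pool over characters
```

Dropping `self.char_emb` and keeping only the type embedding would give the "separate token encoder" variant.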

i-a-andrianov commented 5 years ago

@vladosdudos I agree with the char-CNN point. My point is that such masks are easy to implement and could work better than a char-CNN on small datasets, as they are less sparse and more global (they see beyond the kernel size).

trifonov-vl commented 4 years ago

I've tested the implemented features on CoNLL-03 and OntoNotes. Quality did not increase on CoNLL-03 and increased slightly on OntoNotes with the dev set. Here are the reports: conll_experiments_report.json.txt, ontonotes_experiments_report.json.txt