luyug / Condenser

EMNLP 2021 - Pre-training architectures for dense retrieval
Apache License 2.0

Whole word masking for RoBERTa #8

Closed by eugene-yang 2 years ago

eugene-yang commented 2 years ago

Can you elaborate on why the first token is appended as a bare integer instead of as `[i]` in line 65? If the first word is split into multiple tokens by BPE, this results in an uncaught exception on the following token.

https://github.com/luyug/Condenser/blob/de9c2577a16f16504a661039e1124c27002f81a8/data.py#L61
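For context, here is a minimal sketch (not the repo's exact code) of the whole-word-masking grouping logic under discussion. Token strings are placeholders; with a RoBERTa BPE tokenizer, a token that does not start a new word lacks the leading-space marker `Ġ`. Appending the first index as a bare integer instead of a one-element list breaks the later `cand_indexes[-1].append(i)` call:

```python
def build_word_groups(tokens):
    """Group sub-token indexes into whole words for whole word masking."""
    cand_indexes = []
    for i, token in enumerate(tokens):
        if cand_indexes and not token.startswith("Ġ"):
            # Continuation sub-token: extend the current word's group.
            # This fails with AttributeError if the group is a bare int.
            cand_indexes[-1].append(i)
        else:
            # Start a new group as a list; appending `i` instead of `[i]`
            # is the bug being reported.
            cand_indexes.append([i])
    return cand_indexes

print(build_word_groups(["un", "believ", "able", "Ġstory"]))
# [[0, 1, 2], [3]]
```

If the first sequence token were appended as `i` rather than `[i]`, the very next continuation token ("believ" here) would hit `int.append` and raise.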

luyug commented 2 years ago

What exception are you seeing?

eugene-yang commented 2 years ago

It raises an `AttributeError` complaining that an integer has no `.append` method.

luyug commented 2 years ago

Right, that probably needs to be fixed.

luyug commented 2 years ago

Fixed at TOT (top of tree).