NER performance with Ontonotes and number-related ELMo embeddings

It turns out that my batch size with ELMo was simply too small, so the less frequent classes had too few labels per batch to learn from (e.g. a single label per batch, the usual thing to avoid!).

The batch size and the number of multiprocessing/parallel workers had been adapted to ELMo, to keep the memory usage under 11 GB (for training with a GTX 1080Ti). For something more generic, it might be necessary to review how the batches are created to ensure that rare classes are well represented, for instance with automatic over-sampling techniques.
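To make the over-sampling idea a bit more concrete, here is a minimal sketch. It is not DeLFT's actual batching code: the function names (`entity_classes`, `oversample_indices`), the `min_sentences_per_class` parameter and the assumption of BIO-style tags are all hypothetical and only there for illustration. It repeats sentences that contain rare classes until each class appears in a minimum number of sampled sentences, before the indices are shuffled and cut into batches.

```python
import random
from collections import Counter


def entity_classes(tags):
    """Return the set of entity classes in a BIO-tagged sentence, e.g. {'PERSON'}."""
    return {tag.split("-", 1)[1] for tag in tags if tag != "O" and "-" in tag}


def oversample_indices(label_sequences, min_sentences_per_class, seed=0):
    """Return shuffled sentence indices, repeating sentences with rare classes.

    Hypothetical helper: every class that occurs at least once ends up in at
    least `min_sentences_per_class` of the sampled sentences.
    """
    rng = random.Random(seed)

    # Map each entity class to the sentences that contain it.
    by_class = {}
    for idx, tags in enumerate(label_sequences):
        for cls in entity_classes(tags):
            by_class.setdefault(cls, []).append(idx)

    indices = list(range(len(label_sequences)))
    counts = Counter({cls: len(idxs) for cls, idxs in by_class.items()})

    for cls in sorted(by_class):  # sorted only to keep the result reproducible
        while counts[cls] < min_sentences_per_class:
            extra = rng.choice(by_class[cls])
            indices.append(extra)
            # The repeated sentence may carry other classes as well.
            for other in entity_classes(label_sequences[extra]):
                counts[other] += 1

    rng.shuffle(indices)
    return indices


# Toy example: LAW is rare, PERSON is frequent.
toy_labels = [
    ["B-PERSON", "O"],
    ["O", "B-LAW", "I-LAW"],
    ["B-PERSON", "I-PERSON"],
    ["O"],
    ["B-PERSON"],
]
sampled = oversample_indices(toy_labels, min_sentences_per_class=3)
```

This only balances the training set as a whole. It does not guarantee that every single batch contains each rare class, so a larger batch size would still help.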

However, for the time being, simply increasing the batch size seems to be enough for Ontonotes to reach an f-score > 88.0, as expected.

kermitt2 / delft
