
Another NLP tool #188

Closed — amir-zeldes closed this issue 5 years ago

amir-zeldes commented 5 years ago

@NirantK I thought I would check before submitting another PR – would the following NLP tool fit the list?

https://github.com/amir-zeldes/RFTokenizer

It is a trainable subword tokenizer for morphologically rich languages, such as Afro-Asiatic languages. It comes with pre-trained models for Arabic, Hebrew and Coptic, supports Python 2 and 3, and is installable from PyPI. Current performance is SOA on this task, at least for Hebrew and Coptic (not sure about Arabic, since different papers seem to use different targets and metrics).
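
For reference, basic usage from Python looks roughly like this (a sketch only; the model name and method names follow my reading of the README, so please check the repo for the exact API):

```python
# Rough usage sketch; model and method names may not be exact,
# see the RFTokenizer README for the current API.
from rftokenizer import RFTokenizer

tokenizer = RFTokenizer(model="heb")  # pre-trained Hebrew model
tokenizer.load()                      # load the model file
# input: whitespace-tokenized word forms; output: pipe-delimited subword segments
print(tokenizer.rf_tokenize(["ולדעתי"]))
```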

NirantK commented 5 years ago

Sure. We could add it to the language-specific tools for Arabic to begin with.

I'd love to see a comparison/benchmark against more popular tokenizers like Moses or SentencePiece/WordPiece on your chosen/standard datasets.
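
Something along these lines is what I have in mind, just as a sketch (assuming the standard sentencepiece Python bindings; the corpus path and vocab size are placeholders):

```python
# Train an unsupervised subword model on raw text and segment a word form,
# to compare its pieces against RFTokenizer's supervised segmentation.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="hebrew_raw.txt",   # placeholder path to raw training text
    model_prefix="heb_unsup",
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="heb_unsup.model")
print(sp.encode("ולדעתי", out_type=str))  # unsupervised pieces, not gold morphemes
```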

If it is indeed SoA, should also add to http://nlpprogress.com/

amir-zeldes commented 5 years ago

Great, I'll do that then!

About comparison with other tools: I was under the impression that Moses just does word-form tokenization, not division into morphological subword tokens, but if there is a way to do subtoken segmentation with it, please let me know.
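
Just to illustrate what I mean (using sacremoses as a stand-in for the Perl Moses tokenizer; treat this as a sketch): Moses-style tokenization splits off punctuation but leaves the word forms themselves intact:

```python
# Moses-style tokenization separates punctuation from word forms,
# but does not segment a word form into morphological subword tokens.
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")
print(mt.tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!'] - no subword splits
```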

My understanding of WordPiece is that it finds subword segments in an unsupervised way from very large amounts of data, including pieces that may or may not make sense linguistically, for downstream applications such as training contextual embeddings. The standards targeted by this tool are linguistically motivated test sets, so it uses supervised learning with limited amounts of data (100-200K tokens, something like the UD treebanks). Most previous work on this task has used lexical resources, which are essential for high accuracy in gold-standard prediction, so I think things like WordPiece are in a different category (optimized for downstream tasks like NMT, not a stand-alone gold-standard product).
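
To make the contrast concrete, here is a toy sketch of the supervised setting (this is not RFTokenizer's actual code; the gold entries and feature scheme are invented for illustration): each character position in a word form becomes a binary boundary/no-boundary decision, learned from a small gold-segmented sample.

```python
# Toy illustration of supervised boundary classification over characters;
# the two gold entries and the feature window are made up for this example.
from sklearn.ensemble import RandomForestClassifier

gold = {"ולדעתי": "ו|ל|דעת|י", "בבית": "ב|בית"}  # word form -> gold segmentation

def featurize(word, i, window=2):
    """Character window (as codepoints) around position i, padded with '_'."""
    padded = "_" * window + word + "_" * window
    return [ord(padded[i + k]) for k in range(2 * window + 1)]

X, y = [], []
for word, seg in gold.items():
    boundaries, pos = set(), -1
    for ch in seg:               # recover which character positions end a segment
        if ch == "|":
            boundaries.add(pos)
        else:
            pos += 1
    for i in range(len(word) - 1):
        X.append(featurize(word, i))
        y.append(1 if i in boundaries else 0)

clf = RandomForestClassifier(n_estimators=10).fit(X, y)
print(clf.predict([featurize("ולדעתי", 0)]))  # 1 = segment boundary after this character
```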

This paper compares our tool with the previous SOA for Hebrew and finds a substantial improvement in segmentation accuracy:

RFTokenizer: https://aclweb.org/anthology/W18-5811
Previous Hebrew SOA: https://aclweb.org/anthology/C16-1033
Previous Coptic SOA: https://www.aclweb.org/anthology/W16-2119

For Arabic, different tools use different standards (segmenting articles or not, segmenting feminine derivations or not, different corpora), so I'm not sure the comparison is apples to apples. SOA seems to be around 98% (https://www.aclweb.org/anthology/N16-3003, but on a different dataset with different guidelines), and our tool is in that range as well.