ufosc / DocuMiner

A production-ready pipeline for text mining and subject indexing
MIT License
8 stars 5 forks

Multi-Word Expression Tokenization #3

Open Fennec2000GH opened 2 years ago

Fennec2000GH commented 2 years ago

Description

Enable rule-based tokenization that regroups neighboring tokens that logically belong to the same entity, such as compound words or full names.

Objectives

  1. Edit the tokenization functions to accept a variable number of parameters, so that callers can pass specific rules and exceptions to apply during tokenization.
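
As a rough illustration of what this objective could look like, here is a minimal sketch in plain Python. It is not DocuMiner's existing code: the function name `tokenize`, the variadic `*mwes` rules, and the `separator` and `exceptions` keyword arguments are all hypothetical, and the word splitting is a simple regex stand-in for whatever tokenizer the project actually uses.

```python
import re


def tokenize(text, *mwes, separator="_", exceptions=()):
    """Split text into word tokens, then regroup listed multi-word expressions.

    Hypothetical signature: each positional argument after `text` is one
    multi-word expression (rule); `exceptions` lists rules to ignore.
    """
    # Stand-in word splitter; a real pipeline would plug in its own tokenizer.
    tokens = re.findall(r"\w+", text)

    # Try longer rules first so "New York City" wins over "New York".
    rules = sorted((tuple(m.split()) for m in mwes), key=len, reverse=True)
    skip = {tuple(e.split()) for e in exceptions}

    out, i = [], 0
    while i < len(tokens):
        for rule in rules:
            window = tuple(tokens[i:i + len(rule)])
            # `rule and` guards against an empty rule, which would never advance i.
            if rule and window == rule and rule not in skip:
                out.append(separator.join(window))
                i += len(rule)
                break
        else:
            # No rule matched at this position: keep the single token.
            out.append(tokens[i])
            i += 1
    return out


merged = tokenize(
    "Ada Lovelace visited New York City",
    "Ada Lovelace", "New York", "New York City",
)
print(merged)  # ['Ada_Lovelace', 'visited', 'New_York_City']
```

Matching here is exact and case-sensitive, and the longest rule wins at each position. Both choices are assumptions made for the sketch and would need to be decided for the real implementation.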