Closed lautel closed 4 years ago
Thank you for the practical request.
In short, the best way to solve the problem you pointed out is pre- and post-processing, not a change to the morphological analysis process itself. We think this makes the splitting behavior easier to control.
Since there are too many possible notations for e-mail addresses and URLs, not every pattern can be registered in the dictionary during the development phase.
If you want to resolve the problem by preprocessing, a convenient approach is to replace each e-mail address or URL with a placeholder that is analyzed as an unknown word, and to restore the original string after parsing.
If you prefer post-processing, the morphological analysis results should be chained with a CRF or a similar model.
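As a much simpler stand-in for a CRF, the post-processing step can also be approximated with a rule: find e-mail addresses and URLs in the original text with a regular expression, then merge the tokens that fall inside each match. The sketch below shows that rule-based variant, not a CRF. The pattern is not exhaustive, the function name `merge_entity_tokens` is hypothetical, and it assumes every token appears verbatim and in order in the original text.

```python
import re

# Simple, non-exhaustive patterns for URLs and e-mail addresses (illustrative only).
ENTITY = re.compile(
    r"https?://[\w\-./?%&=#:~+]+|[\w.+-]+@[\w-]+(?:\.[\w-]+)+", re.ASCII
)


def merge_entity_tokens(text, tokens):
    """Merge consecutive tokens that lie inside one e-mail/URL match in `text`."""
    # Recover character offsets by scanning the text left to right.
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        pos = start + len(tok)
        spans.append((start, pos))

    entities = [m.span() for m in ENTITY.finditer(text)]
    merged, last_entity = [], None
    for tok, (start, end) in zip(tokens, spans):
        # The entity span that fully contains this token, if any.
        entity = next(((s, e) for s, e in entities if s <= start and end <= e), None)
        if entity is None:
            merged.append(tok)
        elif entity != last_entity:
            # First token of a new entity: emit the whole matched string once.
            merged.append(text[entity[0]:entity[1]])
        last_entity = entity
    return merged


text = "Mail me at user@example.com or see https://example.com/a?b=1 ."
tokens = re.findall(r"\w+|[^\w\s]", text)  # stand-in for real analyzer output
print(merge_entity_tokens(text, tokens))
```

A token that straddles an entity boundery is not handled here: it is kept as-is and the entity is emitted again afterwards, so real analyzer output may need extra care.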
These approaches are very common and are covered in many natural language processing textbooks.
If you don't need phonetic characters for Japanese processing, you might want to use Juman++ or Nagisa.
Thank you very much.
Motivation and Goal
Instead of breaking down an email address and/or a URL, it would be desirable to have an option to identify email addresses and URLs as single tokens. See the example below to compare the current behavior with the suggested one.
Sample code
Output
Desirable output