Unsupervised Segmentation Learning

The goals of Unsupervised Segmentation Learning (USL) are:
1) Unsupervised learning of the lexicon and tokenisation for languages like Chinese
2) Unsupervised learning of sentence splitting for languages like Chinese
3) Identification of primary (elementary) patterns in symbolic streams of data

Study setup:
1) Collect a training set of either A) Chinese texts OR B) numeric and symbolic streams of data for a specific domain
2) Set up a reference lexicon (library of compound symbols) and reference sentence/series breaking patterns for either A) Chinese texts OR B) numeric and symbolic streams of data for a specific domain
3) Implement a POC of an unsupervised tokeniser capable of inferring the lexicon (library of compound symbols) from the training set, and assess F1 by matching the inferred data against the reference data
4) Implement a POC of an unsupervised sentence/series splitter capable of inferring breaking patterns and chunking the stream of tokens/symbols using both the inferred and the reference patterns, and assess F1 by matching the breakdowns obtained with the inferred patterns against those obtained with the reference patterns
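The F1 assessment in steps 3 and 4 could be sketched as follows. This is a minimal illustration, not the project's evaluation code: the helper names `f1_score` and `boundaries` are hypothetical, and it assumes that lexicons are compared as sets of entries and segmentations as sets of boundary offsets.

```python
from typing import List, Set


def f1_score(inferred: Set, reference: Set) -> float:
    """F1 between two sets of items (lexicon entries or boundary offsets)."""
    true_positives = len(inferred & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(inferred)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def boundaries(tokens: List[str]) -> Set[int]:
    """Character offsets at which one token ends and the next one begins."""
    offsets = set()
    position = 0
    for token in tokens[:-1]:  # the end of the last token is not a boundary
        position += len(token)
        offsets.add(position)
    return offsets


# Hypothetical example: the reference segmentation vs. an over-split one.
reference_tokens = ["我", "喜欢", "学习"]
inferred_tokens = ["我", "喜", "欢", "学习"]

# Boundary-level F1: reference offsets {1, 3}, inferred offsets {1, 2, 3}.
boundary_f1 = f1_score(boundaries(inferred_tokens), boundaries(reference_tokens))
print(round(boundary_f1, 3))  # 0.8

# Lexicon-level F1: compares the sets of distinct entries.
lexicon_f1 = f1_score(set(inferred_tokens), set(reference_tokens))
print(round(lexicon_f1, 3))  # 0.571
```

The same `f1_score` applies to the sentence/series splitter if `boundaries` is computed over token offsets instead of character offsets.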

Note: "tokenisation" and "sentence splitting" may be considered parts of the same "segmentation" solution, so there should be just one solution solving both problems, depending on the specified number of "segmentation layers".
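One way to picture the "segmentation layers" idea is a single driver that applies a list of splitters in sequence, each grouping the units produced by the previous layer. The sketch below is only an assumption about how such a design might look: `segment_layers`, `lexicon_splitter` and `break_after` are hypothetical names, and the hand-written lexicon and break symbols stand in for whatever the unsupervised learner would infer.

```python
from typing import Callable, Iterable, List, Set

# A splitter takes the units of one layer and returns the units of the next.
Splitter = Callable[[list], list]


def segment_layers(stream: Iterable, splitters: List[Splitter]) -> list:
    """Apply one splitter per segmentation layer, innermost layer first."""
    units = list(stream)
    for split in splitters:
        units = split(units)
    return units


def lexicon_splitter(lexicon: Set[str]) -> Splitter:
    """Layer 1 stand-in: greedy longest match of characters against a lexicon."""
    max_len = max((len(word) for word in lexicon), default=1)

    def split(chars: list) -> list:
        tokens, i = [], 0
        while i < len(chars):
            for size in range(min(max_len, len(chars) - i), 0, -1):
                candidate = "".join(chars[i:i + size])
                # A single character is always accepted as a fallback.
                if size == 1 or candidate in lexicon:
                    tokens.append(candidate)
                    i += size
                    break
        return tokens

    return split


def break_after(break_symbols: Set[str]) -> Splitter:
    """Layer 2 stand-in: close a group after each breaking symbol."""

    def split(tokens: list) -> list:
        groups, current = [], []
        for token in tokens:
            current.append(token)
            if token in break_symbols:
                groups.append(current)
                current = []
        if current:
            groups.append(current)
        return groups

    return split


layers = [lexicon_splitter({"喜欢", "学习", "你好"}), break_after({"。"})]
print(segment_layers("我喜欢学习。你好。", layers[:1]))  # one layer: tokens
print(segment_layers("我喜欢学习。你好。", layers))      # two layers: sentences
```

With this shape, the number of segmentation layers is simply the length of the splitter list, and tokenisation alone corresponds to passing only the first layer.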

singnet / language-learning

Unsupervised Segmentation Learning #255