singnet / language-learning

OpenCog Unsupervised Language Learning
https://wiki.opencog.org/w/Language_learning
MIT License

Unsupervised Segmentation Learning #255

Open akolonin opened 5 years ago

akolonin commented 5 years ago

The goals of Unsupervised Segmentation Learning (USL) are:
1) Unsupervised learning of the lexicon and tokenisation for languages like Chinese
2) Unsupervised learning of sentence splitting for languages like Chinese
3) Identification of primary (elementary) patterns in symbolic streams of data
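To make goals 1 and 3 concrete, here is a minimal sketch of one classic unsupervised boundary cue, successor variety (often attributed to Zellig Harris): a context that can be followed by many different symbols is a likely chunk boundary. This is only an illustration and not necessarily the approach intended for this issue; the function names, the toy corpus, `context_len` and `threshold` are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def successor_variety(corpus: Iterable[str], context_len: int = 2) -> Dict[str, int]:
    """Count the distinct symbols that follow each context of length context_len."""
    followers: Dict[str, Set[str]] = defaultdict(set)
    for text in corpus:
        for i in range(len(text) - context_len):
            followers[text[i:i + context_len]].add(text[i + context_len])
    return {context: len(symbols) for context, symbols in followers.items()}


def segment(text: str, variety: Dict[str, int],
            context_len: int = 2, threshold: int = 2) -> List[str]:
    """Cut after every context whose successor variety reaches the threshold."""
    chunks: List[str] = []
    start = 0
    for end in range(context_len, len(text)):
        if variety.get(text[end - context_len:end], 0) >= threshold:
            chunks.append(text[start:end])
            start = end
    chunks.append(text[start:])
    return chunks


# Toy unsegmented corpus (hypothetical); any symbolic stream works the same way.
toy_corpus = ["thecat", "thedog", "thepig", "acat", "adog"]
variety = successor_variety(toy_corpus)
chunks = segment("thedog", variety)
```

On this toy corpus only the context `he` is followed by more than one distinct symbol, so the single cut falls after `the`. A real corpus would need a tuned threshold and probably a complementary backward (predecessor) measure.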

Study setup:
1) Collect a training set of either A) Chinese texts OR B) numeric and symbolic streams of data for a specific domain
2) Set up the reference lexicon (library of compound symbols) and sentence/series breaking patterns for either A) Chinese texts OR B) numeric and symbolic streams of data for a specific domain
3) Implement a POC of an unsupervised tokeniser capable of inferring the lexicon (library of compound symbols) from the training set, and assess F1 by matching the inferred data against the reference data
4) Implement a POC of an unsupervised sentence/series splitter capable of inferring breaking patterns and chunking the stream of tokens/symbols using both the inferred and the reference patterns, and assess F1 of the resulting breakdowns for both.
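The F1 assessment in steps 3 and 4 could be sketched as follows. This assumes that the lexicon is compared as a set of entries and that a breakdown is compared as a set of internal boundary offsets; both choices, the helper names and the sample data are assumptions for illustration, not a prescribed evaluation protocol.

```python
from typing import Hashable, List, Sequence, Set


def f1_score(inferred: Set[Hashable], reference: Set[Hashable]) -> float:
    """F1 between two sets: harmonic mean of precision and recall."""
    true_positives = len(inferred & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(inferred)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def boundary_positions(chunks: Sequence[Sequence]) -> Set[int]:
    """Offsets of the internal boundaries implied by a list of chunks."""
    positions: Set[int] = set()
    offset = 0
    for chunk in chunks[:-1]:
        offset += len(chunk)
        positions.add(offset)
    return positions


# Step 3: inferred lexicon vs. reference lexicon (hypothetical sample data).
inferred_lexicon = {"我们", "学习", "语", "言"}
reference_lexicon = {"我们", "学习", "语言"}
lexicon_f1 = f1_score(inferred_lexicon, reference_lexicon)

# Step 4: inferred breakdown vs. reference breakdown of the same stream.
inferred_chunks: List[str] = ["我们", "学习", "语", "言"]
reference_chunks: List[str] = ["我们", "学习", "语言"]
boundary_f1 = f1_score(boundary_positions(inferred_chunks),
                       boundary_positions(reference_chunks))
```

Scoring boundaries instead of whole chunks gives partial credit to an over-segmented result: here one spurious cut costs precision but leaves recall at 1.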

Note: "tokenisation" and "sentence splitting" may both be considered parts of the same "segmentation" problem, so there should be just one solution that solves both, depending on the specified number of "segmentation layers".
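One way to picture the "segmentation layers" idea is a single driver that applies one layer after another, each layer chunking the output of the previous one. The sketch below uses hand-written reference patterns (a fixed lexicon and a fixed terminator set) in place of learned ones; the names `longest_match_layer`, `break_after_layer` and `segment_layers` are hypothetical.

```python
from typing import Callable, Iterable, List, Sequence, Set

Layer = Callable[[Sequence], List]


def longest_match_layer(lexicon: Set[str]) -> Layer:
    """Layer that greedily cuts a string into the longest known lexicon entries."""
    max_len = max((len(entry) for entry in lexicon), default=1)

    def split(symbols: Sequence) -> List:
        chunks = []
        i = 0
        while i < len(symbols):
            for n in range(min(max_len, len(symbols) - i), 0, -1):
                candidate = symbols[i:i + n]
                # Unknown single symbols fall through as one-symbol chunks.
                if n == 1 or candidate in lexicon:
                    chunks.append(candidate)
                    i += n
                    break
        return chunks

    return split


def break_after_layer(terminators: Set[str]) -> Layer:
    """Layer that closes a chunk after each terminator unit."""

    def split(units: Sequence) -> List:
        chunks, current = [], []
        for unit in units:
            current.append(unit)
            if unit in terminators:
                chunks.append(current)
                current = []
        if current:
            chunks.append(current)
        return chunks

    return split


def segment_layers(stream: Sequence, layers: Iterable[Layer]) -> Sequence:
    """Apply each layer to the output of the previous one."""
    units = stream
    for layer in layers:
        units = layer(units)
    return units


text = "我们学习语言。我们学习。"
tokens = segment_layers(text, [longest_match_layer({"我们", "学习", "语言"})])
sentences = segment_layers(text, [longest_match_layer({"我们", "学习", "语言"}),
                                  break_after_layer({"。"})])
```

With one layer the result is a flat token list; with two layers it is a list of sentences, each a list of tokens. In an unsupervised setting the lexicon and the terminator set would be inferred per layer instead of given.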

akolonin commented 1 year ago

SOTA: https://arxiv.org/pdf/2205.11443.pdf