Library for document analysis (segmentation, tokenization, normalization, aggregation) that produces a set of items that can be inserted into a strus storage. It also provides some functions for analysing tokens or phrases of a strus query.
Some segmenters, such as CSV, need options because their behaviour is not fully standardised and the column descriptions may be declared in a separate place rather than in the file itself.
Some segmenters, such as one based on LibXML (which does not exist yet, but may in the future), may offer options to explicitly switch off behaviour that the standard requires but that is undesirable because it opens vulnerabilities, for example attacks based on recursive entity declarations.
I would suggest a structure SegmenterOptions containing an array of name-value pairs that is passed to SegmenterInterface::createInstance(). Which options are available depends on the segmenter implementation.
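
A minimal sketch of what such a structure could look like follows. It is only an illustration of the proposal, not existing strus code: the chained operator(), the CsvConfig structure, the parseCsvOptions() helper and the option names "delimiter" and "column" are all assumptions made for this example. A real implementation would hand the options to SegmenterInterface::createInstance(), whose exact signature is not specified here.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the proposed structure: an ordered list of
// name-value pairs whose meaning is defined by each segmenter implementation.
class SegmenterOptions
{
public:
    typedef std::pair<std::string,std::string> Item;

    // Append one option; returns *this so that calls can be chained.
    SegmenterOptions& operator()( const std::string& name, const std::string& value)
    {
        m_items.push_back( Item( name, value));
        return *this;
    }
    const std::vector<Item>& items() const { return m_items; }

private:
    std::vector<Item> m_items;
};

// Hypothetical configuration a CSV segmenter might derive from its options.
struct CsvConfig
{
    char delimiter;                     // field separator, ',' by default
    std::vector<std::string> columns;   // column names in declaration order
    CsvConfig() : delimiter(',') {}
};

// Hypothetical helper: interpret the options the way a CSV segmenter could.
// Unknown option names are rejected, because the set of valid options
// depends on the segmenter implementation.
CsvConfig parseCsvOptions( const SegmenterOptions& opts)
{
    CsvConfig cfg;
    for (const SegmenterOptions::Item& item : opts.items())
    {
        if (item.first == "delimiter")
        {
            if (item.second.size() != 1)
            {
                throw std::invalid_argument( "delimiter must be a single character");
            }
            cfg.delimiter = item.second[0];
        }
        else if (item.first == "column")
        {
            cfg.columns.push_back( item.second);
        }
        else
        {
            throw std::invalid_argument( "unknown CSV segmenter option: " + item.first);
        }
    }
    return cfg;
}
```

With this sketch, a caller could write SegmenterOptions()("delimiter", ";")("column", "title")("column", "body") and pass the result on. Keeping the pairs in a vector rather than a map preserves their order and allows a name such as "column" to appear more than once.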