patrickfrey / strusAnalyzer

Library for document analysis (segmentation, tokenization, normalization, aggregation) that produces a set of items which can be inserted into a strus storage. It also provides some functions for analysing tokens or phrases of a strus query.
http://www.project-strus.net
Mozilla Public License 2.0

List of supported document types and segmenters #56

Closed by andreasbaumann 7 years ago

andreasbaumann commented 7 years ago

It would be nice to have a list of the currently supported document formats and segmenters, as well as the default segmenter for each document type.

andreasbaumann commented 7 years ago

API introspection functions and CLI support in strusHelp.

patrickfrey commented 7 years ago

strusHelp now prints the list of available document segmenters.

andreasbaumann commented 7 years ago

Am I missing something?

strusHelp segmenter

prints:

Segmenter 'cjson' :
* Segmenter for JSON (application/json) based on the cjson library for parsing json and textwolf for the xpath automaton

Segmenter 'textwolf' :
* Segmenter for XML (application/xml) based on the textwolf library

Segmenter 'tsv' :
* Segmenter for TSV (text/tab-separated-values)

I cannot see 'segmenter' in the usage:

usage: strusHelp [options] <what> <name>
<what> = specifies what type of item to retrieve (default all):
         tokenizer     : Get tokenizer function description
         normalizer    : Get normalizer function description
         aggregator    : Get aggregator function description
         join          : Get iterator join operator description
         weighting     : Get weighting function description
         summarizer    : Get summarizer function description

Also, I cannot find the supported document classes or the analyzer map anywhere.

patrickfrey commented 7 years ago

Forgot to add it. The strusHelp usage is now fixed.

andreasbaumann commented 7 years ago

strusHelp segmenter
Segmenter 'cjson' :
* Segmenter for JSON (application/json) based on the cjson library for parsing json and textwolf for the xpath automaton

Segmenter 'plain' :
* Segmenter for plain text (in one segment)

Segmenter 'textwolf' :
* Segmenter for XML (application/xml) based on the textwolf library

Segmenter 'tsv' :
* Segmenter for TSV (text/tab-separated-values)

Fixed a long time ago.
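
For anyone who wants the list in machine-readable form, the listing printed by `strusHelp segmenter` can be parsed into a MIME type lookup. The following is only a rough sketch: it assumes the output keeps the exact shape quoted in this thread, and `parse_segmenters`, `SAMPLE_OUTPUT`, and both regular expressions are hypothetical names invented for this example, not part of strus. The resulting map reflects only what the help text states, which is not necessarily the default segmenter strus selects for a document type.

```python
import re

# Sample text copied from the `strusHelp segmenter` output quoted in this thread.
SAMPLE_OUTPUT = """\
Segmenter 'cjson' :
* Segmenter for JSON (application/json) based on the cjson library for parsing json and textwolf for the xpath automaton

Segmenter 'plain' :
* Segmenter for plain text (in one segment)

Segmenter 'textwolf' :
* Segmenter for XML (application/xml) based on the textwolf library

Segmenter 'tsv' :
* Segmenter for TSV (text/tab-separated-values)
"""

# Header line such as: Segmenter 'cjson' :
HEADER_RE = re.compile(r"^Segmenter '([^']+)'\s*:\s*$")
# Parenthesised token that looks like a MIME type, e.g. (application/json).
# Requiring a slash keeps remarks like "(in one segment)" from matching.
MIME_RE = re.compile(r"\(([\w.+-]+/[\w.+-]+)\)")


def parse_segmenters(text):
    """Return (names, mime_map) parsed from strusHelp-style segmenter output.

    names    -- segmenter names in order of appearance
    mime_map -- MIME type -> segmenter name, for entries that state one
    """
    names = []
    mime_map = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line.strip())
        if header:
            current = header.group(1)
            names.append(current)
            continue
        # Description lines start with '*' and belong to the last header seen.
        if current and line.lstrip().startswith("*"):
            mime = MIME_RE.search(line)
            if mime:
                # Keep the first segmenter that claims a given MIME type.
                mime_map.setdefault(mime.group(1), current)
    return names, mime_map


if __name__ == "__main__":
    names, mime_map = parse_segmenters(SAMPLE_OUTPUT)
    print(names)
    print(mime_map)
```

Note that 'plain' shows up in the name list but not in the map, because its description does not state a MIME type.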