patrickfrey / strusAnalyzer

Library for document analysis (segmentation, tokenization, normalization, aggregation) with the goal to get a set of items that can be inserted into a strus storage. Also some functions for analysing tokens or phrases of the strus query are provided.
http://www.project-strus.net
Mozilla Public License 2.0

What is the meaning of scheme in DocumentClass #40

Open andreasbaumann opened 7 years ago

andreasbaumann commented 7 years ago

I understand the MIME-type and the encoding. What is scheme? The standard document class detector doesn't seem to set it?

andreasbaumann commented 7 years ago

I'm trying to hack something looking similar to the storage configuration:

-d|--documentclass <DOCUMENT CLASS CONFIG>
    Use an explicit document class <NAME> (default is auto-probing)
      <DOCUMENT CLASS CONFIG> is a semicolon ';' separated list of assignments:
            type=<MIME-type>, for instance 'text/plain'
            encoding=<charset>, for instance 'UTF-8'
            scheme=<scheme>

branch follows.
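
To make the proposed option format concrete, here is a minimal sketch of how such a semicolon-separated list of assignments could be split into its three fields. It is only an illustration under stated assumptions: `DocClassSketch` and `parseDocClassSketch` are hypothetical names, not part of strus, and this is not the implementation of `strus::parseDocumentClass`. Unknown keys are rejected and key names are matched case-insensitively, which is a design assumption rather than documented behaviour.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a document class (NOT strus::analyzer::DocumentClass).
struct DocClassSketch
{
    std::string type;      // MIME type, e.g. "text/plain"
    std::string encoding;  // character set, e.g. "UTF-8"
    std::string scheme;    // left empty if not specified
};

// Strip leading and trailing whitespace.
static std::string trimSketch( const std::string& s)
{
    std::size_t b = 0, e = s.size();
    while (b < e && std::isspace( static_cast<unsigned char>( s[b]))) ++b;
    while (e > b && std::isspace( static_cast<unsigned char>( s[e-1]))) --e;
    return s.substr( b, e - b);
}

// Lowercase a key so that "Type" and "type" are treated alike (assumption).
static std::string lowerSketch( std::string s)
{
    std::transform( s.begin(), s.end(), s.begin(),
        [](unsigned char c){ return static_cast<char>( std::tolower( c)); });
    return s;
}

// Parse "type=...; encoding=...; scheme=..." into the hypothetical struct.
DocClassSketch parseDocClassSketch( const std::string& config)
{
    DocClassSketch rt;
    std::istringstream in( config);
    std::string item;
    while (std::getline( in, item, ';'))
    {
        item = trimSketch( item);
        if (item.empty()) continue;  // tolerate a trailing ';'
        std::size_t eq = item.find( '=');
        if (eq == std::string::npos)
        {
            throw std::runtime_error( "expected <key>=<value> in: " + item);
        }
        std::string key = lowerSketch( trimSketch( item.substr( 0, eq)));
        std::string value = trimSketch( item.substr( eq + 1));
        if (key == "type") rt.type = value;
        else if (key == "encoding") rt.encoding = value;
        else if (key == "scheme") rt.scheme = value;
        else throw std::runtime_error( "unknown key in document class: " + key);
    }
    return rt;
}
```

With this sketch, the string `type=text/tab-separated-values; encoding=UTF-8` would fill `type` and `encoding` and leave `scheme` empty, while a string without any `=` raises an error.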

andreasbaumann commented 7 years ago

Ah. strusInsert has a -C parameter:


        if (!contenttype.empty() && !strus::parseDocumentClass( documentClass, contenttype, errorBuffer.get()))
        {
            throw strus::runtime_error(_TXT("failed to parse document class"));
        }

But:

DLL_PUBLIC bool strus::parseDocumentClass(

duplicates code and should not live in the programLoader. Maybe it belongs in strusAnalyzer, near the document class?

andreasbaumann commented 7 years ago

            std::cout << "-D|--contenttype <CT>" << std::endl;
            std::cout << "    " << _TXT("forced definition of the document class of all documents inserted.") << std::endl;

The short option -D for contenttype is not really logical, so I am renaming it to -C.

andreasbaumann commented 7 years ago

Puzzle:

        strus::analyzer::DocumentClass documentClass;
        if (!contenttype.empty() && !strus::parseDocumentClass( documentClass, contenttype, errorBuffer.get()))
        {
            throw strus::runtime_error(_TXT("failed to parse document class"));
        }
        // Load analyzer program(s):
        strus::AnalyzerMap analyzerMap( analyzerBuilder.get(), analyzerprg, documentClass, segmentername, errorBuffer.get());
        std::cerr << analyzerMap.warnings();

in strusInsert.

In strusAnalyze there is:

        strus::analyzer::DocumentClass dclass;
        if (!textproc->detectDocumentClass( dclass, hdrbuf, hdrsize))
        {
            throw strus::runtime_error( _TXT("failed to detect document class")); 
        }

andreasbaumann commented 7 years ago

Though I'm passing:

-C 'type=text/tab-separated-values'

There is no tsv segmenter chosen for the given content type:

ERROR unhandled error in insert storage: database transaction with error: error defining expression for 'textwolf' segmenter: error in selection expression 'id' at start of expression

Specifying both works:

-C 'type=text/tab-separated-values' -g tsv

patrickfrey commented 7 years ago

The Content-Type parameter is now implemented in a uniform way for all programs doing document analysis.