Open andreasbaumann opened 7 years ago
I'm trying to hack something looking similar to the storage configuration:
-d|--documentclass <DOCUMENT CLASS CONFIG>
Use an explicit document class <NAME> (default is auto-probing)
<DOCUMENT CLASS CONFIG> is a semicolon ';' separated list of assignments:
type=<MIME-type>, for instance 'text/plain'
encoding=<charset>, for instance 'UTF-8'
scheme=<scheme>
branch follows.
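To make the proposed format concrete, here is a minimal sketch of how such a semicolon-separated list of assignments could be parsed. It is not the strus implementation: `DocumentClassSketch` and `parseDocumentClassSketch` are hypothetical names invented for this illustration, and the only keys handled are the three listed in the help text above.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for strus::analyzer::DocumentClass; not the real type.
struct DocumentClassSketch
{
    std::string mimeType;
    std::string encoding;
    std::string scheme;
};

// Strip leading and trailing whitespace.
static std::string trim( const std::string& s)
{
    const char* ws = " \t\r\n";
    std::string::size_type b = s.find_first_not_of( ws);
    if (b == std::string::npos) return std::string();
    std::string::size_type e = s.find_last_not_of( ws);
    return s.substr( b, e - b + 1);
}

// Parse "type=text/plain; encoding=UTF-8; scheme=..." into the struct above.
// Throws std::runtime_error on a malformed assignment or an unknown key.
DocumentClassSketch parseDocumentClassSketch( const std::string& src)
{
    DocumentClassSketch rt;
    std::istringstream in( src);
    std::string item;
    while (std::getline( in, item, ';'))
    {
        item = trim( item);
        if (item.empty()) continue;   // tolerate empty items, e.g. a trailing ';'
        std::string::size_type eq = item.find( '=');
        if (eq == std::string::npos)
        {
            throw std::runtime_error( "expected key=value in: " + item);
        }
        std::string key = trim( item.substr( 0, eq));
        std::string val = trim( item.substr( eq + 1));
        if (key == "type") rt.mimeType = val;
        else if (key == "encoding") rt.encoding = val;
        else if (key == "scheme") rt.scheme = val;
        else throw std::runtime_error( "unknown key: " + key);
    }
    return rt;
}
```

With this sketch, `parseDocumentClassSketch("type=text/plain; encoding=UTF-8")` fills in the MIME type and the encoding and leaves the scheme empty.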
Ah. strusInsert has a -C parameter:
if (!contenttype.empty() && !strus::parseDocumentClass( documentClass, contenttype, errorBuffer.get()))
{
	throw strus::runtime_error(_TXT("failed to parse document class"));
}
But:
DLL_PUBLIC bool strus::parseDocumentClass(
duplicates code and should not live in the programLoader; maybe it belongs in strusAnalyzer, near the document class?
std::cout << "-D|--contenttype <CT>" << std::endl;
std::cout << " " << _TXT("forced definition of the document class of all documents inserted.") << std::endl;
The -D short option for contenttype is not really logical; renaming it to -C.
Puzzle:
strus::analyzer::DocumentClass documentClass;
if (!contenttype.empty() && !strus::parseDocumentClass( documentClass, contenttype, errorBuffer.get()))
{
	throw strus::runtime_error(_TXT("failed to parse document class"));
}
// Load analyzer program(s):
strus::AnalyzerMap analyzerMap( analyzerBuilder.get(), analyzerprg, documentClass, segmentername, errorBuffer.get());
std::cerr << analyzerMap.warnings();
That is in strusInsert.
In strusAnalyze there is:
strus::analyzer::DocumentClass dclass;
if (!textproc->detectDocumentClass( dclass, hdrbuf, hdrsize))
{
	throw strus::runtime_error( _TXT("failed to detect document class"));
}
Though I'm passing:
-C 'type=text/tab-separated-values'
no tsv segmenter is chosen for the given content type:
ERROR unhandled error in insert storage: database transaction with error: error defining expression for 'textwolf' segmenter: error in selection expression 'id' at start of expression
Specifying both works:
-C 'type=text/tab-separated-values' -g tsv
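The behaviour expected here, where the segmenter follows from the content type unless -g overrides it, could be sketched as below. This is an assumption about the desired logic, not how strus selects segmenters: `chooseSegmenterSketch` is a hypothetical helper, and the lookup table holds only the one MIME type to segmenter pairing that appears in this report.

```cpp
#include <map>
#include <string>

// Hypothetical helper, not part of strus: pick a segmenter name.
std::string chooseSegmenterSketch(
    const std::string& explicitName,   // value passed with -g, may be empty
    const std::string& mimeType,       // 'type' value passed with -C, may be empty
    const std::string& fallback)       // segmenter to use when nothing matches
{
    // An explicitly named segmenter always wins.
    if (!explicitName.empty()) return explicitName;

    // Otherwise derive the segmenter from the MIME type, if it is known.
    static const std::map<std::string,std::string> byMimeType = {
        {"text/tab-separated-values", "tsv"}
    };
    std::map<std::string,std::string>::const_iterator mi = byMimeType.find( mimeType);
    return mi == byMimeType.end() ? fallback : mi->second;
}
```

Under this sketch, passing only -C 'type=text/tab-separated-values' would select the tsv segmenter, and any unknown type would fall back to whatever default the caller supplies.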
The Content-Type parameter is now implemented in a uniform way for all programs doing document analysis.
I understand the MIME-type and the encoding. What is scheme? The standard document class detector doesn't seem to set it?