opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1
389 stars 118 forks source link

Improve error reporting #414

Closed ampli closed 7 years ago

ampli commented 8 years ago

Hi Lians, The discussion started at PR #400. Here is the relevant section:

In order that the library will be able to get embedded seamlessly in applications (like grammar checkers or GUIs) it should not print such error messages, but instead only provide an error indication, and put the error in a variable (a per-thread variable if multi-threaded), to be retrieved by a new function link-grammar_error(). Such an arrangement is found in other libraries.

I thought of two different ways to implement such a scheme:

  1. Putting the error messages in a thread-local variable. A function like get_linkgrammar_lasterror() will be called after each library API call and will retrieve the messages from this variable.
  2. Putting the error messages in the "object". This means adding a "message" member to the Dictionary, Sentence, Linkage and ParseOptions structures. There will be several function to retrieve the errors, one per object (or one function can be used if the "messages" member is always the first one in the corresponding structure).

But option (2) has a problem: If the dictionary couldn't be opened, there is no object in which to store the error due to which it couldn't be opened (there can be many). My proposed solution for that is use NULL as the object when the returned dict is NULL (and in that case retrieve the error from a global variable).

Another thing to note is that most of the API functions, including each of the parse_optionget*() functions, will have to start with a call to reset_lasterror(), so errors will not get accumulated over API calls - there can be several error messages from the same API call, so they are allowed to be accumulated per API call, and to be returned at once by get_linkgrammar_lasterror().

I already implemented most of (1), but then thought again on (2), so I need you guidance regarding what to actually implement.

Since I intend to initially make it completely compatible with the current way of error reporting (to the extent that no change will be needed in existing applications) there is no need to do things totally the "right" way from the start. We can just declare the new error reporting way as "experimental" for several releases until it get stabilized.

linas commented 8 years ago

option 1 seems OK. I understand the appeal of option 2 but it seems to add complexity, with no real benefit - our system is just not that error-complex.

I dislike the idea of an automatic reset_lasterror(), except maybe in only one case only: sentence_create(). The idea is that the user could make a whole bunch of API calls, and then check if there was an error, at the end. That error code would only get reset when starting with a new sentence.

ampli commented 8 years ago

After trying to implement both, option (2) is indeed much more complicated, and has no added functionality.

Regarding resetting messages: There are currently plenty of notification messages on things which are not considered as errors. I mainly thought of discarding them if not gathered "in time". However, when thinking on that again, it indeed not needed, and on sentence_create() it is enough.

In order to see that the new error message has the needed functionality, I will try to write a simple GUI parser application.

linas commented 8 years ago

An old trick that I've seen with C interfaces is to have a stack of errors, and the user has to pop the stack until empty, to get them all. That way, you can return multiple error conditions, if needed. The user, of course, won't know wexactly where they might have occurred, but that is their fault, not ours.

ampli commented 8 years ago

My current implementation has a API to get all errors at once, so usual usage doesn't need too much efforts, and also has an API to fetch the messages/errors one by one.

It does that by internally concatenating all the messages to one string, and providing a vector of pointers into this strings for fetching individual errors. This is essentially equivalent to the said stacked errors, but in a thought that the usual usage will be to print all the stacked messages at once (because the current library may emit several messages per API call).

However, a similar alternate implementation may have just the error stacked, and an API call option to concatenate them. I will try both implementation to see which is simpler and easier to use.

ampli commented 7 years ago

Hi Linas, I implemented the new error reporting method, using a slight different way than the one I described above. It is (optionally) using a stack of errors. No changes were needed in link-parser.c.

I need you comments in order to refine the API. After that I will also implement a Python bindings. Because the said error reporting method implementation is compatible by default to the current way, languages without these bindings will just function as before.

I'm not sure about several implementation details.

General implementation details:

1) I used the word error in the function names etc. (e.g. linkgrammar_print_errors()) even though these are not only errors but also warnings, info, and even debug/trace messages or just messages (I could use the word message but similar interfaces use error ). 2) I used linkgrammar in the visible definitions in order to prevent possible clashes. 3) I used a very limited subset of what provided by other libraries, e.g. SSL, because the unimplemented features seems to me relatively minor, and the implemented ones seems to be flexible enough and extendable if needed. 4) I'm not sure regarding how to specify the severity level. Currently it is specified as an initial string, and internally converted to a number, which is essentially unused - its comparison to Info is essentially a no-op since the error-context is never set for Info. Currently sometimes it is not set by mistake, or set in an unrecognized way.

Instead, I suggest to use a symbolic severity level, i.e. to use something like: prt_error(Error, "No language specified!") However, this will make it needed to include enum severity in link-includes.h, which theoretically may cause symbol clash with other libraries. However, in such a case it is easy enough to avoid such a clash by a bunch of #define's before link-includes.h.

Longer names are also possible, like linkgrammar_error. However, it will make the usage cumbersome. A solution can be to then provide a macro/function for each severity level: prt_fatal(), prt_error(), prt_warn(), prt_info().

5) I suggest to use the new error reporting method in a more general way:

New API functions:

1) Set a callback function and argument to be called on each error. If not set, a default handler default_error_handler() is used, with NULL data (it just prints the errors). If the function is NULL, the error is just stacked. It returns the previous error handler function (which is default_error_handler() on its first calle). Note: Currently there is no way to get the corresponding data of the old function. But since the default data is NULL this doesn't seem important.

link_public_api(linkgrammar_error_handler)
     linkgrammar_set_error_handler(linkgrammar_error_handler, linkgrammar_error_data *);

2) Format a raw error message (it adds link-grammar: and maybe more, like converting the severity level integer value to a string). Usually there is no direct need to use it, unless a new linkgrammar_error_handler is set (see (1) above). (It can be extended with a format argument like: "%n: %l: %m" but most probably there is no need for that.)

link_public_api(char *)
     linkgrammar_format_error(linkgrammar_error *lge);

3) Print all stacked errors and clear them. I didn't provide a way to print or clear a particular error. An error handler function has to be supplied. It can be the default one, as returned by an initial call to linkgrammar_set_error_handler(). Alternatively - not implemented, the default one can be called if NULL is supplied (but see (4) below).

link_public_api(int)
     linkgrammar_print_errors(linkgrammar_error_handler, linkgrammar_error_data *);

4) Clear all the errors. It returns the number of errors cleared. Not implemented: Argument false to just return the number of stacked errors. It can be made redundant if function NULL to linkgrammar_print_errors() (see (3) above) would mean discarding the errors.

link_public_api(int)
     linkgrammar_clear_errors(void);

New API data definitions:

1) Error Severity. (See implementation detail (5) above for a longer severity name suggestion.)

typedef enum
{
    Fatal = 1,
    Error,
    Warn,
    Info,
    Debug,
    None
} linkgrammar_error_severity;

3) Each error generates this structure. I removed err_ctxt ec from it (but not from the error reporting) because it seems to me less useful to have it and I couldn't find a simple way to extend it for non-sentence errors.

typedef struct linkgrammar_error {
    /* err_ctxt ec; */
    linkgrammar_error_severity sev;
    const char *msg;
} linkgrammar_error;

2) This is the error handler function.

typedef int (*linkgrammar_error_handler)(linkgrammar_error *, linkgrammar_error_data *);

3) This is the data for error handler function data (NULL for the default error handler). stdout_sev is the severity below-or-equal it the function should write to stdout (default is Warn) so informatial output will still be synchronized to regular output by default, while allowing real error to bypass stdout. In a windowing implementation the data can include the window handle, and in a web implementation the error message tag ID etc. It can also control on which level to put "link-grammar:" prefix (not implemented).

typedef struct linkgrammar_error_data {
    linkgrammar_error_severity stdout_sev;
    void *data;
} linkgrammar_error_data;

4) This I would like to provide in order to allow custom formatting of the error message, but I'm not sure how to provide it in link-includes.h. Possibilities are as extern variable, or as a function like const char *linkgrammar_error_severity_lable(int) (which has a benefit of printing "Unknown severity x" for out-of-bound values and return "" for None (else "Error: "etc.).

const char *linkgrammar_error_level[] =
{
    "Fatal", "Error", "Warning", "Info", "Debug", "None", NULL
};
linas commented 7 years ago

To have shorter names, I think its ok to just use lg_ as the function prefix, instead of linkgrammar_

I do NOT like the void* at all.

If there is an error, then there needs to be some way of passing useful information about it: e.g. if there is a bug in a dictionary, then the error message needs to say "error at line number NN in dictionary DD" where DD is either a file-name, or we pass a pointer to struct Dictionary to the error handler, so that the error handler can get additional info from the dictionary. Similar, if its an error in the sentence, its useful to either pass a long string, or to pass struct Sentence, or struct linkage, to allow the error handler to find and print more data. (there's no public api for the tokenizer.... maybe there should be?)

So, the original err_ctxt is almost exactly like your new linkgrammar_error_data, but different. I don't care if you keep or get rid of err_ctxt. However, there should be only one struct, not two: both struct linkgrammar_error and struct linkgrammar_error_data should be the same struct.

ampli commented 7 years ago

As usual, I guess I was not clear enough. I will try to clarify it:

struct linkgrammar_error and struct linkgrammar_error_data are for totally different and disjoint purposes. Actually, the later should have been named struct linkgrammar_error_handler_data but I shortened its name.

struct linkgrammar_error is for the per-error data. Its msg member is for the error message, which has all the information you mention, i.e. it continues to be something like "error at line number NN in dictionary DD" as is now (I don't propose at all to generally change the error messages). I didn't find any point for now to have for now more detailed data than an error string (such data needs to be totally different for dictionary error, sentence error, option error, argument error, locale error and maybe more) and the severity. However, later we can decide to change the error string format according to de-facto standards, so, for example, an editor will be able to automatically locate the cursor under the problematic dictionary construct.

(BTW, if desired, it is easy to transparently add a source-code function, file and line for the originating message. This can be added later with a local change in the error functions.)

struct linkgrammar_error_data is for a totally another purpose. It is for the API user, and is constructed solely from a user-provided data (no per-error data i it at all) to be provided to the error callback function: Since the program can be multithreaded or multi-window, it can provide the error function callback the window details (like handle or ID) to which it should write its message, the color scheme of messages, etc.

Think of an editor like vim with multiple open files, or an interactive chat with several different sessions, when error messages should go to the correct place for each file/session.

Hence lg_set_error_handler(lg_error_handler, lg_error_data *) sets at once both the error handler callback function and its private data (per-thread if the compiler supports TLS, as detected by configure). In addition, the error handler callback function gets two arguments: typedef int (*lg_error_handler)(lg_error *, lg_error_data *); The first one is the per-error data (which can be extended later if desired). The second one is the per-function API-user-provided data (which we don't know in advance, hence the void *).

To sum up:

Its only component that is "not minimal" and unused for now is the error stack. I didn't find a good use for it, since the callback function method (along with its API-user data per function) seems to me flexible enough for all purposes.

BTW, I used linkgrammar_ and not lg_ due to these reasons:

However, I see now that there are already many corpus/sence/cluster functions that start with lg_. So maybe linkgrammar_ should be replaced by lg_ even for these existing 3 functions.

linas commented 7 years ago

oh OK,.. then why not have just int (*lg_error_handler)(lg_error *,void*) ? That seems to be the most common style for this.

ampli commented 7 years ago

oh OK,.. then why not have just int (_lg_error_handler)(lgerror *,void) ? That seems to be the most common style for this.

Initially I did it this way.

However, I wanted to consolidate the Debug/Trace facility with the error facility, and to provide a way to have the Info/Debug/Trace messages (and even the Warning messages, which are essentially informational in the LG library) output by default on stdout so they line-up with the results, while the real error messages (Error/fatal) still output in the traditional manner to (EDIT) stderr.

The above is the current default behavior, but if the API user would like to change it, to that end I added the linkgrammar_error_severity member to the error callback user-data struct, while sill providing in it the traditional void *. Else there is no way to change it without rewriting the default error handler. However, it is very easy to rewrite it - (it is an open source so the default error handler can be copied and slightly modified) so if this seems to you gross I can change it back to void *.

Some other questions remain that I need to resolve (for details see my initial message from today). Here is a summary of my proposals:

linas commented 7 years ago

On Tue, Nov 8, 2016 at 7:22 PM, Amir Plivatsky notifications@github.com wrote:

oh OK,.. then why not have just int (*lg_error_handler)(lg_error *,void*) ? That seems to be the most common style for this.

Initially I did it this way.

The above is the current default behavior, but if the API user would like to change it, to that end I added the linkgrammar_error_severity member to the error callback user-data struct, while sill providing in it the traditional void *.

? But the callback already gets the error severity level in the first argument -- why would it needs a second copy?

Else there is no way to change it without rewriting the default error handler.

? where is that?

  • Specify the error level symbolically (as opposed to initial string in the error message).

numerically is usually easiest, that way, you can just do a if (verbosity>=(level)) instead of if (strcmp(verbosity, level)) but it doesn't matter that much to me.

I'm going for the "principle of least surprise" -- this API should resemble other popular C api's as much as possible, so that new programmers can just look at it, and guess what it does, without having to carefully study it.

That's why I like the void* in the callback -- almost everyone else does it like that.

I think that almost everyone uses enums for error levels. I don't really care if the enums are strings or ints.

I don't really know what "best practices" are for C code error reporting these days. I sort-of knew what they were 20 years ago, but lost touch.

  • In that case, have the severity enum symbols prefixed (lg_error, lg_warn) to prevent possible enum clash.

since its now in the public api, I guess that would be needed...

  • Provide macro/function per level (prt_error(), prt_warn etc()).

sure. you don't have to do that, but it is commonly done...

  • Combine the debug facility with the error facility, using Debug and maybe Trace severity levels.

Yes! they should be combined. Again -- "best practices", "principle of least surprise", and just the right amount of "keep it simple". I think Einstein said "as simple as possible, but no simpler".

  • Add API function char *lg_error_severity_lable(int) to return the corresponding string for the severity level (to be used internally, and by the API user only when message customizing is needed).

Yes, this seems to be typical.

ampli commented 7 years ago

? But the callback already gets the error severity level in the first argument -- why would it needs a second copy?

The callback function is called as following: lg_error_handler(lg_error *, lg_error_data *) The first argument indeed includes the severity of the message, along with the message itself (the per-error data). On the other hand, lg_error_data includes the API user-data of this callback function, and my proposal is to have in it a "severity threshold" value from which to print to stderr, so only the selected levels (default to Error and Fatal) will be logged to stderr. However, the user will still be able change it easily in my proposed implementation by assigning this field another value (so also warnings will got to stderr, for example). Without that facility, the user will not be able to change this behavior without rewriting the error callback function, or all the messages will go to stderr (especially including debug messages, a thing which will annoy me much...). Note that even with my proposal, the API user doesn't typically have to set the callback data and may leave it as NULL (its default) - in which case the default action of logging only Error and Fatal to stderr will be taken.

Else there is no way to change it without rewriting the default error handler.

? where is that?

It is currently defined as static function, and not directly provided in the API. The only way to currently get it when desired (and it can be desired for several reasons), is to use something like the following as the first call to lg_set_error_handler(): old_error_handler = lg_set_error_handler(NULL, NULL); Alternatively, if desired, default_error_handler() can be renamed to lg_default_error_handler() and provided in the API.

(verbosity>=(level))" instead of if (strcmp(verbosity, level)) bit it doesn't matter that much to me.

An integer value is especifically good for easy > or < comparisons.

I don't really know what "best practices" are for C code error reporting these days. I sort-of knew what they were 20 years ago, but lost touch.

Maybe something very complex like in libssl, which also allows error stack manipulation and fine-grain error formatting, or more simple like in libvirt. A good error facility may even allow setting error handler per error type, with inheritance etc. I think all of that is too complex for our simple needs so I tried to implement a minimal error facility (the error stack doesn't belong to a minimal implementation and I'm not sure how much it is useful, but the code for it is small and already implemented).

(While mentioning an error type (above) - I thought that maybe an error type member can be added to struct lg_error (like Dictionary, Sentence, Option etc.) But such enhancements can be added also later.)

So a remaining question is whether linkgrammar_clear_errors() is needed (and if so - with or without argument), or just linkgrammar_print_errors(NULL, NULL) can be used (which means "print" the error stuck with NULL lg_error_handler callback function (the second NULL is the callback data). But this is not least-surprise.

ampli commented 7 years ago

(there's no public api for the tokenizer.... maybe there should be?)

The idea of issue #420 is an API to the results of the tokenizer. I would like to implement it just after new error handling facility. Another API can be to get different word "role-tagging" in a programmatic way instead of textually analyzing the word infix-marks, guess marks and subscripts. Yet another API can enable to navigate in the wordgraph. However, I don't know what is the best practice to allow that in a user API.

I also had several proposals of how to define the whole tokenizer operation tokenizer in a definition file - which is a kind of API to define a tokenizer (one idea is to use regexp-like definitions - partially implemented - differently than the demo in the current code). Another idea, which may be very appealing, is to define the tokens type order by LG grammar. After all, which token type can come after which one is a kind of grammar. (Maybe even the post processing can be described by a link-grammar - a grammar of link types.)

linas commented 7 years ago

ncludes the API user-data of this callback function,

OK, I understand what you are doing, now. I guess its OK.

linkgrammar_clear_errors()

Yeah, that's probably needed.

ampli commented 7 years ago

Hi Linas, My question is whether I still need to modify anything before I send a PR. I propose to incorporate in its current form, while declaring it as "experimental API" so we can improve it if needed without a need to issue major versions due possibble incompatible changes in it.

Here is the current status of the new error handling code: It is almost done - only some polishing is needed. Things that I changed:

Current C API: 1) lg_error_handler lg_error_set_handler(lg_error_handler, void *data); Set error handler. Return the previous one. On first call it returns the default handler. If the error handler is set to NULL, the messages are just queued. If data is not NULL, it is an (int *) severity_level. Messages with <= this level are printed to stdout. The default is to print Debug and lower to stdout. For custom error handler it can be of course a more complex user-data.

2) const void *lg_error_set_handler_data(void * data); Return the current error handler user data. (This is the function I added. It is needed for correct language bindings.)

3) char *lg_error_formatmsg(lg_errinfo *lge); Format the argument message. It adds "link-grammar" and severity.

4) int lg_error_printall(lg_error_handler, void *data); Print all the queued error messages and clear the queue. Return the number of messages.

5) int lg_error_clearall(void); Clear the queue of error messages. Return the number of messages.

6) int prt_error(const char *fmt, ...); It is already in the API, but I changed it to return int so it can be used in complex macros (which make it necessary to use the comma operator). prt_error() still gets the severity label as a message prefix (the default is lg_None, i.e. a plain message). However, the enum severity code can be specified with err_msg(). When both are specified, the label takes precedence. All of these ways have their use in the code. The code also supports issuing a message in parts. It is collected until its end and issued as one complete message (issuing an embedded newline is supported and used). EDIT: An incompatible change I had to introduce is adding a newline at end of messages, because I removed the implicit newline. But it turns out that many of existing prt_error() already include a terminating newline.

Also the following is in the public API:

typedef enum
{
    lg_Fatal = 1,
    lg_Error,
    lg_Warn,
    lg_Info,
    lg_Debug,
    lg_Trace,
    lg_None
} lg_error_severity;

/* Raw error message. */
typedef struct
{
    /* err_ctxt ec; */
    lg_error_severity severity;
    const char *severity_label;
    const char *text;
} lg_errinfo;

lgdebug() usually uses the severity level lg_Trace (but sometimes lg_Debug or lg_Info).

Regarding lg_errinfo I was not sure whether to expose it, or just provide several functions to extract its components (currently 3 components, but more can be added, like file, line-number, error context, and more). Another possibility is to provide a printf-like format string to lg_error_formatmsg(), like (actually a more complex format is needed): lg_error_formatmsg("%{libname} %{seveirty}%{message}"). Later things like filename etc. can be added. For now I decided to expose it.

Current Python API:

I also created Python bindings for the new error facility, include a test case (which is not simple). Most of the added functions are static methods of a new class LG_Error. This class also serves as the root exception class for the library.

LG_Error.set_handler()
LG_Error.printall()
LG_Error.clearall()
LG_Error.message()     # prt_error()
errinfo.formatmsg()       # errinfo is the first argument of the error handler

I started to write A C GUI (GTK+ based, with a minimal change in the current link-parser program) that needs an error facility like the one I added, and can serve as a test if it is adequate for the job. However, I first intend to implement #420 (and maybe more) before I return to it.

No other language bindings yet, but there is nothing to hurry since the new error facility is totally compatible to the current code (absolutely no changes in link-parser). EDIT: Now that I check it I see I had to add a newline at the end of 5 prt_error() messages.

Internally (defined in error.c) the error facility data structure is:

static TLS struct
{
    lg_error_handler handler;
    void *handler_data;
    lg_errinfo *errmsg;
} lg_error = { default_error_handler };

TLS is defined by configure to the keyword for Thread Local Storage (or nothing if it is not supported).

ampli commented 7 years ago

I completed these change (improve error reporting). It also contains Python bindings and test suite extensions. I had to inspect all the error printouts in the library, and fixed a large number of them.

Among the fixes:

Features of the improve error reporting facility:

I propose the following line in ChangeLog:

I also can add a file with summarizes the internal API of the new error facility (a short version of my message above).

I would like to send this change as PR right away, because I already had to handle several merge conflicts due to you recent changes. (The strangest one was on Changelog, which I haven't changed yet).

P.S. I already have the spell correction (issue #404) branch , which includes some important fixes. I need your input on my posts there so I will be able to send the good parts of it before heavy merge conflicts will happen. I also have almost ready the "getword" cranch (issue #420) that I would like to send next. (The empty-word elimination is not ready yet - I'm now redoing my SAT-parser fixes which were not good enough - they give the same parses even though they are not really correct.)