opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1
388 stars 119 forks source link

Converting cost from string (and also change `cost` from double to float) #1231

Open ampli opened 3 years ago

ampli commented 3 years ago

Just now it is done in strtodC() by invokingstrtof_l() with the C locale. However, if there are systems that don't have this function (those not having locale_t and at least SunOS). These systems depend on setlocale(LC_NUMERIC, "C") in dictionary_six_str(), a thing that is not a good idea to have (because it changes the user locale setup in the program that uses the library and may potentially alter printing/reading numbers in the user's program).

One way to solve this problem is to find out if the current locale uses , as a decimal separator (e.g. when opening the dict), and if so replace the period in the cost string by ,. Then strtof() can be safely used.

I chose another way: Writing a function to directly convert strings to float. My implementation handles numbers from 0.001 to 99.9999 and doesn't handle the exponent notation (it actually accepts more than 4 digits after the decimal point but doesn't convert them). It doesn't use float multiplication, and even not integer multiplication (checked in the optimized code). As a result, it is 3 times faster than the corresponding system function. But since the current dictionaries don't contain really many fractional costs, the speedup is negligible.

My question is whether it is OK to use such a conversion function - that only gets "scaled integer" strings (i.e. w/o an exponent) and ignores (besides syntax checking) more than 4 digits after the decimal point.

In the same branch, I also converted all the remaining double cost to float cost, besides the user API, which can be left as is, and a small speedup resulted.

linas commented 3 years ago

For an ugly problem, this seems like a good solution.

I vaguely recall that, some years ago, I found a solution to this problem by using some obscure C++ function, but I don't recall any details.

ampli commented 3 years ago

setlocale(LC_NUMERIC, "C") in dictionary_six_str(), a thing that is not a good idea to have

I had a problem in eliminating it, because it also controls printing (and numbers could be printed then with a comma). This could be solved by using snprintf_l(), but the *print_l() functions are not yet universal (e.g. missing in Linux). So there is no point, for now, to replace strtof() with a locale-independent one. and I will remove it from the planned patch.

ampli commented 2 years ago

but the *print_l() functions are not yet universal (e.g. missing in Linux)

Since now costs are converted to strings only using cost_stringify(), it will be enough to make this function locale-independent. Along with my solution for locale-independent conversion of strings to float it will finally solve the problem.

ampli commented 2 years ago

I'm now trying to prepare my floating-point handling branch for a PR. Among other things, I would like to eliminate any double-precision conversion.

The primary motivation for using float instead of double was memory footprint. However, small processors may not have hardware double-precision, so using double-precision variables/literals may slow floating-point operation by much, and requires the loader to pull in a double-precision library.

So I would like to change double to float in the following:

double parse_options_get_disjunct_cost(Parse_Options opts)
void parse_options_set_disjunct_cost(Parse_Options opts, double dummy)
double linkgrammar_get_dict_max_disjunct_cost(Dictionary dict)
double linkage_disjunct_cost(const Linkage linkage)
double sentence_disjunct_cost(Sentence sent, LinkageIdx I)

This is a source-code compatible API change so I think there is no problem with it.

On the same occasion (in this branch or probably in another one), I would like to change the function signature of the parse_options_set_*() calls to return a value when the set fails, to be able to return an error indication, e.g. when calling parse_options_set_spell_guess when spell-correction is not supported. (I also thought of normally returning the previous value, with a special error indication according to the value's type, but this is more complicated and doesn't add any feature).

BTW, we could use a scaled-integer cost and eliminate the use of floating-point at all. This would allow using short int as the cost type, saving some memory, and running faster on processors without floating-point hardware support. (If this looks like a good idea, I would just add it as a low priority in my TODO list, since I now primarily try to release old/uncomplete branches as PRs.)

ampli commented 2 years ago

Comments to what I wrote above:

  1. An additional API with double: static inline double lg_exp_get_cost(const Exp* exp).
  2. These changes will not really eliminate double from the library, due to struct Resources_s. However, with slight changes, it is possible to use int there.
linas commented 2 years ago

This is a source-code compatible API change

It does break binary compatibility. However, I do not believe that there are any users that would be affected, so this is OK.

change the function signature of the parse_options_set_*()

Sure, go ahead.

we could use a scaled-integer cost

Yuck, no. All of my work on structure learning computes costs as doubles. The typical range of costs goes from extremes of -40.0 to about +60.0 and requires 4-6 decimal places, which is at the edge of what can be squeezed into a 16-bit short. The typical costs that LG would see are in a narrower range, and require less precision, so, in theory, I guess it's possible to use shorts for costs, but .. yeech. I don't see a pressing need for this.