opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

link-generator defaults to Lithuanian #1499

Open ryandesign opened 4 months ago

ryandesign commented 4 months ago

I'm using link-grammar for the first time, without knowing anything about it or having read much of the documentation. I built version 5.12.4 from source on macOS 12.

I ran link-generator with no arguments and it said:

% link-generator
#
# Corpus for language: "lt"
# Sentence length: 6
# Requested number of linkages: 500
# Requested number to print: 20
link-grammar: Info: Dictionary found at /opt/local/share/link-grammar/lt/4.0.dict
link-grammar: Info: lt: Spell checker disabled.
# Dictionary version 5.11.0
# Number of categories: 431
# Linkages found: 141388
# Linkages generated: 389
# Number of unused disjuncts: 1438
#
LEFT-WALL au =ga : gyvename namo .
LEFT-WALL au =gtume ; einam namo !
LEFT-WALL au =gsiu : einame namo !
LEFT-WALL au =gai ; gyvename namo !
LEFT-WALL au =gdavo , einam namo .
LEFT-WALL au =gi ; einame namo !
LEFT-WALL au =game ; gyvename namo ?
LEFT-WALL au =gtų ; einame namo ?
LEFT-WALL au =gs : gyvenam namo !
LEFT-WALL au =gtume , einam namo ?
LEFT-WALL au =gs ; einam namo .
LEFT-WALL au =gam ; einam namo .
LEFT-WALL au =ga ; einam namo .
LEFT-WALL au =ga : gyvename namo ?
LEFT-WALL au =gate , einame namo .
LEFT-WALL au =gdavai , einame namo !
LEFT-WALL au =gtų , gyvename namo ?
LEFT-WALL au =gs : einam namo !
LEFT-WALL au =gi ; einame namo .
LEFT-WALL au =gdavome , einame namo !
# Bye.

Language "lt" is Lithuanian, yes? It surprises me to see the program default to Lithuanian when I am located in the United States with typical English locale settings:

% locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

% link-generator --version
Version: link-grammar-5.12.4
Compiled with: /usr/bin/clang __VERSION__="Apple LLVM 13.0.0 (clang-1300.0.29.30)"  
OS: darwin21.6.0 __APPLE__ __MACH__ 
Standards: __STDC_VERSION__=201112L 
Configuration (source code):
    CPPFLAGS=-I/opt/local/include -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk
    CFLAGS=-D_DEFAULT_SOURCE -std=c11 -D_BSD_SOURCE -D_SVID_SOURCE -D_GNU_SOURCE -D_ISOC11_SOURCE -fvisibility=hidden -pipe -Os -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -arch x86_64
Configuration (features):
    DICTIONARY_DIR=/opt/local/share/link-grammar
    -DPACKAGE_NAME="link-grammar" -DPACKAGE_TARNAME="link-grammar" -DPACKAGE_VERSION="5.12.4" -DPACKAGE_STRING="link-grammar 5.12.4" -DPACKAGE_BUGREPORT="https://github.com/opencog/link-grammar" -DPACKAGE_URL="https://opencog.github.io/link-grammar-website" -DPACKAGE="link-grammar" -DVERSION="5.12.4" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=".libs/" -DYYTEXT_POINTER=1 -DHAVE_STRNDUP=1 -DHAVE_STRTOK_R=1 -DHAVE_SIGACTION=1 -DHAVE_ALIGNED_ALLOC=1 -DHAVE_POSIX_MEMALIGN=1 -DHAVE_ALLOCA_H=1 -DHAVE_ALLOCA=1 -DHAVE_FORK=1 -DHAVE_VFORK=1 -DHAVE_WORKING_VFORK=1 -DHAVE_WORKING_FORK=1 -D__STDC_FORMAT_MACROS=1 -D__STDC_LIMIT_MACROS=1 -DTLS=_Thread_local -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_VISIBILITY=1 -DHAVE_LOCALE_T_IN_XLOCALE_H=1 -DHAVE_XLOCALE_H=1 -DHAVE_STDATOMIC_H=1 -DHAVE_MKLIT=1 -DUSE_SAT_SOLVER=1 -DUSE_WORDGRAPH_DISPLAY=1 -DHAVE_SQLITE3=1 -DHAVE_HUNSPELL=1 -DHUNSPELL_DICT_DIR="/Library/Spelling" -DHAVE_EDITLINE=1 -DHAVE_WIDECHAR_EDITLINE=1 -DHAVE_REGEX_H=1 -DHAVE_REGEXEC=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1
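
The expectation described above (the default language follows the user's locale) could be sketched roughly as below. This is only an illustration, not link-generator's actual code: the helper name `language_from_locale`, the `available` tuple, and the `"en"` fallback are all assumptions made for the example. It checks `LC_ALL`, `LC_MESSAGES`, and `LANG` in the usual POSIX precedence order and keeps the language part of the first usable value.

```python
import os


def language_from_locale(env=None, available=("en", "lt"), fallback="en"):
    """Hypothetical helper: pick a dictionary language from locale variables.

    `available` and `fallback` are illustrative assumptions, not a list of
    the dictionaries link-grammar actually ships.
    """
    env = os.environ if env is None else env
    # POSIX precedence: LC_ALL overrides LC_MESSAGES, which overrides LANG.
    for var in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(var, "")
        if not value:
            continue  # unset or empty, e.g. "LC_ALL="
        # "en_US.UTF-8" -> "en"; strip codeset, territory, and modifier.
        code = value.split(".")[0].split("_")[0].split("@")[0].lower()
        if code in available:
            return code
        # Values such as "C" or "POSIX" name no language; try the next variable.
    return fallback


if __name__ == "__main__":
    print(language_from_locale())
```

With the locale settings shown above (`LC_ALL` empty, the rest `en_US.UTF-8`), this sketch would select `en`. Whether that is a sensible default for link-generator in its current state is a separate question.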

ampli commented 4 months ago

You're right. However, in link-generator's current state, lt may be its most useful language because its dictionary is small. en is currently extremely slow for sentences with more than a few words. In the discussion section (or maybe in an issue), @linas suggested speeding it up with disjunct sampling. Efficiency fixes are needed too, and I still need to implement most of that. link-generator also lacks a useful API. Suggestions are welcome.