opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

link-generator -l en uses unexpectedly high memory and CPU #1501

Open ryandesign opened 4 months ago

ryandesign commented 4 months ago

I've built link-grammar 5.12.4 from source on macOS 12. If I run link-generator -l en to specify English (it defaults to Lithuanian; see #1499), it appears to hang, using 100% CPU while consuming all available memory. I cancelled it after about a minute, by which point it had consumed over 15 GB of memory (my computer has 16 GB of real RAM) and had started allocating ever more swap space.

% time link-generator -l en
#
# Corpus for language: "en"
# Sentence length: 6
# Requested number of linkages: 500
# Requested number to print: 20
link-grammar: Info: Dictionary found at ./data/en/4.0.dict
link-grammar: Info: en: Spell checker disabled.
# Dictionary version 5.12.4
# Number of categories: 1719
^C
link-generator -l en  38.45s user 18.57s system 97% cpu 58.379 total

In case it's relevant, I configured with --disable-pcre2 --with-regexlib=c, since the build fails when using pcre2 (#1495).

% link-generator --version 
Version: link-grammar-5.12.4
Compiled with: /usr/bin/clang __VERSION__="Apple LLVM 13.0.0 (clang-1300.0.29.30)"  
OS: darwin21.6.0 __APPLE__ __MACH__ 
Standards: __STDC_VERSION__=201112L 
Configuration (source code):
    CPPFLAGS=-I/opt/local/include -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk
    CFLAGS=-D_DEFAULT_SOURCE -std=c11 -D_BSD_SOURCE -D_SVID_SOURCE -D_GNU_SOURCE -D_ISOC11_SOURCE -fvisibility=hidden -pipe -Os -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -arch x86_64
Configuration (features):
    DICTIONARY_DIR=/opt/local/share/link-grammar
    -DPACKAGE_NAME="link-grammar" -DPACKAGE_TARNAME="link-grammar" -DPACKAGE_VERSION="5.12.4" -DPACKAGE_STRING="link-grammar 5.12.4" -DPACKAGE_BUGREPORT="https://github.com/opencog/link-grammar" -DPACKAGE_URL="https://opencog.github.io/link-grammar-website" -DPACKAGE="link-grammar" -DVERSION="5.12.4" -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=".libs/" -DYYTEXT_POINTER=1 -DHAVE_STRNDUP=1 -DHAVE_STRTOK_R=1 -DHAVE_SIGACTION=1 -DHAVE_ALIGNED_ALLOC=1 -DHAVE_POSIX_MEMALIGN=1 -DHAVE_ALLOCA_H=1 -DHAVE_ALLOCA=1 -DHAVE_FORK=1 -DHAVE_VFORK=1 -DHAVE_WORKING_VFORK=1 -DHAVE_WORKING_FORK=1 -D__STDC_FORMAT_MACROS=1 -D__STDC_LIMIT_MACROS=1 -DTLS=_Thread_local -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_VISIBILITY=1 -DHAVE_LOCALE_T_IN_XLOCALE_H=1 -DHAVE_XLOCALE_H=1 -DHAVE_STDATOMIC_H=1 -DHAVE_MKLIT=1 -DUSE_SAT_SOLVER=1 -DUSE_WORDGRAPH_DISPLAY=1 -DHAVE_SQLITE3=1 -DHAVE_HUNSPELL=1 -DHUNSPELL_DICT_DIR="/Library/Spelling" -DHAVE_EDITLINE=1 -DHAVE_WIDECHAR_EDITLINE=1 -DHAVE_REGEX_H=1 -DHAVE_REGEXEC=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1
ampli commented 4 months ago

Here it is on a fast Linux machine (high-end i9 with 64GB memory). I added --verbosity=2 to see what it is doing.

% time link-generator -l en --verbosity=2
#
# Corpus for language: "en"
# Sentence length: 6
# Requested number of linkages: 500
# Requested number to print: 20
link-grammar: Info: Dictionary found at ./data/en/4.0.dict
link-grammar: Info: en: Spell checker disabled.
# Dictionary version 5.12.4
# Number of categories: 1719
#### Finished tokenizing (8 tokens)
++++ Finished expression pruning                 0.09 seconds
#### Creating a wild-card word disjunct list
#### Finished tokenizing (3 tokens)
++++ Finished creating list: 3555391 disjuncts   13.08 seconds
++++ Built disjuncts                             7.88 seconds
++++ Eliminated duplicate disjuncts             25.20 seconds
++++ Encoded for pruning                         1.57 seconds
++++ power pruned (for 0 nulls)                  1.77 seconds
++++ Built mlink_table                           0.02 seconds
++++ power pruned (for 0 nulls)                  0.15 seconds
++++ pp pruning                                  0.01 seconds
++++ Encoded for parsing                         0.14 seconds
++++ Initialized fast matcher                    0.85 seconds
++++ Counted parses (2147483647 w/0 nulls)      47.71 seconds
++++ Built parse set                            28.66 seconds
++++ Postprocessed all linkages                  0.77 seconds
++++ Sorted all linkages                         0.00 seconds
++++ Finished parse                              0.00 seconds
# Linkages found: 2147483647
# Linkages generated: 201
# Number of unused disjuncts: 3283842
#
LEFT-WALL 'twas it ? report ? why RIGHT-WALL
LEFT-WALL apparently thrust . what ؟ how RIGHT-WALL
LEFT-WALL soon after questions , about aboard RIGHT-WALL
LEFT-WALL how much MP ! what am RIGHT-WALL
LEFT-WALL yo gettin' proved . left commanded RIGHT-WALL
LEFT-WALL to quit so : what am RIGHT-WALL
LEFT-WALL to decide . … less . RIGHT-WALL
LEFT-WALL [ 80 read . smell RIGHT-WALL RIGHT-WALL
LEFT-WALL handy when prevaricate pre-requisites -- forgotten RIGHT-WALL
LEFT-WALL having said so reeked , ain't RIGHT-WALL
LEFT-WALL and/or going wrote ! the_fuck … RIGHT-WALL
LEFT-WALL ’tis that known ؟ diabetes proved RIGHT-WALL
LEFT-WALL so could ; wish 〈 appearing RIGHT-WALL
LEFT-WALL so apparently asked ; didn’t ? RIGHT-WALL
LEFT-WALL ask to ask so tend ? RIGHT-WALL
LEFT-WALL what am ! wasn't this acting RIGHT-WALL
LEFT-WALL typically proclaimed 、 be . know RIGHT-WALL
LEFT-WALL to LEFT-WALL … Romania AJ v.v. RIGHT-WALL
LEFT-WALL handy did : have the_same to RIGHT-WALL
LEFT-WALL … some_more . felt … RIGHT-WALL RIGHT-WALL
# Bye.
128.35u 8.16s 2:17.25e 17758976Kb

You can see that it took about 136 seconds of CPU time, so it seems you just need to wait a while longer. If something does get stuck, we will be able to see at which step (thanks to the verbosity setting, which can be increased for more detail).

ryandesign commented 4 months ago

Thank you, you're right, it did complete after around five minutes.

% time link-generator -l en --verbosity=2
#
# Corpus for language: "en"
# Sentence length: 6
# Requested number of linkages: 500
# Requested number to print: 20
link-grammar: Info: Dictionary found at ./data/en/4.0.dict
link-grammar: Info: en: Spell checker disabled.
# Dictionary version 5.12.4
# Number of categories: 1719
#### Finished tokenizing (8 tokens)
++++ Finished expression pruning                 0.13 seconds
#### Creating a wild-card word disjunct list
#### Finished tokenizing (3 tokens)
++++ Finished creating list: 3555391 disjuncts   19.82 seconds
++++ Built disjuncts                            24.92 seconds
++++ Eliminated duplicate disjuncts             58.10 seconds
++++ Encoded for pruning                        10.31 seconds
++++ power pruned (for 0 nulls)                  2.10 seconds
++++ Built mlink_table                           0.01 seconds
++++ power pruned (for 0 nulls)                  0.25 seconds
++++ pp pruning                                  0.01 seconds
++++ Encoded for parsing                         0.09 seconds
++++ Initialized fast matcher                    0.07 seconds
++++ Counted parses (2147483647 w/0 nulls)      93.41 seconds
++++ Built parse set                            51.61 seconds
++++ Postprocessed all linkages                  0.96 seconds
++++ Sorted all linkages                         0.00 seconds
++++ Finished parse                              0.01 seconds
# Linkages found: 2147483647
# Linkages generated: 226
# Number of unused disjuncts: 3283842
#
LEFT-WALL why asked all . continue . RIGHT-WALL
LEFT-WALL so could ? did . let_go RIGHT-WALL
LEFT-WALL kept voted ‽ what . why RIGHT-WALL
LEFT-WALL got needed . getting ⅜ tonight RIGHT-WALL
LEFT-WALL why be told , ain't goin' RIGHT-WALL
LEFT-WALL keep proven so how . not_enough RIGHT-WALL
LEFT-WALL … around any 1 and_a_half ready RIGHT-WALL
LEFT-WALL determine — gettin' forward . continued RIGHT-WALL
LEFT-WALL diabetes proved : don’t take so RIGHT-WALL
LEFT-WALL which became said so daren't ‽ RIGHT-WALL
LEFT-WALL why become told : gotten ؟ RIGHT-WALL
LEFT-WALL so know ? how ! when RIGHT-WALL
LEFT-WALL what keeps known so apparently quit RIGHT-WALL
LEFT-WALL `` wish 」 why propose to RIGHT-WALL
LEFT-WALL … appear so later . stuck RIGHT-WALL
LEFT-WALL which comes ask , daren’t . RIGHT-WALL
LEFT-WALL bet : asked once so later RIGHT-WALL
LEFT-WALL usually asked ‽ mad : may RIGHT-WALL
LEFT-WALL reply maybe so ... so envious RIGHT-WALL
LEFT-WALL who ain't . be whoever precisely RIGHT-WALL
# Bye.
link-generator -l en --verbosity=2  215.63s user 48.26s system 92% cpu 4:45.02 total

I guess if this is expected behavior then there is no bug. It seemed like a bug to me because with Lithuanian or German it returned results instantly. Using up all my memory and CPU for over a minute with no output seemed like an infinite loop allocating unbounded amounts of memory.

I had read in #1020 that the code originated in 1995 and that, while memory usage was a concern then, it isn't anymore on modern machines. It says there that machines with hundreds of megabytes of memory or less would be memory-constrained, so I didn't think that my 16 GB machine would be. I had also seen this comment in the source reinforcing that:

https://github.com/opencog/link-grammar/blob/3a2612761f17e2579cb213a02aa1cef394f98763/link-grammar/utilities.c#L368-L372

ampli commented 4 months ago

I guess if this is expected behavior then there is no bug.

It is not a "bug" - it is due to a huge number of disjuncts. Most CPU time could be saved by randomly discarding most disjuncts before the "Eliminated duplicate disjuncts" step. Additional speed and usefulness could be added by an API to select desired features for the generated sentences (like ending punctuations only as the last word and which type and amount of punctuation in the middle). See below for an example of current available constraints. I have also numerous efficiency fixes that I would like to send (as time permits).

Using up all my memory and CPU for over a minute with no output seemed like an infinite loop allocating unbounded amounts of memory.

As you can see, it is not an infinite loop, as it eventually finishes (even if you try to generate longer sentences). My "time" output says it used 17758976 Kb of memory, about 17 GB. This is more memory than you have, which may cause page swapping (indicated by the 92% CPU you got, instead of the ~99.5% that I got).

Now for an example of using the four types of constraints that are currently implemented, demonstrating the speedup:

% echo 'This \* a te\* of \*.a sentence generation' | time link-generator -l en -s 0 --verbosity=2
#
# Corpus for language: "en"
# Requested number of linkages: 500
# Requested number to print: 20
link-grammar: Info: Dictionary found at ./data/en/4.0.dict
link-grammar: Info: en: Spell checker disabled.
# Dictionary version 5.12.4
# Number of categories: 1719
# Sentence template: This \* a te\* of \*.a sentence generation

#### Finished tokenizing (10 tokens)
++++ Finished expression pruning                 0.09 seconds
#### Creating a wild-card word disjunct list
#### Finished tokenizing (3 tokens)
++++ Finished creating list: 3555391 disjuncts   12.98 seconds
++++ Built disjuncts                             0.34 seconds
++++ Eliminated duplicate disjuncts              0.44 seconds
++++ Encoded for pruning                         0.28 seconds
++++ power pruned (for 0 nulls)                  0.09 seconds
++++ Built mlink_table                           0.00 seconds
++++ power pruned (for 0 nulls)                  0.00 seconds
++++ pp pruning                                  0.00 seconds
++++ power pruned (for 0 nulls)                  0.00 seconds
++++ Built mlink_table                           0.00 seconds
++++ power pruned (for 0 nulls)                  0.00 seconds
++++ Encoded for parsing                         0.00 seconds
++++ Initialized fast matcher                    0.00 seconds
++++ Counted parses (361757657 w/0 nulls)        0.01 seconds
++++ Built parse set                             0.00 seconds
++++ Postprocessed all linkages                  0.02 seconds
++++ Sorted all linkages                         0.00 seconds
++++ Finished parse                              0.00 seconds
# Linkages found: 361757657
# Linkages generated: 382
# Number of unused disjuncts: 3555258
#
LEFT-WALL this ? a tenth of northwest sentence generation RIGHT-WALL
LEFT-WALL this ? a tenth of ready sentence generation RIGHT-WALL
LEFT-WALL this helps a tenth of north sentence generation RIGHT-WALL
LEFT-WALL this . a tenth of alone sentence generation RIGHT-WALL
LEFT-WALL this ? a tenth of northeast sentence generation RIGHT-WALL
LEFT-WALL this . a tenth of ready sentence generation RIGHT-WALL
LEFT-WALL this . a tenth of fifty-second sentence generation RIGHT-WALL
LEFT-WALL this persuaded a tenth of ready sentence generation RIGHT-WALL
LEFT-WALL this advised a tenth of fourty-ninth sentence generation RIGHT-WALL
LEFT-WALL this taught a tenth of seventy-ninth sentence generation RIGHT-WALL
LEFT-WALL this ‧ a tenth of twenty-eighth sentence generation RIGHT-WALL
LEFT-WALL this — a tenth of ready sentence generation RIGHT-WALL
LEFT-WALL this : a tenth of ready sentence generation RIGHT-WALL
LEFT-WALL this denied a tercentennial of egregious sentence generation RIGHT-WALL
LEFT-WALL this reminds a teahouse of enough sentence generation RIGHT-WALL
LEFT-WALL this advised a tentacle of all sentence generation RIGHT-WALL
LEFT-WALL this . a tee of circumspect sentence generation RIGHT-WALL
LEFT-WALL this , a technologist of all sentence generation RIGHT-WALL
LEFT-WALL this shouted a template of enough sentence generation RIGHT-WALL
LEFT-WALL this ? a telecast of all sentence generation RIGHT-WALL
# Bye.
14.65u 2.84s 0:17.59e 6374676Kb

The step "Finished creating list" too ~13 seconds (it can be optimized to almost nothing). So the actual generation took only ~4.5 seconds (code optimizations can drastically reduce it). The type of constraints that were used here:

  1. Desired word: the word itself.
  2. Any word: \*
  3. Desired word start: te\*
  4. Desired LG dictionary subscript: \*.a (a subscript can also be attached to types 1-3).

More attributes, like optional words, or the ability to specify a mix of attributes, may be added, but it seems it would be desirable to do this in a more general way. One such way is to add filtering plugins that specify attributes for any desired group of words (including individual words and the whole sentence).
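
One possible shape for such a plugin, sketched in C (purely hypothetical; link-grammar has no such API today): a callback that accepts or rejects a candidate word for a given sentence position, which the generator would consult before emitting a linkage:

#include <stdbool.h>
#include <string.h>

/* Hypothetical plugin signature: return true to allow 'word' at the
 * 0-based 'position' of the generated sentence. */
typedef bool (*word_filter_t)(unsigned position, const char *word,
                              void *user_data);

/* Example filter: allow sentence-ending punctuation only at the last
 * position (the constraint mentioned earlier in this thread). */
static bool end_punct_only_last(unsigned position, const char *word,
                                void *user_data)
{
    unsigned last = *(unsigned *)user_data;   /* index of last word */
    bool is_end_punct = strcmp(word, ".") == 0 ||
                        strcmp(word, "?") == 0 ||
                        strcmp(word, "!") == 0;
    return !is_end_punct || position == last;
}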

BTW, I think that the repetition of template-replacement words or word classes in the generated sentences may hint at a bug.

ryandesign commented 4 months ago

As you can see, it is not an infinite loop, as it eventually finishes

Right, I see that now. I was explaining what my impression was at the time that I filed the bug report.

linas commented 4 months ago

FYI, I have recently started making plans to rethink/rework/redo the generator and how generation works; however, I won't be able to start anything for months. The ideas are sufficiently vague that I can't explain them without a lot of effort. Something something more flexible word order.

I would like link-generator to use the same user interface as link-parser: one could type in "John * by the Red Sea.", have the generator fill in those blanks, and then get a prompt to do it again.

I would also like to be able to say something like "the first should be verb-like and the second should be adverb-like", or even "adverb-like denoting larger or bigger", but I haven't yet tried to imagine a detailed API for that ("adverb-like approximate synonym for bigger").

linas commented 4 months ago

So this comment:

More attributes, like optional words, or the ability to specify a mix of attributes, may be added, but it seems it would be desirable to do this in a more general way.

Yes. The following PDF gives a general feel and inspiration for the "more general way": https://kahanedotfr.files.wordpress.com/2017/01/mtt-handbook2003.pdf

It gives the general flavor. How to actually do it, fill in all the precise bits and pieces, and create something usable rather than theoretical: that's the hard part. I don't know.

ampli commented 4 months ago

I would like link-generator to use the same user interface as link-parser: one could type in "John * by the Red Sea.", have the generator fill in those blanks, and then get a prompt to do it again.

Now you can get a one-time (non-editable) prompt with link-generator -s 0. To get the link-parser user interface, we could convert link-generator into a function, and modify link-parser to call it when invoked with, say, --link-generator. In that case, it would internally switch to a variable table specific to link-generator (many variables may be common, like limit, cost-max, and more).
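
A rough sketch of that dispatch, in C; the function names, their stub bodies, and the flag handling are hypothetical, just to illustrate the proposed refactor:

#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the two real entry points. */
static int link_generator_main(int argc, char **argv)
{ (void)argc; (void)argv; puts("generator mode"); return 0; }

static int link_parser_repl(int argc, char **argv)
{ (void)argc; (void)argv; puts("parser mode"); return 0; }

int main(int argc, char **argv)
{
    /* When invoked with --link-generator, switch to the generator's
     * variable table and command loop; otherwise run the parser. */
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--link-generator") == 0)
            return link_generator_main(argc, argv);
    }
    return link_parser_repl(argc, argv);
}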

It gives the general flavor. How to actually do it, fill in all the precise bits and pieces, and create something usable rather than theoretical: that's the hard part. I don't know.

I also don't know how to jump to such sophistication. I meant a much more basic and simpler implementation (by several degrees), which MTT may inspire, perhaps implementing some MTT-like "easy" ideas.

I would also like to be able to say something like "the first should be verb-like and the second should be adverb-like", or even "adverb-like denoting larger or bigger", but I haven't yet tried to imagine a detailed API for that ("adverb-like approximate synonym for bigger").

For that (and for implementing an MTT interface), we need an external dictionary that lists the word's grammatical features. Or maybe a 4.0.pos dict that defines POS keywords with a DSL describing disjunct patterns.

linas commented 4 months ago

... jump to such sophistication ...

... we need an external dictionary ...

About five years ago, I spoke to someone who proudly proclaimed that their company does so well because they implemented MTT (as a proprietary product in their software stack). So there is certainly a "proof of concept".

The "biggest" idea in MTT is the "lexical function" LF in wikipedia and if you keep your nose to the grindstone, its not hard to figure out how to implement that, and it would even be fun. But ...

But of course one needs to have some "external dictionary", which means creating that dictionary. There are two ways to do this:

  1. The "classical", hand-crafted custom-built dictionary stored in a text file.
  2. A machine-learned dictionary.

As I hope is clear, the hand-crafted version will never do. That leaves option two: machine learning. But how? How would that work? I'm not sure. I think I know how to get there, but ...

But of course, there are two ways to machine-learn:

  A. The "classical" way: define exactly which structures should be learned (e.g. by reading the Wikipedia article, and using that to define fixed, explicit C structs or C++ classes), then design ML code to fill in those classes, learning data that fits exactly into those pre-defined classes.

  B. Machine-learn the data structure to use.

Of course, 2.A is relatively easy, but it is also a kind of dead end. Option B is harder. I'm pursuing option B. Let me explain ... (next post)

linas commented 4 months ago

I think I can bootstrap up to MTT. I've gone half-way up the bootstrap, and it works well, but ran into maintainability problems. The bootstrap is this:

I've done the above. It more-or-less works. The problem is that it runs as large batch jobs, taking days or weeks to run. The training framework is fragile. Bugs in the code means throwing away datasets and starting from scratch. Very tedious.

To fix that, I'm redesigning the pipeline, going back to square one, and creating a data-flow architecture instead of a batch process. But this is a lot of work. (I've got no help.) If/when I get this working, I can resume the journey:

I said "I think it'll work" because the above has been "done already" by assorted academics and grad students working in hand-crafted perl scripts and java monstrosities. As usual, that code is badly-written, obsolete, and can no longer be found on the internet, if it was ever published to begin with. What I'm doing is trying to create a single unified framework that can integrate this, "organically", instead of being a bunch of stove-piped java classes and perl scripts. However, doing this is ... hard. It's like that country-western song: "one step forward and two steps back, you can't get too far like that".

That's the plan. It made sense 5 or 10 years ago. Hidden under the surface are a bunch of other technical and philosophical and design and engineering questions, some meta-level questions, and I'm struggling with the meta-level questions now.

So when I say "read about MTT", I don't mean "jump in and start coding now", but more like "daydream about how that could be done", and more importantly, "what's wrong with brute-force MTT, aka option 1 or 2.A".

I dunno.

linas commented 4 months ago

@ampli Here's one more, for your horror and delight. The demo located here: https://github.com/opencog/sensory/blob/master/examples/filesys.scm creates a link-grammar-style API into a "sensori-motor" or "perception-action" environment. The possible "perceptions" are described as connectors, and so are the possible "actions". The "environment" is supposed to be the "external world", something outside of the "agent". For this demo, the external world is the unix file system, and the (eventual) goal of the demo is to build an agent that bounces around the file system, doing things. There's another one, for IRC chat.

For example, the unix "ls" command becomes a disjunct; in that disjunct are a couple of "+"-direction connectors that must connect in order for the command to run, and a "-" connector that must be connected to something that can accept the output of "ls" (which is a text stream).

If the agent wants to perform some action in this external world, it just needs to hook up appropriate stuff to each of the connectors, and then say "go". Data will then flow across the links. So each LG link becomes somewhat like a unix pipe, with data flowing one way or the other. The pipes are typed, to describe the kind of data that flows in them. That is, the LG connector rules must be obeyed in order to maintain compatibility of the data flowing in those pipes.
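
As a toy model of that idea (my sketch, not code from the sensory repo): a link is a unix pipe tagged with a type, and two endpoints may connect only if their connector types match:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* A typed link: a unix pipe plus a type tag describing the kind of
 * data flowing through it (e.g. "TXT" for a text stream). */
typedef struct {
    const char *type;
    int fd[2];          /* fd[0] = read end, fd[1] = write end */
} typed_link;

/* Mirror the LG matching rule: a "+" connector may join a "-"
 * connector only when the types agree. Returns 0 on success. */
static int connect_link(const char *plus_type, const char *minus_type,
                        typed_link *out)
{
    if (strcmp(plus_type, minus_type) != 0) return -1;
    out->type = plus_type;
    return pipe(out->fd);
}

int main(void)
{
    typed_link ln;
    if (connect_link("TXT", "TXT", &ln) == 0) {
        /* e.g. "ls" output would be written to ln.fd[1] and read
         * from ln.fd[0] by whatever accepts a text stream. */
        puts("TXT link established");
        close(ln.fd[0]);
        close(ln.fd[1]);
    }
    return 0;
}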

The demo is horribly ugly and complicated, because LG is not actually being used. Instead, the demo painfully dissects each disjunct (called a Section) and looks at each connector, one by one, and tries to manually hook stuff up to it. Ugh. The long-term plan is to use LG to do this hooking-together of things. However, I'm still very far away from that. All sorts of mundane engineering issues block the path.

There's also an IRC chatbot demo there; the only thing the chatbot does is echo. You can think of the chatbot as an LG parse of one word: LEFT-WALL the input, a processing node, and RIGHT-WALL the output. The processing node just copies from input to output, so it's an echobot.

Why am I mentioning this here? I envision having agents "do stuff" by hooking together these "pipes", using LG-style connector-type matching rules. Some of the processing stages could be wild-cards, and so link-generator would be used to find valid linkages, instead of link-parser.

Anyway, this is all still a daydream. An experiment in progress. Maybe doing things this way will turn out to be interesting and useful, and maybe it will turn out to be a dumb trick. Not sure. The demos work. I'm still a ways away from actually being able to use LG to perform the hookups.

ampli commented 4 months ago

Could you create an LG dictionary on the fly, parse (or generate) a "sentence", and then use the linkage? BTW, do links ever need to be crossed for command descriptions?

ampli commented 4 months ago

I said:

we need an external dictionary that lists the word's grammatical features.

By external dictionary, I meant something like a database derived from Wiktionary and other similar resources.

I would also like to be able to say something like "the first should be verb-like and the second should be adverb-like" or even "adverb-like denoting larger or bigger".

How would it know what "verb-like", "adverb-like", "adverb-like denoting larger", etc. mean? I don't even have any idea how it will know what a simple "verb" is, or what tenses are, unless you tell it or it uses deep learning.

linas commented 4 months ago

Could you create an LG dictionary on the fly, parse (or generate) a "sentence", and then use the linkage?

Being able to modify dictionaries on the fly is needed. Not a problem: the LG atomese backend already does this, with the "main" dictionaries being stored in the AtomSpace, and small subsets getting placed into the LG dict-ram dictionary. So dict-ram acts like a private LG cache: a window onto a larger dictionary (one that changes). Words are added to dict-ram only on an as-needed basis.
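
A minimal sketch of that as-needed caching pattern, in C; the names here are hypothetical, and the real dict-ram / AtomSpace interfaces differ:

#include <stdio.h>
#include <string.h>

#define CACHE_SIZE 64

/* An in-RAM cache entry: a word and its dictionary expression. */
typedef struct {
    char word[32];
    char expr[64];
} dict_entry;

static dict_entry cache[CACHE_SIZE];
static int cache_count = 0;

/* Stand-in for a query against the large backing dictionary (in the
 * atomese backend, that would be the AtomSpace). */
static const char *backend_lookup(const char *word)
{
    (void)word;
    return "A- & B+";   /* dummy expression, for illustration only */
}

/* Check the RAM cache first; on a miss, fetch from the backend and
 * cache the result, so words are added only on an as-needed basis. */
static const char *dict_lookup(const char *word)
{
    for (int i = 0; i < cache_count; i++)
        if (strcmp(cache[i].word, word) == 0)
            return cache[i].expr;

    if (cache_count == CACHE_SIZE) return NULL;   /* toy: no eviction */
    dict_entry *e = &cache[cache_count++];
    snprintf(e->word, sizeof e->word, "%s", word);
    snprintf(e->expr, sizeof e->expr, "%s", backend_lookup(word));
    return e->expr;
}

int main(void)
{
    printf("ball: %s\n", dict_lookup("ball"));
    return 0;
}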

BTW, do links ever need to be crossed for command descriptions?

Dunno. Maybe. Haven't gotten that far. Anyway, link-crossing is "not a problem": if you want A to cross B, then just write B+ or (A- & B+ & A+). Also FYI, it turns out that one of the weird operations in "categorial grammar" is just link-crossing. It was an oh-that's-interesting moment when I figured it out, because otherwise it was this weird, hard-to-understand thing. Details: https://github.com/opencog/atomspace/raw/master/opencog/sheaf/docs/ccg.pdf

linas commented 4 months ago

we need an external dictionary that lists the word's grammatical features.

By external dictionary, I meant something like a database derived from Wiktionary and other similar resources.

I wish to avoid Wiktionary, for multiple reasons. I'm interested in systems that learn this info on their own.

I would also like to be able to say something like "the first should be verb-like and the second should be adverb-like" or even "adverb-like denoting larger or bigger".

How would it know what "verb-like", "adverb-like", "adverb-like denoting larger", etc. mean?

Solutions range from cheap and easy to complicated and powerful. The dumb, cheap solution is just to use the existing UNKNOWN-WORD.v vs. UNKNOWN-WORD.a (the .v and .a subscripts mark verb-like and adjective-like dictionary entries). The fancy solutions won't fit in the margin of this page.

I don't even have any idea how it will know what a simple "verb" is, or what tenses are, unless you tell it or it uses deep learning.

Bingo. Except that there are many other kinds of learning besides deep learning. Deep learning is currently winning due to certain ease-of-scaling and ease-of-compute issues. I think some of the other learning algos could also scale, but perhaps DL has sucked all the air out of the room.