opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

The `vn` dictionary #1202

Open ampli opened 3 years ago

ampli commented 3 years ago

I wondered why this dict has only 5 example sentences, none of which can be fully parsed. The project at this URL was recently transferred to https://bitbucketarchive.softwareheritage.org/projects/ng/ngocminh/lienkate.html, from which I was able to restore it. And indeed, data/vn/linkgrammar.vi.txt comes from there.

Interestingly, the site also has an article in Vietnamese on Vietnamese-to-English translation using Annotated Disjuncts, which apparently refers to a more advanced dictionary (more connector names are mentioned in its text and appear in its diagrams).

On Wikipedia, there is a diagram of "Bữa tiệc hôm qua là một thành công lớn". The words "Bữa" (and "bữa") and "hôm", which appear linked in that diagram, are not found in data/vn/4.0.dict (nor in the original one from the archive).

I found the diagram in the English article describing this work: https://ictmag.vn/ict/article/view/328/pdf (hard to read, apparently a photocopy). But since the diagram on Wikipedia is in text, I guess it was taken from a text version of this article. I hoped to find more examples there that could be copied to the corpus file, but I could not find that version.

EDIT: The article in the references of the link-grammar Wikipedia page also describes a more advanced dictionary.

ampli commented 3 years ago

(Continuing the discussion from PR #1201.)

Beats me. I looked at the second sentence: "tôi mua một bông hoa" and two of the four words are not in the dictionary. Given that the dictionary is of a reasonable size...
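A quick way to repeat this check for any sentence is to collect the word entries from the dictionary text and report the tokens that have none. The following is a minimal Python sketch, not part of link-grammar: `dict_words` and `missing_words` are hypothetical helper names, and the parsing assumes the simple `word ...: expression;` entry layout with `%` comments, skipping macros (`<...>`) and word-file references (`/...`). It ignores everything else in the dictionary format.

```python
import re


def dict_words(dict_text):
    """Collect the words defined in an LG-style dictionary text.

    Assumes entries of the form "word1 word2: EXPRESSION;" and "%" comments.
    """
    text = re.sub(r"%.*", "", dict_text)  # drop comments to end of line
    words = set()
    for entry in text.split(";"):
        head, sep, _expr = entry.partition(":")
        if not sep:
            continue  # no ":" means this is not a word entry
        for token in head.split():
            if token.startswith(("<", "/", "#")):
                continue  # macro, word-file reference, or directive
            # Assumption: "word.subscript" is stored under the bare word.
            words.add(token.partition(".")[0] or token)
    return words


def missing_words(sentence, words):
    """Return the whitespace-separated tokens that have no dictionary entry."""
    return [tok for tok in sentence.split() if tok not in words]
```

With a checkout of the repository, passing the contents of data/vn/4.0.dict (read as UTF-8) to `dict_words` and then calling `missing_words("tôi mua một bông hoa", words)` would list the words that lack an entry. The lookup is case-sensitive, so "Bữa" and "bữa" count as different words.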

The various articles that I was able to find use connectors that are not in the current dict, so I suspect this is not the final dict of this project.

ampli commented 3 years ago

The Vietnamese dict came from here: https://www.researchgate.net/publication/287444370_Parsing_complex_-_compound_sentences_with_an_extension_of_Vietnamese_link_parser_combined_with_discourse_segmenter

A copy of this article (same publication) is accessible through the link on the link-grammar Wikipedia page. Interestingly, the connector strings used in it don't resemble the ones in the current dict at all. Also, many of these strings are not compatible with the LG C implementation, e.g.:

Table 1. Linking formulas of the words

| Word | Formula |
|------|---------|
| tôi | SV+ |
| mua | SV- & {O+} |
| một | McNt+ |
| bông | NcNt3+ |
| hoa | O- & {McNt- &NcNt3-} |
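To make the incompatibility concrete, here is a small Python sketch that flags the connector strings from the table above. The rule it applies is an assumption, a simplified reading of the LG connector syntax (optional `@`, optional `h`/`d`, one or more uppercase letters, then a subscript of lowercase letters, digits, or `*`, then `+` or `-`); it is not copied from the C code. `CONNECTOR_RE` and `invalid_connectors` are hypothetical names. Whether or not digits are accepted in subscripts, the result for these five entries is the same, because the rejected names have an uppercase letter after a lowercase one.

```python
import re

# ASSUMED simplification of the LG connector syntax, not taken from the C source:
# [@][h|d] UPPERCASE+ (lowercase | digit | '*')* ('+' | '-')
CONNECTOR_RE = re.compile(r"^@?[hd]?[A-Z]+[a-z0-9*]*[+-]$")

# Formulas as they appear in Table 1 of the article.
FORMULAS = {
    "tôi": "SV+",
    "mua": "SV- & {O+}",
    "một": "McNt+",
    "bông": "NcNt3+",
    "hoa": "O- & {McNt- &NcNt3-}",
}


def invalid_connectors(formula):
    """Return the connectors in a formula that fail the assumed syntax rule."""
    # Split on whitespace, operators, and brackets; keep tokens ending in + or -.
    tokens = re.findall(r"[^\s&{}()\[\]]+", formula)
    connectors = [t for t in tokens if t[-1] in "+-"]
    return [c for c in connectors if not CONNECTOR_RE.match(c)]


for word, formula in FORMULAS.items():
    bad = invalid_connectors(formula)
    if bad:
        print(f"{word}: {', '.join(bad)}")
```

Under this rule, `SV` and `O` pass, while `McNt` and `NcNt3` are flagged.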