kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

Using nearley for Parsing Morphology? #181

Closed onechrisjones closed 6 years ago

onechrisjones commented 7 years ago

I'm working on an app to help linguists develop lexicons for undocumented languages, and I'm wondering whether nearley might be a good tool for parsing morphology.

I'm not an engineer or developer, not super technical... but maybe you guys would have some thoughts on the subject?

Thanks.

tjvr commented 7 years ago

I suspect not (feature structures are usually a good idea for linguistics things).

But it's hard to tell without more information about your problem. What does the morphology look like? What are you trying to parse? :)

onechrisjones commented 7 years ago

Well... we are trying to parse ANY linguistic data, as this app would ideally work for any language. We want to develop a pretty simple (HTML/JS) UI that linguists could use with minimal training to 'train a parser' as they define stems, roots, and affixation (and so on, depending on the features of their particular language) in their lexicon. Does that make sense?

kach commented 7 years ago

It took me a moment to understand, but I think you're talking about parsing within words rather than parsing sentences — so you might parse "unlockable" as the structure un- ((lock) -able) made up of the morphemes un, lock, and able. Do I understand you correctly?

I don't know enough about linguistics, and haven't studied enough different languages to be really well-informed about this. However, two things I can tell you:

First: If your languages have written forms and crazy historical spelling rules (ahem, English) then this gets kind of prickly with any computational technique, if only because of all the special cases. However, if you are transcribing a traditionally spoken language (as I imagine field linguists tend to do), then I think nearley would work very well for analyzing word morphology. In fact, the nearley grammar itself would be quite simple.

Second: Among JS parsers, nearley is probably the best choice because it handles ambiguity correctly, which seems like an important feature for linguistic exploration (if a certain word could be parsed in two different ways, then nearley will give you both ways — very few other parsers know how to do this).

Again, my only experience with computational linguistics is from a couple of introductory textbooks, so I'm afraid I can't really give more details without more information on what you are trying to do.
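To make the ambiguity point concrete, here is a minimal sketch in plain JavaScript. It does not use nearley's API; it only illustrates what "giving you both ways" means for the "unlockable" example above. The three attachment rules, the `CATEGORY` table, and the `analyses` function are assumptions invented for illustration, not part of nearley or of the app discussed in this thread.

```javascript
// Toy rules (assumptions for illustration only):
//   un-  + V  -> V    ("unlock")
//   un-  + A  -> A    ("un-lockable", i.e. not lockable)
//   V + -able -> A    ("lockable")
const CATEGORY = new Map([["lock", "V"]]); // hypothetical one-word lexicon

// Return every analysis of morphemes[i..j) as a { cat, tree } object.
function analyses(morphemes, i = 0, j = morphemes.length) {
  if (j - i === 1) {
    const m = morphemes[i];
    return CATEGORY.has(m) ? [{ cat: CATEGORY.get(m), tree: m }] : [];
  }
  const out = [];
  // Prefix un- attaches to the rest of the span if that is a V or an A.
  if (morphemes[i] === "un") {
    for (const r of analyses(morphemes, i + 1, j)) {
      if (r.cat === "V" || r.cat === "A") {
        out.push({ cat: r.cat, tree: ["un-", r.tree] });
      }
    }
  }
  // Suffix -able attaches to the preceding span if that is a V.
  if (morphemes[j - 1] === "able") {
    for (const l of analyses(morphemes, i, j - 1)) {
      if (l.cat === "V") {
        out.push({ cat: "A", tree: [l.tree, "-able"] });
      }
    }
  }
  return out;
}

const results = analyses(["un", "lock", "able"]);
console.log(JSON.stringify(results.map((r) => r.tree)));
// prints [["un-",["lock","-able"]],[["un-","lock"],"-able"]]
```

Both bracketings come back: "not able to be locked" and "able to be unlocked". A parser that stops at the first match would silently hide one of them from the researcher.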

onechrisjones commented 7 years ago

So, yes... we want a computational tool that automatically produces a morphological analysis for a given word form.

Admittedly, there are many pieces to the whole 'parsing puzzle', but we want a tool that will give researchers 'good direction' and help them make educated guesses, instead of just doing all the work for them (that is probably near impossible...).

We need to be able to train a parser to recognize derivation and inflection.

There are different kinds of affixation in different languages: -infix-, -suffix, prefix-, reduplication, 'null' morphemes (as the 'absence' of a feature might be a feature in itself)...

Many of the languages WILL have established alphabets/orthographies with gnarly historical rules... some researchers will actually be using our app to help them learn the languages, so they can DEVELOP such orthographies... these researchers will be entering IPA... so our parsing tool will need to handle that...

This will be a tool that helps researchers build grammars for the crazy languages they might be working on, and discover the features of that language as well.

A morphological parser would have to take Unicode input,

It would thus need to 'know'

The parser would then need to:

Note, though... that the parser doesn't just do all this junk automatically, by itself... rather, it does it in conjunction with the researcher... the researcher would need a UI to be able to train the parser to parse based on rules and tags the linguist adds.

To quote an SIL paper:

properly using and controlling the constraints is the major task in implementing a parser for a given language. Since a morphological parser must model linguistic reality, it is a good idea to use constraints that model appropriate linguistic notions. Two major concepts for morphology are morphotactics and morphophonemics. Morphotactics deal with what morphemes can co-occur with what other morphemes. Morphophonemics deal with what shape a given morpheme will have in various phonological and morphological environments.
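The two kinds of constraint in that quote can be sketched in a few lines of plain JavaScript. This is a toy illustration only, not how FLEx or nearley implement them: the `PREFIX* ROOT SUFFIX*` template, the in-/im- rule, and both function names are assumptions chosen because they are easy to check against English.

```javascript
// Morphotactics: which morphemes may co-occur, and in what order.
// Assumed toy template: any number of prefixes, one root, any number of suffixes.
function obeysTemplate(slots) {
  return /^(prefix )*root( suffix)*$/.test(slots.join(" "));
}

// Morphophonemics: what shape a morpheme takes in a given environment.
// Assumed toy rule: the negative prefix surfaces as "im" before p, b, or m,
// and as "in" elsewhere.
function surfaceNegative(stem) {
  return (/^[pbm]/.test(stem) ? "im" : "in") + stem;
}

console.log(obeysTemplate(["prefix", "root", "suffix"])); // prints true
console.log(obeysTemplate(["suffix", "root"]));           // prints false
console.log(surfaceNegative("possible"));                 // prints impossible
console.log(surfaceNegative("active"));                   // prints inactive
```

In a real tool both the template and the shape rules would be data the linguist enters for their language, rather than hard-coded functions like these.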

If you are interested in this junk...

http://fieldworks.sil.org/flex/grammar/

Hit the links there... good stuff...

FLEx (FieldWorks) is a tool developed by SIL (one that I used extensively in developing the language I work in...). We are borrowing much of its functionality, but integrating the C# libraries they've built into our Electron app will probably be as hard as making nearley do the same thing for us.

The FieldWorks Parsers work as a researcher 'interlinearizes' the texts he has entered.

Thoughts?

kach commented 6 years ago

Closing because of inactivity. Let me know if you still need any help with this! :-)