employ lexical tagging in searches

icornelius commented 3 years ago

The about page states that "The data in the Glossarial Concordance was compiled, lemmatized, and tagged by the late Larry D. Benson" and this seems to be confirmed by the search module, where the options generated within the field "Middle English word(s)" match Benson's entries, still posted at the Harvard Chaucer site. Yet I cannot see what use is being made of Benson's lexical tagging.

Type "seke" into the field "Middle English word(s)", then select the option

sik adj. "sick, ill; (as n.) sick person," s.v. sick a. and sb. OED

and restrict the search location to Geoffrey Chaucer. A search on these criteria returns 41 matches; all 10 matches on the first page are false positives. They are forms of the verb SEEK, i.e., Benson's

sechen v. "seek, look for; search out, hunt; examine; ask, beseech," s.v. seek v. OED

Compare the relevant entries at the Harvard Chaucer Page, linked above.

I do not understand what GCME is doing in this case. It does not have the functionality of a glossary or a concordance. It appears to be returning tokens of the character-string seke. Why?

markpatton commented 3 years ago

When you use the Middle English search box, you are doing pretty much exact string matching. The dropdown may give you the false impression that you are searching for all forms of a word. The intent was just to help users find Middle English words to search, but perhaps this should be clarified on the page or in the documentation.

In contrast, if you use the other search box, you are searching the raw tagged words produced by Benson. Try typing in sechen and you will see options for all the various forms, like sechen@v%ppl or sechen@v%pr_1. What you can't do is search for all of those sechen words at once. It wouldn't be very hard to add, but some thought would have to go into making the user interface clear.
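To make the difference between the two search boxes concrete, here is a minimal sketch in Python. It assumes the raw tags follow the pattern lemma@pos%features, which is inferred only from the two examples above (sechen@v%ppl, sechen@v%pr_1). The toy records, the tag sik@adj, and the names parse_tag, string_search and tag_search are hypothetical and are not taken from the GCME code or data.

```python
from typing import List, NamedTuple, Optional, Tuple


class Tag(NamedTuple):
    lemma: str
    pos: str
    features: Optional[str]  # None when the tag has no '%...' suffix


def parse_tag(raw: str) -> Tag:
    """Split an assumed 'lemma@pos%features' tag into its parts."""
    lemma, _, rest = raw.partition("@")
    pos, _, features = rest.partition("%")
    return Tag(lemma, pos, features or None)


# Hypothetical toy records: (surface spelling, raw tag).
# The pairings are made up for illustration; they are not GCME data.
TOKENS: List[Tuple[str, str]] = [
    ("seke", "sechen@v%pr_1"),
    ("seke", "sik@adj"),        # assumed tag shape for the adjective
    ("sekith", "sechen@v%pr"),
]


def string_search(spelling: str) -> List[Tuple[str, str]]:
    """First search box: exact match on the spelling, ignoring the tag."""
    return [t for t in TOKENS if t[0] == spelling]


def tag_search(raw_tag: str) -> List[Tuple[str, str]]:
    """Second search box: exact match on one raw tag."""
    return [t for t in TOKENS if t[1] == raw_tag]


if __name__ == "__main__":
    print(parse_tag("sechen@v%pr_1"))
    print(string_search("seke"))        # mixes the verb and the adjective
    print(tag_search("sechen@v%pr_1"))  # only the tagged verb form
```

In this sketch the spelling search for seke returns both the verb and the adjective, which is the false-positive behaviour reported above, while the tag search returns a single tagged form and nothing else under the same headword.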

icornelius commented 3 years ago

I suggest that at minimum some strong caveats are needed here and that Benson's entry-titles should not appear if they are not used in the search. Character-string searches are generally unhelpful in Middle English texts (because spellings vary) -- which is why Benson's 2-vol printed Glossarial Concordance to the Riverside Chaucer remains superior to any digital tool I have seen. Folks who know what a glossarial concordance is will be confused and disappointed by the present functionality of GCME. The concept is promising and Benson's parsed texts will serve as a model for analysis of other Middle English; the challenge is to construct a search function that represents, accurately and at the appropriate level of generality, the lexical analysis encoded in the underlying data.

markpatton commented 3 years ago

I agree that the dropdowns in the first search box could confuse users and should be clarified.

But I don't quite understand the other feedback. The lexical tagging can be searched using the second search box. For example, sechen@v%pr matches sekith, sek, sech, etc. As far as I can tell this works precisely and accurately.

What is lacking is the means to search all of the words tagged with sechen at once. This would require a little work, but could be done.

icornelius commented 3 years ago

In my view the missing function should be the default one in a digital concordance. One expects a search that returns all occurrences of all morphosyntactic and orthographic forms of the selected lexeme (all word-tokens tagged with sechen, in our example). The default output should look like the entries in Benson's Glossarial Concordance. This isn't a matter of imitating print, but rather of retaining and using the linguistic/philological concepts that rightly organized print concordances. The challenge is to avoid either

degeneration into string matching
defaulting to the maximum resolution encoded in the underlying data (this is what I mean by "the appropriate level of generality" above).

Users will be grateful for the option of refining lexical searches by morphosyntactic attribute (e.g., output only the present forms of the selected verb sechen). Benson's print concordance cannot do that for us. But it should be an optional filter, not the default output of a concordance.
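A rough sketch of the behaviour described here, in Python: the default search returns every token filed under a headword, and the morphosyntactic restriction is an optional argument. It assumes tags of the form lemma@pos%features (inferred from the examples earlier in this thread) and treats the filter as a prefix match on the features part, which is a design guess rather than anything GCME does. The records, the tag sik@adj, the location labels, and the names build_lemma_index and lemma_search are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (location, surface spelling, features) stored under a 'lemma@pos' headword
Hit = Tuple[str, str, str]

# Hypothetical toy records: (location, surface spelling, raw tag).
# Made up for illustration; not GCME data.
TOKENS: List[Tuple[str, str, str]] = [
    ("loc-1", "seke", "sechen@v%pr_1"),
    ("loc-2", "seke", "sik@adj"),
    ("loc-3", "sekith", "sechen@v%pr"),
    ("loc-4", "sek", "sechen@v%pr"),
    ("loc-5", "sech", "sechen@v%pr"),
]


def build_lemma_index(tokens: List[Tuple[str, str, str]]) -> Dict[str, List[Hit]]:
    """Group tokens by headword ('lemma@pos'), dropping the '%features' part."""
    index: Dict[str, List[Hit]] = defaultdict(list)
    for location, spelling, raw_tag in tokens:
        lemma, _, rest = raw_tag.partition("@")
        pos, _, features = rest.partition("%")
        index[f"{lemma}@{pos}"].append((location, spelling, features))
    return index


def lemma_search(
    index: Dict[str, List[Hit]],
    headword: str,
    feature_prefix: Optional[str] = None,
) -> List[Hit]:
    """Default: all forms of the headword. Optional: keep only matching features."""
    hits = index.get(headword, [])
    if feature_prefix is None:
        return list(hits)
    return [h for h in hits if h[2].startswith(feature_prefix)]


if __name__ == "__main__":
    index = build_lemma_index(TOKENS)
    for hit in lemma_search(index, "sechen@v"):
        print(hit)                                 # every form of the verb
    print(lemma_search(index, "sechen@v", "pr_1"))  # optional refinement
```

Keying the index on lemma plus part of speech keeps seke the adjective out of the results for the verb, and leaving feature_prefix unset gives the concordance-style default, with the finer-grained tag search available as a refinement.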

markpatton commented 3 years ago

I understand your feedback now. Thank you for clarifying. Unfortunately this just did not come up as a core use case during the site design.

We will keep track of this feature request.

jhu-digital-manuscripts / gcme

employ lexical tagging in searches #36