quanteda / stopwords

Multilingual Stopword Lists in R
http://stopwords.quanteda.io

Update ancient Greek and Latin stopwords #19

Closed – kbenoit closed 4 years ago

kbenoit commented 4 years ago

Source: https://wiki.digitalclassicist.org/Stopwords_for_Greek_and_Latin

From #3

aurelberra commented 4 years ago

Hello. Thank you for this very useful initiative! Compiling stopwords in quanteda is a great idea, and I was very pleased to see that Greek and Latin were included.

I have worked on this topic: a few years ago I revised the (quite old) page on the Digital Classicist site that you took as a reference. It is indeed a good starting point, but the basic lists contain errors and are, to say the least, minimal.

This is why I maintain larger, corpus-based lists in this repository: https://github.com/aurelberra/stopwords, together with a rationale providing details on my motivation and the method I used.

Would it be useful to adopt corrected, sounder ancient Greek and Latin lists in your package? I'd be happy to share, discuss and test.

(Edit: adding @kbenoit's handle, just to make sure that my comment on a closed issue reaches the maintainers of the repo.)

kbenoit commented 4 years ago

Thanks @aurelberra, this is a good idea. I'd be happy to update these wordlists based on your revisions.

aurelberra commented 4 years ago

Thanks for your reply, @kbenoit, and sorry for not following up sooner. I'm happy to share the data.

I could follow the procedure described in the repo and add the lists myself, either as TXT or RDA. However, although the lists are pretty stable now, you might prefer to access the latest versions via GitHub (Greek and Latin – both are derived from the JSON versions and should be stripped of comments) or via Zenodo (this is the permanent link, but probably less convenient if you just want the lists). I see that this is what you do for other languages.
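
(For instance, a minimal sketch of pulling the TXT versions straight from GitHub into R – the file paths below are my assumption, so check the repository for the actual names:)

```r
# Sketch only: file paths are assumed, not verified against the repo layout
base <- "https://raw.githubusercontent.com/aurelberra/stopwords/master/"
grc <- readLines(paste0(base, "stopwords_greek.txt"), encoding = "UTF-8", warn = FALSE)
lat <- readLines(paste0(base, "stopwords_latin.txt"), encoding = "UTF-8", warn = FALSE)
```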

In any case, the @source could read something like this: "Aurélien Berra, Ancient Greek and Latin stopwords, http://doi.org/10.5281/zenodo.1165205. See https://github.com/aurelberra/stopwords/blob/master/rationale.md." Feel free to adjust!

kbenoit commented 4 years ago

@aurelberra I checked out your lists, and they contain a great many entries. Our existing Greek list, for instance, contains only 78 items. Which entries from your extensive listings do you think should be included?

Note that since we treat non-word characters (numerals, punctuation, etc.) separately from stopwords, we would not include those.
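
(For context, this separation happens at the user's end at tokenisation time in quanteda; a minimal sketch with an invented snippet of text:)

```r
library(quanteda)
# punctuation and numerals are dropped when tokenising,
# before any stopword list is applied
toks <- tokens("arma uirumque cano, 123",
               remove_punct = TRUE, remove_numbers = TRUE)
```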

aurelberra commented 4 years ago

@kbenoit I see how the number of words included can be surprising.

Typographical symbols and Arabic numerals can of course be excluded. Critical abbreviations and (in Latin) abbreviated praenomina are refinements introduced to address common problems and could be discarded. It seems more useful to filter out single letters (Greek and Latin) and numerals specific to both languages, as sketched below.
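
(A toy sketch of dropping single letters on the user's side, rather than via the list itself – the example tokens are invented:)

```r
library(quanteda)
# single-letter tokens can be removed with a one-character regex pattern
toks <- tokens("α β λόγος a b uerbum")
tokens_remove(toks, pattern = "^.$", valuetype = "regex")
# keeps "λόγος" and "uerbum", drops the isolated letters
```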

The problem, however, is that inflected languages like Greek and Latin are difficult to deal with in an environment where we have no access to lemmatisation or normalisation. There are many alternative forms and spellings (especially for Greek, with elided forms, forms merged with articles, dialectal variants and several possible accents – but Latin also has its u/v and i/j variants). Furthermore, the texts users will want to analyse span many centuries.

In fact, extensive or comprehensive paradigms have to be included for the lists to be reasonably useful. If we set aside invariable words (conjunctions, prepositions, postpositions, particles, interjections, adverbs), these massive lists cover only 54 words (lemmas) with all their forms for each language. I have detailed counts in a table in my rationale. The categories and number of lemmas are actually not so far from those in your Marimo source.

When building up my lists I first compared the existing stopword lists – the ones you have taken from the Digital Classicist wiki page are defective versions of the original Perseus lists, which themselves contain errors and do not really reflect frequencies. Then I rebased the lists on statistical analyses of the common textual databases and added less common forms of the words already present.

I am not sure how I would go about reducing such lists without losing their efficiency. What would you suggest?

kbenoit commented 4 years ago

How about you propose, for each language, a single list of word (patterns) that could serve as your version of the stopwords? These would be similar in scope to the existing lists – so not including punctuation, numerals, etc. Then a direct comparison will be easier.
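
(Once both versions exist as plain character vectors, such a comparison is a few lines of R; a sketch, where proposed_grc is a hypothetical name and the "grc"/"ancient" codes for the existing list are my assumption:)

```r
# proposed_grc: the revised list as a character vector (hypothetical name)
current_grc <- stopwords::stopwords("grc", source = "ancient")
added   <- setdiff(proposed_grc, current_grc)  # forms only in the revision
dropped <- setdiff(current_grc, proposed_grc)  # forms only in the old list
length(added); length(dropped)
```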

aurelberra commented 4 years ago

Thank you. I would just like to be sure that we are talking about the same things when we say "words (patterns)" and "direct comparison". Do you mean to compare lemmas (dictionary entries) or the actual forms included in the lists?

I will take the example of definite articles. English has 1 lemma ("the") with only 1 invariable form. Ancient Greek has 3 lemmas (masculine ὁ, feminine ἡ, neuter τό) with three numbers (singular, plural, dual), giving a total of 20 forms. It is necessary to add variations of accents (at least two forms, depending on the word's position) and the most common suffixed forms (with the letter ι or the letters ιν as a mark of emphasis), giving a total of 44 forms. Depending on the genre of the text or the edition you use, the list will be useless if you don't include common graphical variants (whether or not a iota is written on its own after a long vowel) and dialectal spellings and variants, hence 9 more forms. Finally, I cannot see why a stopword list should not include the most common suffixed forms (with -περ to mean "precisely") and the common case where the article is merged with the word "and" (καί), which would be missed in string-based filtering, and add 28 forms. Hence the current 81 forms included in my list, for 3 lemmas.

Out of a total of 79 forms, the current Quanteda/Perseus list contains 13 forms for the articles: it omits three forms of the feminine plural (αἱ, τάς, ταῖς), one of the masculine plural (τοῖς), and all forms of the dual (τώ, τοῖν, ταῖν). Even from a statistical point of view, this does not make much sense. The current maintainers of the Perseus website were perplexed when I submitted the problem a few years ago: they didn't know why some common forms had been left out – and even less why, for example, one extremely rare word was included in the Greek list (δαίς) and one non-existent word was included in the Latin list (adhic).

For Greek I have 427 invariable forms (particles, interjections, conjunctions, prepositions/postpositions, adverbs), and 54 lemmas which account for 5704 forms (articles, pronouns, nouns, adjectives, verbs). The Homeric stopwords sublist contains the most common forms specific to epic diction and can be useful for various corpora as Homer is quoted everywhere in Greek.

For Latin I have 240 invariable forms (conjunctions, prepositions, adverbs), and 54 lemmas which account for 3335 forms (pronouns, nouns, adjectives, verbs).

My point is that a purely quantitative comparison is misleading when adding data for highly inflected languages. What do you think?


I copy here the tables with detailed counts which I provide in my rationale:

| Greek v2.8 | # forms | # lemmas |
|---|---:|---:|
| Typographical symbols | 28 | |
| Single letters (Latin) | 26 | |
| Single letters (Greek) | 28 | |
| Greek numerals (1-100) | 100 | |
| Arabic numerals (0-100) | 101 | |
| Roman numerals (1-100) | 101 | |
| Critical abbreviations | 154 | |
| Articles | 81 | 3 |
| Particles | 44 | |
| Interjections | 3 | |
| Conjunctions | 83 | |
| Prepositions/postpositions | 85 | |
| Adverbs | 212 | |
| Pronouns | 1435 | 22 |
| Nouns | 0 | 0 |
| Adjectives | 1206 | 23 |
| Verbs | 2982 | 6 |
| Homeric stopwords | 203 | |
| TOTAL | 6872 | |
| TOTAL unique forms | 6618 | |

| Latin v2.6 | # forms | # lemmas |
|---|---:|---:|
| Typographical symbols | 28 | |
| Single letters (Latin) | 27 | |
| Arabic numerals (0-100) | 101 | |
| Roman numerals (1-100) | 100 | |
| Critical abbreviations | 154 | |
| Abbreviated praenomina | 16 | |
| Conjunctions | 62 | |
| Prepositions | 42 | |
| Adverbs | 136 | |
| Pronouns | 967 | 27 |
| Nouns | 37 | 4 |
| Adjectives | 454 | 12 |
| Verbs | 1877 | 11 |
| TOTAL | 4001 | |
| TOTAL unique forms | 3945 | |

kbenoit commented 4 years ago

Thanks for that explanation. While it's hard to be sure since I know nothing about Latin or ancient Greek, I think we might be talking about two different things.

Stopword lists contain words that a user would typically want to remove entirely. Stemming is different, since it removes word suffixes but leaves the root word. The stopwords package is only for the former.

So the question is: how could you package your ancient Latin and ancient Greek terms as two fixed word lists of words that a user would want to take out of their texts? It sounds like, for Greek for instance, we would need 44 + 9 words to remove all variations of the definite article.

What I am not sure about is:

> I cannot see why a stopword list should not include the most common suffixed forms (with -περ to mean "precisely") and the common case where the article is merged with the word "and" (καί), which would be missed in string-based filtering, and add 28 forms.

Unpacking morphological variations in a way that retains the root of the word sounds like stemming. So it could be that implementing your overall approach would first require a form of stemming, and then stopword removal. There are ways to perform the morphological transformations using quanteda functions, but these would not be possible with just the stopwords package machinery.
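
(For example, quanteda's tokens_replace() can map variant forms onto a base form before removal; a minimal sketch with a made-up two-entry lookup, not part of the stopwords package:)

```r
library(quanteda)
# made-up lookup: map suffixed Latin forms onto their base form
lookup <- data.frame(form = c("egoque", "meque"),
                     base = c("ego",    "me"))
toks <- tokens("egoque te amo meque amas")
toks <- tokens_replace(toks, lookup$form, lookup$base, valuetype = "fixed")
tokens_remove(toks, c("ego", "me"), valuetype = "fixed")
```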

So: while I think we could handle the "427 invariable forms (particles, interjections, conjunctions, prepositions/postpositions, adverbs)" through a listing of 427 words, I am not sure how we would implement the "54 lemmas which account for 5704 forms (articles, pronouns, nouns, adjectives, verbs)" through a word list.

Note: I used the phrase "word (patterns)" since it could, theoretically, include wildcard patterns, for instance "the*" to match "the", "they", "their", "them", etc. But this is inadvisable, since that pattern would also remove "theft". So listing the fixed word patterns, rather than the wildcard patterns, is always preferred for precision.
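
(A quick illustration of that precision problem with quanteda's glob versus fixed matching – the example sentence is invented:)

```r
library(quanteda)
toks <- tokens("the thieves they caught committed theft")
tokens_remove(toks, "the*", valuetype = "glob")
# removes "the" and "they", but also "theft": a false positive
tokens_remove(toks, c("the", "they", "their", "them"), valuetype = "fixed")
# removes only the listed forms; "theft" survives
```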

aurelberra commented 4 years ago

There is no misunderstanding at all, then. Stemming is an almost useless method for ancient languages, for several reasons (for example, in Greek various vowels are introduced into verbal forms when the person changes, and compulsory accents move from one syllable to another depending on the case used). We are talking about removing common words. These lists are in use in a tool called Voyant Tools, and I know they are routinely used in R by specialists (by me and my students, as well as some colleagues).

I do think that if I remove the "headings" added in the main versions for the sake of clarity, as well as non-alphabetic symbols (28 typographical symbols and 101 Arabic numerals), the lists have about the same scope as the ones you use for other languages. The differences (and indeed the inflation in length) are due to problems that are specific to ancient Greek and Latin.

The details about "suffixed forms" you mentioned only mean that I cover less common (but not so rare) cases where a word that we would like to stop happens to have a special ending. To take a Latin example this time, I have 28 forms for the pronoun ego, "I/me". Most occurrences in a text could be caught with only 3 forms (ego, me, mihi), but the same forms also occur as one augmented word ending in -que to mean "and I/me", although que is never used as a separate word: you will sometimes find egoque, meque, mihique, which will not be removed if you only have the simpler forms in your list. In the same way, a thorough lemmatisation of a Latin text would analyse mecum ("with me"), mecumque ("and with me") and mecumst ("is with me") as me + cum, me + cum + que, me + cum + est, but stemming would not work (and our lemmatisers are not yet up to the task for corner cases).

An even more common case: a Greek word ending with a vowel will usually take an extra letter (ν, "n", which is not a word on its own) before a word beginning with a vowel. In the 60 most frequent words (MFW) of our reference corpus, the TLG, you will find both forms of "is" (ἐστι and ἐστιν), at about the same rank. Not including the suffixed form would mean missing a huge proportion of the occurrences of the most frequent verb. And of course you couldn't preprocess a text to remove all final νs, which would be a cure worse than the disease. There are no false positives when you include these forms: all forms of ego or ἐστι are removed, and no forms of other words you would like to keep.
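
(A toy illustration of the -que point, assuming quanteda-style removal of fixed forms – the sentence is invented Latin:)

```r
library(quanteda)
toks <- tokens("meque amas et ego te amo")
tokens_remove(toks, c("ego", "me", "mihi"), valuetype = "fixed")
# "meque" survives: no stemming step will catch it
tokens_remove(toks, c("ego", "me", "mihi", "meque"), valuetype = "fixed")
# the suffixed form is removed only because it is listed explicitly
```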

Another difference is that ancient Greek and Latin stopwords will only be used with historical texts, and need to cover the variations of the most common forms that are associated with particular periods and literary genres – as if your basic lists for English had to accommodate not only spellings for contemporary newspapers, journal articles and recent novels, but also Shakespeare's English and some Cockney forms and spellings used in nineteenth-century plays.

A handful of terms in each of my lists deal with practical problems: many Greek texts we use contain other information in Latin script which introduces noise (e.g. references of quoted passages), so removing Latin letters makes sense, while many available digital Greek and Latin texts contain remnants of their apparatus criticus (like abbreviated Latin terms indicating that manuscript X or scholar Y added word Z on this or that line). It is advisable to remove such words, and this intervention does not interfere with the textual content.

However, again, pure morphological variations – i.e. the forms of words used in different cases (functions), persons and grammatical genders, as well as accents in Greek – account for most of the forms in my lists.

So, as long as you don't compute statistical thresholds tailored to each corpus, I also think the only way is to include more "fixed word patterns". As I see them, my current lists are "two fixed word lists of words that a user would want to take out of their texts".

I have prepared simple lists without typographical symbols or Arabic numerals: they contain 6489 unique forms in Greek (source file) and 3816 unique forms in Latin (source file).
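
(A quick sanity check once the simplified lists are read into R – the file paths below are my guess, so see the repository for the actual ones:)

```r
# paths assumed, not verified against the repo layout
base <- "https://raw.githubusercontent.com/aurelberra/stopwords/master/stopwords_for_quanteda/"
grc_simple <- readLines(paste0(base, "stopwords_grc.txt"), encoding = "UTF-8", warn = FALSE)
lat_simple <- readLines(paste0(base, "stopwords_la.txt"),  encoding = "UTF-8", warn = FALSE)
length(unique(grc_simple))  # expected: 6489
length(unique(lat_simple))  # expected: 3816
```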

I hope this helps clarify the discussion.

kbenoit commented 4 years ago

OK, thanks a lot for that. I can add those word lists to a new stopwords source, and will put this on the list of things to do for the next revision! I'll send you the draft before merging so you can check it. Thanks!

kbenoit commented 4 years ago

Note: We could make this a new "source" called voyant, or make them the defaults for Greek and Latin in ancient, and rename the existing ones as digitalclassicist. What's your preference?

aurelberra commented 4 years ago

Excellent! I think the best solution would be to make these lists the defaults in the ancient source, and rename the current ones as perseus (as mentioned on the Digital Classicist wiki page and the relevant page of the Perseus Library, their origin is the older version of the Perseus website).

The TXT files of the stopwords_for_quanteda section of my repo should suit your needs. If you'd like me to provide lists in a specific format, please just tell me what makes the update easier.

kbenoit commented 4 years ago

OK, I've done that; try installing the add-voyant branch. (I realize now that should not be the name of it, though!)

```r
remotes::install_github("quanteda/stopwords", ref = "add-voyant")
```
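
(Assuming the branch registers the new lists under the ancient source for languages grc and la, per the naming discussed above, they should then be retrievable as:)

```r
library(stopwords)
head(stopwords("grc", source = "ancient"))
head(stopwords("la", source = "ancient"))
```
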
aurelberra commented 4 years ago

Everything looks fine!

On the attached graphs I have circled the relevant terms: you can see (for Greek) that the perseus list is not a great improvement over the MFW with no stopwords, while the updated ancient list brings up the terms that can be considered meaningful.

[Attached plots: test_grc_no_stopwords, test_grc_with_stopwords_perseus, test_grc_with_stopwords_ancient]
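
(For anyone who wants to reproduce such a comparison, a rough sketch – corp stands in for a corpus of ancient Greek texts, not shown here:)

```r
library(quanteda)
library(stopwords)
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
topfeatures(dfm(toks), 20)  # MFW with no stopwords removed
topfeatures(dfm(tokens_remove(toks, stopwords("grc", source = "ancient"))), 20)
```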