tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Question: what is tag "abbreviation"? #127

Closed medmunds closed 2 years ago

medmunds commented 2 years ago

I'm looking for a way to filter out abbreviations, and was thinking tag:abbreviation would do the job. But the abbreviation tag seems to include several categories of words that I don't think would normally be considered abbreviations. So I'm trying to understand what it actually means, or what results in an entry or sense getting tagged abbreviation.

For example, all of these have tag:abbreviation:

Is the "abbreviation" tag used for every type of shortening?

The problem I need to solve is how to distinguish standalone words like "corgi" or "seltzer" from true abbreviations like "coord". Seltzer also has tag:ellipsis, so I could special-case that, but "corgi" and "coord" are both tag:abbreviation and tag:alt-of, with no other tags.

[FWIW, the Wiktionary glossary considers clipping (dropping part of a word) to be a type of abbreviation, but specifically says ellipsis (dropping an entire word from a phrase) is "not to be confused with" clipping. The Wiktionary category tree also implies a hierarchy of shortenings, rooted in English shortenings. The abbreviations category is a subcategory of shortenings—and peer category to clippings, contractions, and ellipses.]

tatuylonen commented 2 years ago

The "abbreviation" tag is intended to be included for all the various kinds of abbreviated forms marked in various ways, including "abbreviated", "abbreviation as", "abbrev.", "abbreviation of", "short for", "eclipsed form of", "apocopic form of", "apocopated", "apocope", "acronym of", "acronym", "initialism of", "contraction of", "IUPAC 3-letter abbreviation for", "praenominal abbreviation of", "ellipsis of", "clipping of", "abbreviations". It will also be included if the gloss starts with "(abbreviation)" or similar.

This tag is usually combined with another tag that indicates what kind of abbreviation it is.

Please let me know if you notice places where it should not be included or is missing.

As to distinguishing the standalone forms, one way might be to see if the raw_gloss field begins with something like "(abbreviation)". Another way might be to see if the sense has an "alt_of" field; the system attempts to link abbreviations to their main words.
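
The two checks suggested above could be sketched roughly as follows. This is only an illustration, not wiktextract code: it assumes each sense is a plain dict, that the raw gloss text lives under a key such as "raw_glosses" (a list) or "raw_gloss" (a string), and that "alt_of" is a non-empty value when a link was found. The helper names are made up for this example.

```python
def gloss_texts(sense):
    """Collect gloss strings from a sense dict.

    The key names tried here are assumptions about the JSON layout.
    """
    value = (
        sense.get("raw_glosses")
        or sense.get("raw_gloss")
        or sense.get("glosses")
        or []
    )
    # Accept either a single string or a list of strings.
    return [value] if isinstance(value, str) else list(value)


def abbreviation_signals(sense):
    """Report the two signals: a leading "(abbreviation...)" label and an alt_of link."""
    has_label = any(
        text.lstrip().lower().startswith("(abbreviation")
        for text in gloss_texts(sense)
    )
    has_alt_of = bool(sense.get("alt_of"))
    return {"label": has_label, "alt_of": has_alt_of}


# Hand-made sample sense, not taken from a real extraction.
sample = {
    "tags": ["abbreviation", "alt-of"],
    "glosses": ["Abbreviation of coordinate."],
    "alt_of": [{"word": "coordinate"}],
}
signals = abbreviation_signals(sample)
```

Neither signal is conclusive on its own; a caller would probably combine them with the other tags on the sense.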

medmunds commented 2 years ago

Thanks, that's a helpful answer to my somewhat confused question. The more I've looked into this, the more inconsistent the Wiktionary data appears to be in how abbreviated forms are defined. And I couldn't find any real policies or guidelines on them, so it's not clear how the data should look in the first place. (It's impressive that wiktextract is getting as much useful data as it is.)

I think it would be helpful to distinguish senses specifically identified as "abbreviation" (the raw_gloss has an "(abbreviation)" label, or the definition uses the {{abbreviation of}} template) from other specific abbreviated forms. Right now, the "abbreviation" tag seems to be overloaded to cover both.

> [The "abbreviation"] tag is usually combined with another tag that indicates what kind of abbreviation it is. Please let me know if you notice places where it should not be included or is missing.

There are around 1000 senses in the 2022-03-09 English extraction that are tagged "abbreviation" but that don't have some other tag identifying a more specific kind of abbreviation ("acronym", "clipping", "contraction", "ellipsis", "initialism", or "shortening"), and that also don't have an identifying "(label)" in the raw_gloss. A few examples:

(Let me know if the full list would be helpful, or if I should open a specific new issue for this.)
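
A query along the lines described above might look like the sketch below. It is an assumption-laden illustration rather than the script actually used: it treats each entry as a dict with "word" and "senses", assumes the raw gloss text is a list under "raw_glosses", and interprets "an identifying (label)" as any parenthesized text at the start of a raw gloss. The sample entries are hand-made.

```python
import re

# More specific kinds of abbreviation, as listed in the comment above.
SPECIFIC_TAGS = {
    "acronym", "clipping", "contraction",
    "ellipsis", "initialism", "shortening",
}

# Assumption: a "(label)" means parenthesized text at the start of the gloss.
LEADING_LABEL = re.compile(r"^\s*\([^)]*\)")


def is_unspecified_abbreviation(sense):
    """True if the sense is tagged "abbreviation" with no narrower tag and no label."""
    tags = set(sense.get("tags", []))
    if "abbreviation" not in tags or tags & SPECIFIC_TAGS:
        return False
    raw = sense.get("raw_glosses") or []
    return not any(LEADING_LABEL.match(text) for text in raw)


def find_unspecified(entries):
    """Yield the word for every sense that matches the filter."""
    for entry in entries:
        for sense in entry.get("senses", []):
            if is_unspecified_abbreviation(sense):
                yield entry.get("word")


# Hand-made entries for illustration only.
entries = [
    {"word": "coord", "senses": [{"tags": ["abbreviation", "alt-of"]}]},
    {"word": "seltzer", "senses": [{"tags": ["abbreviation", "ellipsis"]}]},
    {"word": "AAR", "senses": [{"tags": ["abbreviation"],
                                "raw_glosses": ["(abbreviation) ..."]}]},
]
unspecified = list(find_unspecified(entries))
```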

tatuylonen commented 2 years ago

Looking at your list, the ones with alt-of have clearly been identified as an abbreviation of some other form (indicated in the "alt_of" field). AAR has (abbreviation) in the outer-level gloss in a hierarchical set of glosses. One could argue they should have an alt_of based on the outer gloss, but I'm hesitant to assume that this would always be correct in similar cases. However, in your list of examples, if an entry doesn't have alt-of, in each case the full gloss would have been a valid value for alt_of (i.e., the expression that it is an abbreviation of).

I think it is possible to distinguish the two cases you mention ("(abbreviation)" in raw_gloss vs. "abbreviation of" in gloss) by whether there is an alt-of tag and "alt_of" field. For various reasons, adding the distinction in code is not entirely trivial and could break things (the whole tag recognition is a bit delicate and complicated). Unless there are strong arguments why the distinction would need to be made in code, I would be inclined to leave it as-is.
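
On the consumer side, that distinction could be approximated with something like the sketch below. It assumes the extraction output is read as one JSON object per line, with "word", "senses", "tags" and "alt_of" keys; the category names "abbreviation-of" and "label-only" are hypothetical and exist only in this example.

```python
import json


def abbreviation_kind(sense):
    """Classify an abbreviation sense by whether it links to a full form.

    Returns None for senses that are not tagged "abbreviation".
    """
    tags = set(sense.get("tags", []))
    if "abbreviation" not in tags:
        return None
    if "alt-of" in tags and sense.get("alt_of"):
        return "abbreviation-of"  # linked to the form it abbreviates
    return "label-only"  # tagged, but no link was extracted


def classify_lines(lines):
    """Yield (word, kind) for each abbreviation sense in JSON-lines input."""
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        for sense in entry.get("senses", []):
            kind = abbreviation_kind(sense)
            if kind is not None:
                yield entry.get("word"), kind


# Hand-made lines standing in for an extraction file.
demo_lines = [
    json.dumps({"word": "coord", "senses": [
        {"tags": ["abbreviation", "alt-of"], "alt_of": [{"word": "coordinate"}]}]}),
    json.dumps({"word": "AAR", "senses": [{"tags": ["abbreviation"]}]}),
]
results = list(classify_lines(demo_lines))
```

Because `classify_lines` accepts any iterable of strings, an open file object for the extraction output could be passed in directly.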

I'm not sure about the "ellipsis" cases. Should "abbreviation" not be included in them? Dropping "abbreviation" from "ellipsis of" would be easy. However, "ellipsis" can also result from "elliptically" (presumably parenthesized at the start of the gloss).