Open chainsawriot opened 4 years ago
Thanks @chainsawriot, this makes sense to me. Breaking the dictionary down into categories not only gives users a choice but also reduces the number of words that go missing in translation. It also works in the same way as the current dictionaries in quanteda. The only problem is that the list is really long. We could use shorthand expressions:
AFRICA:
  EAST:
    'BI': {country: [Burundi], people: [Burundian*], city: [Bujumbura]}
    'DJ': {country: [Djibouti], people: [Djiboutian*], city: [Djibouti]}
    'ER': {country: [Eritrea], people: [Eritrean*], city: [Asmara]}
    'ET': {country: [Ethiopia], people: [Ethiopian*], city: [Addis Ababa]}
but it looks like JSON... Is there a good way to make the file shorter?
As for "switching off" sub-categories, I started a discussion in an issue for quanteda. Please join us.
An additional consideration: maybe the second category is not simply "people". In English, this is not a big problem because a demonym (e.g. Japanese) is almost always an adjective as well (e.g. Japanese cuisine). But this does not always hold for other languages. I have worked with the German dictionary by @stefan-mueller:
EAST:
  'CN':
    name: [China, Chinas, Volksrepublik China]
    demonym: [chinesisch*]
    city: [Peking, Shanghai]
  'HK':
    name: [Hongkong, Hongkongs]
    demonym: [Hongkonger]
    city: []
  'JP':
    name: [Japan, Japans]
    demonym: [japanisch*]
    city: [Tokyo, Tokio]
Usually, the country name category has the country name in its original form (Japan) and in the genitive form, the "Genitivobjekt" (Japans). The problem here is that the 2nd category is not always a demonym or people. In the German dictionary, it contains mostly adjectives (japanisch*, as in japanisches Restaurant), but not people / demonyms, e.g. Japaner/Japanerin.
In some cases, however, it needs to be a demonym. As you can see from the case of Hongkong, the 2nd category is the demonym of Hongkong. I don't think there is a German adjective derived from the noun Hongkong (caveat: my German is only at B1 level).
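To make the adjective-versus-demonym point concrete, here is a minimal Python sketch. It is not quanteda code: it assumes that entries such as `japanisch*` are meant to be read as glob patterns and that matching is case-insensitive. The helper `matches` is a hypothetical name used only for illustration.

```python
from fnmatch import fnmatchcase

# Second-category entries copied from the German example above.
PATTERNS = {"JP": ["japanisch*"], "HK": ["Hongkonger"]}

def matches(token, patterns):
    """Return True if the lowercased token matches any lowercased glob pattern."""
    token = token.lower()
    return any(fnmatchcase(token, p.lower()) for p in patterns)

print(matches("japanisches", PATTERNS["JP"]))  # adjective form -> True
print(matches("Japaner", PATTERNS["JP"]))      # person -> False
print(matches("Hongkonger", PATTERNS["HK"]))   # person -> True
```

Under these assumptions the second category catches the adjective for Japan but not the person, while for Hongkong it catches only the person, so the same category label covers two different kinds of words.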
I can foresee a similar issue with Chinese and Japanese. A reasonable segmenter would split demonyms and adjectives in these two languages into smaller tokens (e.g. 米国人 becomes 米国 and 人), so the 2nd category might not be very useful.
tokens(c("ドナルド・トランプは米国人です。"))
I don't have a good suggestion for what to call the second category.
I called the second category "people" only because a demonym is "a word (such as Nevadan or Sooner) used to denote a person who inhabits or is native to a particular place".
Your categories seem like "base" and "derivative", but we should define categories based on how we will use them rather than on formal definitions. Why did you want to "switch off" some of the words in your projects?
> Why did you want to "switch off" some of the words in your projects?
The application was actually simple: before fitting the model, I wanted some descriptive information about my corpus, e.g. the total number of exact matches of a country name.
The currently included dictionaries contain a mixture of country names (e.g. Germany), demonyms (e.g. German) and cities/regions (e.g. Berlin, Frankfurt). For some applications, one might want to switch off certain categories.
My proposal is to reorganize the yaml dictionaries into a format like this:
The problem, however, is usage. This dictionary can still be used as usual, e.g. with levels 1 to 3.
There is no easy way to "switch off" certain categories. The closest I can do with quanteda is something like this:
Or the Unix-wizardry method (certainly not a solution for a package, but it works for me in my own project.)
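As a library-independent illustration of what "switching off" sub-categories would mean for a dictionary in the nested format discussed in this thread, here is a minimal Python sketch. It is neither quanteda code nor the Unix approach mentioned above. The names `DICTIONARY`, `select` and `count_matches` are hypothetical, the entries are copied from the shorthand example earlier in the thread, and matching is assumed to be case-insensitive glob matching on single tokens (so multi-word entries such as Addis Ababa would need extra handling).

```python
from fnmatch import fnmatchcase

# Hypothetical nested dictionary: region -> country code -> sub-category -> patterns.
DICTIONARY = {
    "EAST": {
        "BI": {"country": ["Burundi"], "people": ["Burundian*"], "city": ["Bujumbura"]},
        "DJ": {"country": ["Djibouti"], "people": ["Djiboutian*"], "city": ["Djibouti"]},
    }
}

def select(dictionary, keep):
    """Flatten to country code -> patterns, keeping only the sub-categories in `keep`."""
    flat = {}
    for countries in dictionary.values():
        for code, subcats in countries.items():
            flat[code] = [p for name, pats in subcats.items() if name in keep for p in pats]
    return flat

def count_matches(tokens, patterns_by_code):
    """Count, per country code, the tokens that match at least one glob pattern."""
    return {
        code: sum(any(fnmatchcase(t.lower(), p.lower()) for p in pats) for t in tokens)
        for code, pats in patterns_by_code.items()
    }

tokens = "Burundian officials met in Bujumbura to discuss Burundi and Djibouti".split()

# Only exact country names: demonyms and cities are "switched off".
names_only = select(DICTIONARY, keep={"country"})
print(count_matches(tokens, names_only))  # {'BI': 1, 'DJ': 1}

# All sub-categories switched on.
everything = select(DICTIONARY, keep={"country", "people", "city"})
print(count_matches(tokens, everything))  # {'BI': 3, 'DJ': 1}
```

Filtering before flattening keeps the file format unchanged and moves the choice of sub-categories to the point of use, which is the behaviour asked for in this issue. Note that Djibouti appears both as a country and as a city, so switching off cities does not remove that ambiguity.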