koheiw / newsmap

Semi-supervised algorithm for geographical document classification

Proposal: updating dictionaries by adding categories #41

Open chainsawriot opened 4 years ago

chainsawriot commented 4 years ago

The dictionaries currently included contain a mixture of country names (e.g. Germany), demonyms (e.g. German) and cities/regions (e.g. Berlin, Frankfurt). For some applications, one might want to switch off certain categories.

My proposal is to reorganize the YAML dictionaries into a format like this:

# Created by Kohei Watanabe for the newsmap package: https://github.com/koheiw/newsmap

AFRICA:
  EAST:
    'BI':
      name: [Burundi]
      demonym: [Burundian*]
      city: [Bujumbura]
    'DJ':
      name: [Djibouti]
      demonym: [Djiboutian*]
      city: [Djibouti]
    'ER':
      name: [Eritrea]
      demonym: [Eritrean*]
      city: [Asmara]
    'ET':
      name: [Ethiopia]
      demonym: [Ethiopian*]
      city: [Addis Ababa]
    'KE':
      name: [Kenya]
      demonym: [Kenyan*]
      city: [Nairobi]
    'KM':
      name: [Comoros]
      demonym: [Comorian*]
      city: [Moroni]
    'MG':
      name: [Madagascar]
      demonym: [Madagas*, Malagas*]
      city: [Antananarivo]
    'MU':
      name: [Mauritius]
      demonym: [Mauritian*]
      city: [Port Louis]
    'MW':
      name: [Malawi]
      demonym: [Malawian*]
      city: [Lilongwe]
    'MZ':
      name: [Mozambique]
      demonym: [Mozambican*]
      city: [Maputo]
    'RE':
      name: [Reunion, Réunion]
      demonym: [Réunionese, Reunionese*, Reunionnais]
      city: []
    'RW':
      name: [Rwanda]
      demonym: [Rwandan*]
      city: [Kigali]
    'SC':
      name: [Seychelles]
      demonym: [Seychelloise, Seychellois]
      city: []
    'SO':
      name: [Somalia]
      demonym: [Somali*, Somalian*]
      city: [Mogadishu]
    'TZ':
      name: [Tanzania]
      demonym: [Tanzanian*]
      city: [Dodoma, Dar es Salaam]
    'UG':
      name: [Uganda]
      demonym: [Ugandan*]
      city: [Kampala]
    'YT':
      name: [Mayotte]
      demonym: [Mahoran*]
      city: [Mamoudzou]
    'ZM':
      name: [Zambia]
      demonym: [Zambian*]
      city: [Lusaka]
    'ZW':
      name: [Zimbabwe]
      demonym: [Zimbabwean*]
      city: [Harare]

  MIDDLE:
    'AO':
      name: [Angola]
      demonym: [Angolan*]
      city: [Luanda]
    'CD':
      name: [Democratic Republic Congo, DR Congo, DRC]
      demonym: [DR Congolese, Democratic Republic Congolese]
      city: [Kinshasa]
    'CF':
      name: [Central African Republic]
      demonym: [Central African*]
      city: [Bangui]
    'CG':
      name: [Congo, Congo Republic]
      demonym: [Congolese]
      city: [Brazzaville]
    'CM':
      name: [Cameroon]
      demonym: [Cameroonian*]
      city: [Yaounde, Yaoundé]
    'GA':
      name: [Gabon]
      demonym: [Gabonese]
      city: [Libreville]
    'GQ':
      name: [Equatorial Guinea]
      demonym: [Equatorial Guinean*, Equatoguinean*]
      city: [Malabo]
    'ST':
      name: [Sao Tome and Principe, São Tomé and Príncipe]
      demonym: [Sao Tomean*]
      city: []
    'TD':
      name: [Chad]
      demonym: [Chadian*]
      city: [N'Djamena]

  NORTH:
    'DZ':
      name: [Algeria]
      demonym: [Algerian*]
      city: [Algiers]
    'EG':
      name: [Egypt]
      demonym: [Egyptian*]
      city: [Cairo]
    'EH':
      name: [Western Sahara]
      demonym: [Western Saharan*]
      city: [El Aaiun]
    'LY':
      name: [Libya]
      demonym: [Libyan*]
      city: [Tripoli]
    'MA':
      name: [Morocco]
      demonym: [Moroccan*]
      city: [Rabat]
    'SD':
      name: [Sudan]
      demonym: [Sudanese]
      city: [Khartoum]
    'SS':
      name: [South Sudan, S Sudan]
      demonym: [S Sudanese]
      city: [Juba]
    'TN':
      name: [Tunisia]
      demonym: [Tunisian*]
      city: [Tunis]

  SOUTH:
    'BW':
      name: [Botswana]
      demonym: [Botswanan*]
      city: [Gaborone]
    'LS':
      name: [Lesotho]
      demonym: [Lesothonian*]
      city: [Maseru]
    'NA':
      name: [Namibia]
      demonym: [Namibian*]
      city: [Windhoek]
    'SZ':
      name: [Swaziland]
      demonym: [Swazi*]
      city: [Lobamba, Mbabane]
    'ZA':
      name: [South Africa, SA]
      demonym: [South African*, S African*]
      city: [Cape Town, Johannesburg, Pretoria]

  WEST:
    'BF':
      name: [Burkina Faso]
      demonym: [Burkinabe*]
      city: [Ouagadougou]
    'BJ':
      name: [Benin]
      demonym: [Beninese, Beninois]
      city: [Porto Novo]
    'CI':
      name: [Ivory Coast, Côte d'Ivoire, I Coast]
      demonym: [Ivorian*]
      city: [Yamoussoukro, Abidjan]
    'CV':
      name: [Cape Verde]
      demonym: [Cape Verdean*]
      city: [Praia]
    'GH':
      name: [Ghana]
      demonym: [Ghanaian*]
      city: [Accra]
    'GM':
      name: [Gambia]
      demonym: [Gambian*]
      city: [Banjul]
    'GN':
      name: [Guinea]
      demonym: [Guinean*]
      city: [Conakry]
    'GW':
      name: [Guinea Bissau]
      demonym: [Guinea Bissauan*]
      city: [Bissau]
    'LR':
      name: [Liberia]
      demonym: [Liberian*]
      city: [Monrovia]
    'ML':
      name: [Mali]
      demonym: [Malian*]
      city: [Bamako]
    'MR':
      name: [Mauritania]
      demonym: [Mauritanian*]
      city: [Nouakchott]
    'NE':
      name: [Niger]
      demonym: [Nigerien*]
      city: [Niamey]
    'NG':
      name: [Nigeria]
      demonym: [Nigerian*]
      city: [Abuja, Lagos]
    'SH':
      name: [Saint Helena, St Helena]
      demonym: [Saint Helenian*, St Helenian*]
      city: [Jamestown]
    'SL':
      name: [Sierra Leone]
      demonym: [Sierra Leonean*]
      city: [Freetown]
    'SN':
      name: [Senegal]
      demonym: [Senegalese]
      city: [Dakar]
    'TG':
      name: [Togo]
      demonym: [Togolese]
      city: [Lome, Lomé]

AMERICA:
  CARIB:
    'AG':
      name: [Antigua and Barbuda]
      demonym: [Antiguan*, Barbudan*]
      city: []
    'AI':
      name: [Anguilla]
      demonym: [Anguillan*]
      city: [The Valley]
    'AW':
      name: [Aruba]
      demonym: [Aruban*]
      city: [Oranjestad]
    'BB':
      name: [Barbados]
      demonym: [Barbadian*]
      city: [Bridgetown]
    'BL':
      name: [Saint Barthelemy, Saint-Barthelemy, Saint-Barthélemy, St Barthelemy]
      demonym: [Barthelemois]
      city: [Gustavia]
    'BQ':
      name: [Bonaire]
      demonym: [Bonairean*]
      city: [Kralendijk]
    'BS':
      name: [Bahamas]
      demonym: [Bahamian*]
      city: [Nassau]
    'CU':
      name: [Cuba]
      demonym: [Cuban*]
      city: [Havana]
    'CW':
      name: [Curacao]
      demonym: [Curacaoan*]
      city: [Willemstad]
    'DM':
      name: [Commonwealth of Dominica]
      demonym: [Commonwealth Dominican*]
      city: [Roseau]
    'DO':
      name: [Dominican Republic]
      demonym: [Dominican*]
      city: [Santo Domingo]
    'GD':
      name: [Grenada]
      demonym: [Grenadian*]
      city: [Saint George's, St George's]
    'GP':
      name: [Guadeloupe]
      demonym: [Guadeloupean*]
      city: [Basse-Terre]
    'HT':
      name: [Haiti]
      demonym: [Haitian*]
      city: [Port au Prince, Port-au-Prince]
    'JM':
      name: [Jamaica]
      demonym: [Jamaican*]
      city: [Kingston]
    'KN':
      name: [Saint Kitts and Nevis, St Kitts and Nevis]
      demonym: [Kittitian*, Nevisian*]
      city: [Basseterre]
    'KY':
      name: [Cayman Islands, Cayman Island*]
      demonym: [Caymanian*]
      city: []
    'LC':
      name: [Saint Lucia, St Lucia]
      demonym: [Saint Lucian*, St Lucian*]
      city: [Castries]
    'MF':
      name: [Saint Martin, St Martin]
      demonym: [Saint Martiner*, St Martiner*]
      city: [Marigot]
    'MQ':
      name: [Martinique]
      demonym: [Martinican*]
      city: []
    'MS':
      name: [Montserrat]
      demonym: [Montserratian*]
      city: [Brades]
    'PR':
      name: [Puerto Rico]
      demonym: [Puerto Rican*]
      city: [San Juan]
    'SX':
      name: [Sint Maarten, St Maarten]
      demonym: [Sint Maartener*, St Maartener*]
      city: [Philipsburg]
    'TC':
      name: [Turks and Caicos Islands, Turks and Caicos Island*]
      demonym: []
      city: [Cockburn Town]
    'TT':
      name: [Trinidad and Tobago, Trinidad]
      demonym: [Trinidadian*, Tobagonian*, Trinbagonian*]
      city: [Port of Spain]
    'VC':
      name: [Saint Vincent and the Grenadines, St Vincent and the Grenadines]
      demonym: [Vincentian*]
      city: [Kingstown]
    'VG':
      name: [British Virgin Islands, Virgin Island*]
      demonym: []
      city: [Road Town]
    'VI':
      name: [United States Virgin Islands, US Virgin Islands, United States Virgin Island*, US Virgin Island*]
      demonym: []
      city: [Charlotte Amalie]

  CENTER:
    'BZ':
      name: [Belize]
      demonym: [Belizean*]
      city: [Belmopan]
    'CR':
      name: [Costa Rica]
      demonym: [Costa Rican*, Ticos]
      city: [San Jose]
    'GT':
      name: [Guatemala]
      demonym: [Guatemalan*]
      city: [Guatemala City]
    'HN':
      name: [Honduras]
      demonym: [Honduran*]
      city: [Tegucigalpa]
    'MX':
      name: [Mexico]
      demonym: [Mexican*]
      city: [Mexico City]
    'NI':
      name: [Nicaragua]
      demonym: [Nicaraguan*]
      city: [Managua]
    'PA':
      name: [Panama]
      demonym: [Panamanian*]
      city: [Panama City]
    'SV':
      name: [El Salvador]
      demonym: [Salvadoran*]
      city: [San Salvador]

  SOUTH:
    'AR':
      name: [Argentina, Argentine*]
      demonym: [Argentinian*]
      city: [Buenos Aires]
    'BO':
      name: [Bolivia]
      demonym: [Bolivian*]
      city: [Sucre, La Paz]
    'BR':
      name: [Brazil]
      demonym: [Brazilian*]
      city: [Brasilia, Sao Paulo, Rio]
    'CL':
      name: [Chile]
      demonym: [Chilean*]
      city: [Santiago]
    'CO':
      name: [Colombia]
      demonym: [Colombian*]
      city: [Bogota]
    'EC':
      name: [Ecuador]
      demonym: [Ecuadorian*]
      city: [Quito]
    'FK':
      name: [Falkland Islands, Falkland Island*]
      demonym: []
      city: []
    'GF':
      name: [French Guiana]
      demonym: [French Guianese]
      city: []
    'GY':
      name: [Guyana]
      demonym: [Guyanese]
      city: []
    'PE':
      name: [Peru]
      demonym: [Peruvian*]
      city: [Lima]
    'PY':
      name: [Paraguay]
      demonym: [Paraguayan*]
      city: [Asuncion]
    'SR':
      name: [Suriname]
      demonym: [Surinamese]
      city: [Paramaribo]
    'UY':
      name: [Uruguay]
      demonym: [Uruguayan*]
      city: [Montevideo]
    'VE':
      name: [Venezuela]
      demonym: [Venezuelan*]
      city: [Caracas]

  NORTH:
    'BM':
      name: [Bermuda]
      demonym: [Bermudan*]
      city: []
    'CA':
      name: [Canada]
      demonym: [Canadian*]
      city: [Ottawa, Toronto, Quebec]
    'GL':
      name: [Greenland]
      demonym: [Greenlander*]
      city: [Nuuk]
    'PM':
      name: [Saint Pierre and Miquelon, St Pierre and Miquelon]
      demonym: [Saint Pierrais, Miquelonnais]
      city: [Saint Pierre]
    'US':
      name: [United States, US]
      demonym: [American*]
      city: [Washington, New York]

ASIA:
  CENTER:
    'KG':
      name: [Kyrgyzstan]
      demonym: [Kyrgyz*]
      city: [Bishkek]
    'KZ':
      name: [Kazakhstan]
      demonym: [Kazakh*]
      city: [Astana]
    'TJ':
      name: [Tajikistan]
      demonym: [Tajiks*]
      city: [Dushanbe]
    'TM':
      name: [Turkmenistan]
      demonym: [Turkmen*]
      city: [Ashhabad]
    'UZ':
      name: [Uzbekistan]
      demonym: [Uzbek*]
      city: [Tashkent]

  EAST:
    'CN':
      name: [China]
      demonym: [Chinese]
      city: [Beijing, Shanghai]
    'HK':
      name: [Hong Kong]
      demonym: [Hongkongese]
      city: []
    'JP':
      name: [Japan]
      demonym: [Japanese]
      city: [Tokyo]
    'KP':
      name: [North Korea, N Korea, DPRK]
      demonym: [North Korean*, N Korean*]
      city: [Pyongyang]
    'KR':
      name: [South Korea, S Korea]
      demonym: [South Korean*, S Korean*]
      city: [Seoul]
    'MN':
      name: [Mongolia]
      demonym: [Mongolian*]
      city: [Ulan Bator]
    'MO':
      name: [Macao, Macau]
      demonym: [Macanese]
      city: []
    'TW':
      name: [Taiwan]
      demonym: [Taiwanese]
      city: [Taipei]

  SOUTH:
    'AF':
      name: [Afghanistan]
      demonym: [Afghan*]
      city: [Kabul]
    'BD':
      name: [Bangladesh]
      demonym: [Bangladeshi*]
      city: [Dhaka, Dacca]
    'BT':
      name: [Bhutan]
      demonym: [Bhutanese]
      city: [Thimphu]
    'IN':
      name: [India]
      demonym: [Indian*]
      city: [Mumbai, New Delhi]
    'IR':
      name: [Iran]
      demonym: [Iranian*]
      city: [Tehran]
    'LK':
      name: [Sri Lanka]
      demonym: [Sri Lankan*]
      city: [Colombo]
    'MV':
      name: [Maldives]
      demonym: [Maldivian*]
      city: []
    'NP':
      name: [Nepal]
      demonym: [Nepali, Nepalese]
      city: [Katmandu, Kathmandu]
    'PK':
      name: [Pakistan]
      demonym: [Pakistani*]
      city: [Islamabad]

  SOUTH-EAST:
    'BN':
      name: [Brunei]
      demonym: [Bruneian*]
      city: []
    'ID':
      name: [Indonesia]
      demonym: [Indonesian*]
      city: [Jakarta]
    'KH':
      name: [Cambodia]
      demonym: [Cambodian*]
      city: [Phnom Penh]
    'LA':
      name: [Laos]
      demonym: [Laotian*]
      city: [Vientiane]
    'MM':
      name: [Myanmar, Burma]
      demonym: [Myanmarese, Burmese]
      city: [Yangon, Naypyidaw]
    'MY':
      name: [Malaysia]
      demonym: [Malaysian*]
      city: [Kuala Lumpur, Putrajaya]
    'PH':
      name: [Philippines, Philippine]
      demonym: [Filipino*, Filipina*]
      city: [Manila]
    'SG':
      name: [Singapore]
      demonym: [Singaporean*]
      city: [Singapore]
    'TH':
      name: [Thailand]
      demonym: [Thai]
      city: [Bangkok]
    'TL':
      name: [East Timor, Timor Leste]
      demonym: [East Timorese]
      city: [Dili]
    'VN':
      name: [Viet Nam, Vietnam]
      demonym: [Vietnamese]
      city: [Hanoi, Ho Chi Minh City]

  WEST:
    'AE':
      name: [United Arab Emirates, UAE]
      demonym: [Emirati*, Emiri*]
      city: [Dubai, Abu Dhabi]
    'AM':
      name: [Armenia]
      demonym: [Armenian*]
      city: [Yerevan]
    'AZ':
      name: [Azerbaijan]
      demonym: [Azerbaijani*, Azeri*]
      city: [Baku]
    'BH':
      name: [Bahrain]
      demonym: [Bahraini*]
      city: [Manama]
    'CY':
      name: [Cyprus]
      demonym: [Cypriot*]
      city: [Nicosia]
    'GE':
      name: [Georgia]
      demonym: [Georgian*]
      city: [Tbilisi]
    'IL':
      name: [Israel]
      demonym: [Israeli*]
      city: [Jerusalem]
    'IQ':
      name: [Iraq]
      demonym: [Iraqi*]
      city: [Baghdad]
    'JO':
      name: [Jordan]
      demonym: [Jordanian*]
      city: [Amman]
    'KW':
      name: [Kuwait]
      demonym: [Kuwaiti*]
      city: [Kuwait City]
    'LB':
      name: [Lebanon]
      demonym: [Lebanese]
      city: [Beirut]
    'OM':
      name: [Oman]
      demonym: [Omani*]
      city: [Muscat]
    'PS':
      name: [Palestine]
      demonym: [Palestinian*]
      city: [Gaza City, Gaza, West Bank]
    'QA':
      name: [Qatar]
      demonym: [Qatari*]
      city: [Doha]
    'SA':
      name: [Saudi Arabia]
      demonym: [Saudi*]
      city: [Riyadh]
    'SY':
      name: [Syria]
      demonym: [Syrian*]
      city: [Damascus]
    'TR':
      name: [Turkey]
      demonym: [Turk*]
      city: [Ankara, Istanbul]
    'YE':
      name: [Yemen]
      demonym: [Yemeni*]
      city: [Sana'a]

EUROPE:
  EAST:
    'BG':
      name: [Bulgaria]
      demonym: [Bulgarian*]
      city: [Sofia]
    'BY':
      name: [Belarus]
      demonym: [Belarusian*]
      city: [Minsk]
    'CZ':
      name: [Czech Republic]
      demonym: [Czech*]
      city: [Prague]
    'HU':
      name: [Hungary]
      demonym: [Hungarian*]
      city: [Budapest]
    'MD':
      name: [Moldova]
      demonym: [Moldovan*]
      city: [Chisinau]
    'PL':
      name: [Poland]
      demonym: [Polish, Pole*]
      city: [Warsaw]
    'RO':
      name: [Romania]
      demonym: [Romanian*]
      city: [Bucharest]
    'RU':
      name: [Russia]
      demonym: [Russian*]
      city: [Moscow]
    'SK':
      name: [Slovakia]
      demonym: [Slovak*]
      city: [Bratislava]
    'UA':
      name: [Ukraine]
      demonym: [Ukrainian*]
      city: [Kiev]

  NORTH:
    'AX':
      name: [Aland Islands, Aland Island*]
      demonym: [Alandish]
      city: [Mariehamn]
    'DK':
      name: [Denmark]
      demonym: [Danish, Dane*]
      city: [Copenhagen]
    'EE':
      name: [Estonia]
      demonym: [Estonian*]
      city: [Tallinn]
    'FI':
      name: [Finland]
      demonym: [Finnish, Finn*]
      city: [Helsinki]
    'FO':
      name: [Faeroe Islands, Faeroe Island*]
      demonym: [Faroese*]
      city: [Torshavn]
    'GB':
      name: [UK, United Kingdom, Britain]
      demonym: [British, Briton*, Brit*]
      city: [London]
    'GG':
      name: [Guernsey]
      demonym: [Guernseie*]
      city: [Saint Peter Port, St Peter Port]
    'IE':
      name: [Ireland]
      demonym: [Irish]
      city: [Dublin]
    'IM':
      name: [Isle of Man]
      demonym: [Manx]
      city: []
    'IS':
      name: [Iceland]
      demonym: [Icelandic, Icelander*]
      city: [Reykjavik]
    'JE':
      name: [Channel Islands, Channel Island*]
      demonym: []
      city: []
    'LT':
      name: [Lithuania]
      demonym: [Lithuanian*]
      city: [Vilnius]
    'LV':
      name: [Latvia]
      demonym: [Latvian*]
      city: [Riga]
    'NO':
      name: [Norway]
      demonym: [Norwegian*]
      city: [Oslo]
    'SE':
      name: [Sweden]
      demonym: [Swedish, Swede*]
      city: [Stockholm]
    'SJ':
      name: [Svalbard and Jan Mayen Islands]
      demonym: []
      city: []

  SOUTH:
    'AD':
      name: [Andorra]
      demonym: [Andorran*]
      city: []
    'AL':
      name: [Albania]
      demonym: [Albanian*]
      city: [Tirana]
    'BA':
      name: [Bosnia, Bosnia and Herzegovina, Herzegovina]
      demonym: [Bosnian*]
      city: [Sarajevo]
    'ES':
      name: [Spain]
      demonym: [Spanish, Spaniard*]
      city: [Madrid, Barcelona]
    'GI':
      name: [Gibraltar]
      demonym: [Gibraltarian*, Llanitos]
      city: []
    'GR':
      name: [Greece]
      demonym: [Greek*]
      city: [Athens]
    'HR':
      name: [Croatia]
      demonym: [Croatian*, Croat*]
      city: [Zagreb]
    'IT':
      name: [Italy]
      demonym: [Italian*]
      city: [Rome]
    'KV':
      name: [Kosovo]
      demonym: [Kosovan*]
      city: [Pristina]
    'ME':
      name: [Montenegro]
      demonym: [Montenegrin*]
      city: [Podgorica]
    'MK':
      name: [Macedonia]
      demonym: [Macedonian*]
      city: [Skopje]
    'MT':
      name: [Malta]
      demonym: [Maltese]
      city: [Valletta]
    'PT':
      name: [Portugal]
      demonym: [Portuguese]
      city: [Lisbon]
    'RS':
      name: [Serbia]
      demonym: [Serbian*, Serb*]
      city: [Belgrade]
    'SI':
      name: [Slovenia]
      demonym: [Slovenian*, Slovene*]
      city: [Ljubljana]
    'SM':
      name: [San Marino]
      demonym: [Sammarinese]
      city: []
    'VA':
      name: [Vatican]
      demonym: []
      city: []

  WEST:
    'AT':
      name: [Austria]
      demonym: [Austrian*]
      city: [Vienna]
    'BE':
      name: [Belgium]
      demonym: [Belgian*]
      city: [Brussels]
    'CH':
      name: [Switzerland]
      demonym: [Swiss*]
      city: [Zurich, Bern]
    'DE':
      name: [Germany]
      demonym: [German*]
      city: [Berlin, Frankfurt]
    'FR':
      name: [France]
      demonym: [French*]
      city: [Paris]
    'LI':
      name: [Liechtenstein]
      demonym: [Liechtenstein*]
      city: [Vaduz]
    'LU':
      name: [Luxembourg]
      demonym: [Luxembourgish, Luxembourger*]
      city: []
    'MC':
      name: [Monaco]
      demonym: [Monacan*, Monegasque*]
      city: []
    'NL':
      name: [Netherlands, Holland]
      demonym: [Dutch, Hollander*]
      city: [Amsterdam]

OCEANIA:
  AU-NZ:
    'AU':
      name: [Australia]
      demonym: [Australian*, Aussie*, Oz]
      city: [Canberra, Sydney]
    'CK':
      name: [Cook Islands, Cook Island*]
      demonym: []
      city: [Avarua]
    'NF':
      name: [Norfolk Island]
      demonym: [Norfolk Islander*]
      city: []
    'NZ':
      name: [New Zealand, N Zealand, NZ]
      demonym: [New Zealander*, Kiwi*]
      city: [Wellington, Auckland]

  MEL:
    'FJ':
      name: [Fiji]
      demonym: [Fijian*]
      city: []
    'NC':
      name: [New Caledonia]
      demonym: [New Caledonian*]
      city: [Noumea]
    'PG':
      name: [Papua New Guinea]
      demonym: [Papua New Guinean*, Papuan*]
      city: [Port Moresby]
    'SB':
      name: [Solomon Islands, Solomon Island*]
      demonym: []
      city: [Honiara]
    'VU':
      name: [Vanuatu]
      demonym: [Vanuatuan*]
      city: [Port Vila]

  MIC:
    'FM':
      name: [Micronesia]
      demonym: [Micronesian*]
      city: [Palikir]
    'GU':
      name: [Guam]
      demonym: [Guamanian*]
      city: [Hagatna]
    'KI':
      name: [Kiribati]
      demonym: [Kiribati*]
      city: [Tarawa]
    'MH':
      name: [Marshall Islands, Marshall Island*]
      demonym: [Marshallese]
      city: [Majuro]
    'MP':
      name: [Northern Mariana Islands, Northern Mariana Island*]
      demonym: []
      city: [Capital Hill]
    'NR':
      name: [Nauru]
      demonym: [Nauruan*]
      city: [Yaren]
    'PW':
      name: [Palau]
      demonym: [Palauan*]
      city: [Melekeok]

  POL:
    'AS':
      name: [American Samoa]
      demonym: [American Samoan*]
      city: [Pago Pago]
    'NU':
      name: [Niue]
      demonym: [Niuean*]
      city: [Alofi]
    'PF':
      name: [French Polynesia]
      demonym: [French Polynesian*]
      city: [Papeete]
    'PN':
      name: [Pitcairn Islands, Pitcairn Island*]
      demonym: []
      city: [Adamstown]
    'TK':
      name: [Tokelau]
      demonym: [Tokelauan*]
      city: [Nukunonu]
    'TO':
      name: [Tonga]
      demonym: [Tongan*]
      city: [Nuku'alofa]
    'TV':
      name: [Tuvalu]
      demonym: [Tuvaluan*]
      city: [Funafuti]
    'WF':
      name: [Wallis and Futuna Islands, Wallis and Futuna Island*]
      demonym: []
      city: [Mata Utu]
    'WS':
      name: [Samoa]
      demonym: [Samoan*]
      city: [Apia]

The problem, however, is usage. This dictionary can still be used as usual, e.g. at levels 1 to 3:

library(quanteda)
tokens(c("Germany", "Frankfurt")) %>% tokens_lookup(dictionary(file = "english.yml"), levels = 3)

There is no easy way to "switch off" certain categories. The closest I can get with quanteda is something like this:

### switching off demonym and city
tokens(c("Germany", "Frankfurt")) %>%
  tokens_lookup(dictionary(file = "english.yml"), levels = 3:4) %>%
  tokens_remove(pattern = "demonym$|city$", valuetype = "regex")

Or the Unix-wizardry method (certainly not a solution for a package, but it works for me in my own project):

library(yaml)
# drop every line that defines a city or demonym entry, then parse what is left
clean_dict <- dictionary(yaml.load_file(textConnection(system("grep -Ev 'city:|demonym:' english.yml", intern = TRUE))))
tokens(c("Germany", "Frankfurt")) %>% tokens_lookup(clean_dict, levels = 3)
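
For anyone who cannot rely on grep (e.g. on Windows), the same filtering could probably be done in R after parsing. This is only a minimal sketch: `drop_categories()` is a hypothetical helper, not part of newsmap or quanteda, and a small inline excerpt stands in for english.yml.

```r
library(yaml)
library(quanteda)

# Hypothetical helper: recursively remove named entries such as "demonym"
# and "city" from the nested list that the YAML parser returns.
drop_categories <- function(x, drop = c("demonym", "city")) {
  # leaves (character vectors) and unnamed/empty lists are returned unchanged
  if (!is.list(x) || is.null(names(x))) return(x)
  x <- x[!names(x) %in% drop]
  lapply(x, drop_categories, drop = drop)
}

# Inline excerpt in the proposed format (an assumption, standing in for the file)
yml <- "
EUROPE:
  WEST:
    'DE':
      name: [Germany]
      demonym: [German*]
      city: [Berlin, Frankfurt]
    'FR':
      name: [France]
      demonym: [French*]
      city: [Paris]
"

nested <- drop_categories(yaml.load(yml))
clean_dict <- dictionary(nested)

# Only the country names remain, so "Frankfurt" should no longer match
toks <- tokens(c("Germany", "Frankfurt"))
res <- tokens_lookup(toks, clean_dict, levels = 3)
print(res)
```

With a real file, `yaml.load(yml)` would be replaced by `yaml.load_file("english.yml")`. Changing the `drop` argument selects which categories are switched off.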

koheiw commented 4 years ago

Thanks @chainsawriot, this makes sense to me. Breaking entries down into categories not only gives users a choice but also reduces the number of words missed in translation. It also works in the same way as the current dictionary in quanteda. The only problem is that the list gets really long. We could use a shorthand notation:

AFRICA:
  EAST:
    'BI': {country: [Burundi], people: [Burundian*], city: [Bujumbura]}
    'DJ': {country: [Djibouti], people: [Djiboutian*], city: [Djibouti]}
    'ER': {country: [Eritrea], people: [Eritrean*], city: [Asmara]}
    'ET': {country: [Ethiopia], people: [Ethiopian*], city: [Addis Ababa]}

but it looks like JSON... Is there a good way to make the file shorter?
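
On the JSON-like look: the one-line form is YAML's flow style, and as far as I can tell it parses to the same nested list as the indented block style. A quick check with the yaml package, using the Burundi entry with the `country`/`people`/`city` keys from the example above:

```r
library(yaml)

# The entry written in indented block style
block_style <- "
AFRICA:
  EAST:
    'BI':
      country: [Burundi]
      people: [Burundian*]
      city: [Bujumbura]
"

# The same entry written on one line in flow style
flow_style <- "
AFRICA:
  EAST:
    'BI': {country: [Burundi], people: [Burundian*], city: [Bujumbura]}
"

# Compare the parsed structures rather than the text
same <- identical(yaml.load(block_style), yaml.load(flow_style))
print(same)
```

If the two parse identically, the choice between them is purely about how readable the file is.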

As for "switching off" sub-categories, I started a discussion in a quanteda issue. Please join us.

chainsawriot commented 4 years ago

An additional consideration: maybe the second category is not simply "people". In English this is not a big problem, because a demonym (e.g. Japanese) is almost always an adjective as well (e.g. Japanese cuisine). But that does not always hold for other languages. I have worked with the German dictionary by @stefan-mueller:

  EAST:
    'CN':
      name: [China, Chinas, Volksrepublik China]
      demonym: [chinesisch*]
      city: [Peking, Shanghai]
    'HK':
      name: [Hongkong, Hongkongs]
      demonym: [Hongkonger]
      city: []
    'JP':
      name: [Japan, Japans]
      demonym: [japanisch*]
      city: [Tokyo, Tokio]

Usually, the country name category holds the country name in its original form (Japan) and as a genitive ("Genitivobjekt", Japans). The problem here is that the 2nd category is not always a demonym or a word for people. In the German dictionary it mostly contains adjectives (japanisch*, as in japanisches Restaurant), not demonyms such as Japaner/Japanerin.

In some cases, however, it needs to be a demonym. As you can see from the case of Hongkong, the 2nd category there is the demonym of Hongkong. I don't think there is a German adjective derived from the noun Hongkong (caveat: my German is only at B1 level).

I can foresee a similar issue with Chinese and Japanese. A reasonable segmenter would split demonyms and adjectives in these two languages (e.g. 米国人 becomes 米国 and 人), so the 2nd category might not be very useful.

tokens(c("ドナルド・トランプは米国人です。"))

I don't have a good suggestion for what to call the second category.

koheiw commented 4 years ago

I called the second category "people" only because demonym is

a word (such as Nevadan or Sooner) used to denote a person who inhabits or is native to a particular place

Your categories seem like "base" and "derivative", but we should define categories based on how we will use them rather than on formal definitions. Why did you want to "switch off" some of the words in your projects?

chainsawriot commented 4 years ago

Why did you want to "switch off" some of the words in your projects?

The application was actually simple: before fitting the model, I wanted to have some descriptive information about my corpus, e.g. total number of exact matches of a country name.
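
A rough sketch of that kind of descriptive count, assuming the proposed name/demonym/city layout: look up at levels 3:4 and keep only the keys ending in `name`. The two-country dictionary and the example texts are made up for illustration.

```r
library(quanteda)

# Hypothetical excerpt in the proposed layout (not the real english.yml)
dict <- dictionary(list(
  EUROPE = list(WEST = list(
    DE = list(name = "Germany", demonym = "German*",
              city = c("Berlin", "Frankfurt")),
    FR = list(name = "France", demonym = "French*", city = "Paris")
  ))
))

toks <- tokens(c(d1 = "Germany and France met in Berlin",
                 d2 = "German voters near Germany"))

# Keys at levels 3:4 combine the country code and the category,
# so a regex on the key suffix keeps only the country-name matches
hits <- tokens_lookup(toks, dict, levels = 3:4)
hits <- tokens_keep(hits, pattern = "name$", valuetype = "regex")

# Total number of country-name matches per country across the corpus
name_counts <- colSums(dfm(hits, tolower = FALSE))
print(name_counts)
```

Note that the glob `German*` also matches the token "Germany", so without the filter on `name$` the demonym category would add to the counts for Germany.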