stashapp / CommunityScrapers

This is a public repository containing scrapers created by the Stash Community.
https://stashapp.github.io/CommunityScrapers/
GNU Affero General Public License v3.0

Support nationality to country post-processing option #924

Closed by Jacintocuenca 7 months ago

Jacintocuenca commented 2 years ago

Right now, some websites list a performer's country (Spain, Peru...) whereas others list the nationality (Spanish, Peruvian...). The flag on the performer page only seems to parse countries, not nationalities. Would it make sense to add a post-processing option (similar to feetToCm or lbToKg) that converts nationalities to countries, for more standardization?

bnkai commented 2 years ago

I don't know how many scrapers need this. We had the map below in the previous IAFD scraper; it can be added (after adjusting the selector) to any XPath or JSON scraper that scrapes countries:

      Country:
        selector: //div/p[text()='Nationality']/following-sibling::p[1]/text()
        postProcess:
          - map:
              # https://en.wikipedia.org/wiki/List_of_adjectival_and_demonymic_forms_for_countries_and_nations
              Abkhaz: "Abkhazia"
              Abkhazian: "Abkhazia"
              Afghan: "Afghanistan"
              Albanian: "Albania"
              Algerian: "Algeria"
              American Samoan: "American Samoa"
              American: "United States of America"
              Andorran: "Andorra"
              Angolan: "Angola"
              Anguillan: "Anguilla"
              Antarctic: "Antarctica"
              Antiguan: "Antigua and Barbuda"
              Argentine: "Argentina"
              Argentinian: "Argentina"
              Armenian: "Armenia"
              Aruban: "Aruba"
              Australian: "Australia"
              Austrian: "Austria"
              Azerbaijani: "Azerbaijan"
              Azeri: "Azerbaijan"
              Bahamian: "Bahamas"
              Bahraini: "Bahrain"
              Bangladeshi: "Bangladesh"
              Barbadian: "Barbados"
              Barbudan: "Antigua and Barbuda"
              Basotho: "Lesotho"
              Belarusian: "Belarus"
              Belgian: "Belgium"
              Belizean: "Belize"
              Beninese: "Benin"
              Beninois: "Benin"
              Bermudan: "Bermuda"
              Bermudian: "Bermuda"
              Bhutanese: "Bhutan"
              BIOT: "British Indian Ocean Territory"
              Bissau-Guinean: "Guinea-Bissau"
              Bolivian: "Bolivia"
              Bonaire: "Bonaire"
              Bonairean: "Bonaire"
              Bosnian: "Bosnia and Herzegovina"
              Botswanan: "Botswana"
              Bouvet Island: "Bouvet Island"
              Brazilian: "Brazil"
              British Virgin Island: "Virgin Islands, British"
              British: "United Kingdom"
              Bruneian: "Brunei"
              Bulgarian: "Bulgaria"
              Burkinabé: "Burkina Faso"
              Burmese: "Burma"
              Burundian: "Burundi"
              Cabo Verdean: "Cabo Verde"
              Cambodian: "Cambodia"
              Cameroonian: "Cameroon"
              Canadian: "Canada"
              Cantonese: "Hong Kong"
              Caymanian: "Cayman Islands"
              Central African: "Central African Republic"
              Chadian: "Chad"
              Channel Island: "Guernsey"
              #Channel Island: "Jersey"
              Chilean: "Chile"
              Chinese: "China"
              Christmas Island: "Christmas Island"
              Cocos Island: "Cocos (Keeling) Islands"
              Colombian: "Colombia"
              Comoran: "Comoros"
              Comorian: "Comoros"
              Congolese: "Congo"
              Cook Island: "Cook Islands"
              Costa Rican: "Costa Rica"
              Croatian: "Croatia"
              Cuban: "Cuba"
              Curaçaoan: "Curaçao"
              Cypriot: "Cyprus"
              Czech: "Czech Republic"
              Danish: "Denmark"
              Djiboutian: "Djibouti"
              Dominican: "Dominica"
              Dutch: "Netherlands"
              Ecuadorian: "Ecuador"
              Egyptian: "Egypt"
              Emirati: "United Arab Emirates"
              Emiri: "United Arab Emirates"
              Emirian: "United Arab Emirates"
              English people: "England"
              English: "England"
              Equatoguinean: "Equatorial Guinea"
              Equatorial Guinean: "Equatorial Guinea"
              Eritrean: "Eritrea"
              Estonian: "Estonia"
              Ethiopian: "Ethiopia"
              European: "European Union"
              Falkland Island: "Falkland Islands"
              Faroese: "Faroe Islands"
              Fijian: "Fiji"
              Filipino: "Philippines"
              Finnish: "Finland"
              Formosan: "Taiwan"
              French Guianese: "French Guiana"
              French Polynesian: "French Polynesia"
              French Southern Territories: "French Southern Territories"
              French: "France"
              Futunan: "Wallis and Futuna"
              Gabonese: "Gabon"
              Gambian: "Gambia"
              Georgian: "Georgia"
              German: "Germany"
              Ghanaian: "Ghana"
              Gibraltar: "Gibraltar"
              Greek: "Greece"
              Greenlandic: "Greenland"
              Grenadian: "Grenada"
              Guadeloupe: "Guadeloupe"
              Guamanian: "Guam"
              Guatemalan: "Guatemala"
              Guinean: "Guinea"
              Guyanese: "Guyana"
              Haitian: "Haiti"
              Heard Island: "Heard Island and McDonald Islands"
              Hellenic: "Greece"
              Herzegovinian: "Bosnia and Herzegovina"
              Honduran: "Honduras"
              Hong Kong: "Hong Kong"
              Hong Konger: "Hong Kong"
              Hungarian: "Hungary"
              Icelandic: "Iceland"
              Indian: "India"
              Indonesian: "Indonesia"
              Iranian: "Iran"
              Iraqi: "Iraq"
              Irish: "Ireland"
              Israeli: "Israel"
              Israelite: "Israel"
              Italian: "Italy"
              Ivorian: "Ivory Coast"
              Jamaican: "Jamaica"
              Jan Mayen: "Jan Mayen"
              Japanese: "Japan"
              Jordanian: "Jordan"
              Kazakh: "Kazakhstan"
              Kazakhstani: "Kazakhstan"
              Kenyan: "Kenya"
              Kirghiz: "Kyrgyzstan"
              Kirgiz: "Kyrgyzstan"
              Kiribati: "Kiribati"
              Korean: "South Korea"
              Kosovan: "Kosovo"
              Kosovar: "Kosovo"
              Kuwaiti: "Kuwait"
              Kyrgyz: "Kyrgyzstan"
              Kyrgyzstani: "Kyrgyzstan"
              Lao: "Lao People's Democratic Republic"
              Laotian: "Lao People's Democratic Republic"
              Latvian: "Latvia"
              Lebanese: "Lebanon"
              Lettish: "Latvia"
              Liberian: "Liberia"
              Libyan: "Libya"
              Liechtensteiner: "Liechtenstein"
              Lithuanian: "Lithuania"
              Luxembourg: "Luxembourg"
              Luxembourgish: "Luxembourg"
              Macanese: "Macau"
              Macedonian: "North Macedonia"
              Magyar: "Hungary"
              Mahoran: "Mayotte"
              Malagasy: "Madagascar"
              Malawian: "Malawi"
              Malaysian: "Malaysia"
              Maldivian: "Maldives"
              Malian: "Mali"
              Malinese: "Mali"
              Maltese: "Malta"
              Manx: "Isle of Man"
              Marshallese: "Marshall Islands"
              Martinican: "Martinique"
              Martiniquais: "Martinique"
              Mauritanian: "Mauritania"
              Mauritian: "Mauritius"
              McDonald Islands: "Heard Island and McDonald Islands"
              Mexican: "Mexico"
              Moldovan: "Moldova"
              Monacan: "Monaco"
              Mongolian: "Mongolia"
              Montenegrin: "Montenegro"
              Montserratian: "Montserrat"
              Monégasque: "Monaco"
              Moroccan: "Morocco"
              Motswana: "Botswana"
              Mozambican: "Mozambique"
              Myanma: "Myanmar"
              Namibian: "Namibia"
              Nauruan: "Nauru"
              Nepalese: "Nepal"
              Nepali: "Nepal"
              Netherlandic: "Netherlands"
              New Caledonian: "New Caledonia"
              New Zealand: "New Zealand"
              Ni-Vanuatu: "Vanuatu"
              Nicaraguan: "Nicaragua"
              Nigerian: "Nigeria"
              Nigerien: "Niger"
              Niuean: "Niue"
              Norfolk Island: "Norfolk Island"
              Northern Irish: "Northern Ireland"
              Northern Marianan: "Northern Mariana Islands"
              Norwegian: "Norway"
              Omani: "Oman"
              Pakistani: "Pakistan"
              Palauan: "Palau"
              Palestinian: "Palestine"
              Panamanian: "Panama"
              Papua New Guinean: "Papua New Guinea"
              Papuan: "Papua New Guinea"
              Paraguayan: "Paraguay"
              Persian: "Iran"
              Peruvian: "Peru"
              Philippine: "Philippines"
              Pitcairn Island: "Pitcairn Islands"
              Polish: "Poland"
              Portuguese: "Portugal"
              Puerto Rican: "Puerto Rico"
              Qatari: "Qatar"
              Romanian: "Romania"
              Russian: "Russia"
              Rwandan: "Rwanda"
              Saba: "Saba"
              Saban: "Saba"
              Sahraouian: "Western Sahara"
              Sahrawi: "Western Sahara"
              Sahrawian: "Western Sahara"
              Salvadoran: "El Salvador"
              Sammarinese: "San Marino"
              Samoan: "Samoa"
              Saudi Arabian: "Saudi Arabia"
              Saudi: "Saudi Arabia"
              Scottish: "Scotland"
              Senegalese: "Senegal"
              Serbian: "Serbia"
              Seychellois: "Seychelles"
              Sierra Leonean: "Sierra Leone"
              Singapore: "Singapore"
              Singaporean: "Singapore"
              Slovak: "Slovakia"
              Slovene: "Slovenia"
              Slovenian: "Slovenia"
              Solomon Island: "Solomon Islands"
              Somali: "Somalia"
              Somalilander: "Somaliland"
              South African: "South Africa"
              South Georgia Island: "South Georgia and the South Sandwich Islands"
              South Ossetian: "South Ossetia"
              South Sandwich Island: "South Georgia and the South Sandwich Islands"
              South Sudanese: "South Sudan"
              Spanish: "Spain"
              Sri Lankan: "Sri Lanka"
              Sudanese: "Sudan"
              Surinamese: "Suriname"
              Svalbard resident: "Svalbard"
              Swati: "Eswatini"
              Swazi: "Eswatini"
              Swedish: "Sweden"
              Swiss: "Switzerland"
              Syrian: "Syrian Arab Republic"
              Taiwanese: "Taiwan"
              Tajikistani: "Tajikistan"
              Tanzanian: "Tanzania"
              Thai: "Thailand"
              Timorese: "Timor-Leste"
              Tobagonian: "Trinidad and Tobago"
              Togolese: "Togo"
              Tokelauan: "Tokelau"
              Tongan: "Tonga"
              Trinidadian: "Trinidad and Tobago"
              Tunisian: "Tunisia"
              Turkish: "Turkey"
              Turkmen: "Turkmenistan"
              Turks and Caicos Island: "Turks and Caicos Islands"
              Tuvaluan: "Tuvalu"
              Ugandan: "Uganda"
              Ukrainian: "Ukraine"
              Uruguayan: "Uruguay"
              Uzbek: "Uzbekistan"
              Uzbekistani: "Uzbekistan"
              Vanuatuan: "Vanuatu"
              Vatican: "Vatican City State"
              Venezuelan: "Venezuela"
              Vietnamese: "Vietnam"
              Wallis and Futuna: "Wallis and Futuna"
              Wallisian: "Wallis and Futuna"
              Welsh: "Wales"
              Yemeni: "Yemen"
              Zambian: "Zambia"
              Zimbabwean: "Zimbabwe"
              Åland Island: "Åland Islands"

nrg101 commented 1 year ago

In #1288, I've created a Python base class that provides this conversion. It can be used in a derived (child) class like this:

performer['country'] = self._get_country_for_nationality(api_result['nationality'])

I realise this isn't useful for the XPath scrapers, but I thought I'd mention it as a solution for any Python scrapers.
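
For illustration, here is a minimal sketch of what such a helper could look like. It is not the code from #1288, which is not shown in this thread and may differ. The class names `BaseScraper` and `ExampleScraper`, the `parse_performer` method, and the choice to pass unmapped values through unchanged are all assumptions; the dictionary is a small subset of the map posted above.

```python
from typing import Optional

# Small illustrative subset of the nationality-to-country map from this thread.
NATIONALITY_TO_COUNTRY = {
    "American": "United States of America",
    "British": "United Kingdom",
    "Dutch": "Netherlands",
    "Peruvian": "Peru",
    "Spanish": "Spain",
}


class BaseScraper:
    """Hypothetical base class; the real one in #1288 may look different."""

    def _get_country_for_nationality(self, nationality: Optional[str]) -> Optional[str]:
        # Missing or empty input yields no country.
        if not nationality:
            return None
        cleaned = nationality.strip()
        # Assumption: values without a mapping (for example a site that
        # already returns "Spain") are passed through unchanged.
        return NATIONALITY_TO_COUNTRY.get(cleaned, cleaned)


class ExampleScraper(BaseScraper):
    """Hypothetical derived scraper using the helper as described above."""

    def parse_performer(self, api_result: dict) -> dict:
        performer = {"name": api_result.get("name")}
        performer["country"] = self._get_country_for_nationality(
            api_result.get("nationality")
        )
        return performer
```

Passing unmapped values through means the same helper can be used on sites that mix countries and nationalities, which matches the situation described at the top of this issue.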