voc / voctoweb

voctoweb – the frontend and backend software behind media.ccc.de
GNU General Public License v3.0

Different language abbreviations in c3subtitles and media #99

Closed percidae closed 8 years ago

percidae commented 8 years ago

c3subtitles uses Amara's language codes, including for uploading the *.srt files. The following languages are currently active:

However, this selection can change and grow with a single click by a user. In total, c3subtitles stores 311 language codes: all of Amara's plus None. The names shown in full are Amara's English language names.

It would make sense to agree on the same codes here, or on a translation table. This affects languages.rb @manno @saerdnaer

saerdnaer commented 8 years ago

So far we have the following entries in languages.rb:

    'deu' => 'de',
    'eng' => 'en',
    'fra' => 'fr',
    'gsw' => 'de-ch'

TODO
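As a rough illustration of how a mapping like this could serve as a two-way translation table, here is a minimal Ruby sketch. Only the four code pairs come from the entries above; the constant and method names (`ISO_TO_AMARA`, `AMARA_TO_ISO`, `amara_code_for`, `iso_code_for`) are made up for this example and are not taken from the actual languages.rb.

```ruby
# Hypothetical sketch: the four pairs are the ones quoted above,
# all names are invented for illustration.
ISO_TO_AMARA = {
  'deu' => 'de',
  'eng' => 'en',
  'fra' => 'fr',
  'gsw' => 'de-ch'
}.freeze

# Reverse direction, derived from the same data so both stay in sync.
AMARA_TO_ISO = ISO_TO_AMARA.invert.freeze

# Returns the Amara code for a three-letter code, or nil if unknown.
def amara_code_for(iso_code)
  ISO_TO_AMARA[iso_code]
end

# Returns the three-letter code for an Amara code, or nil if unknown.
def iso_code_for(amara_code)
  AMARA_TO_ISO[amara_code]
end
```

Deriving the reverse hash with `invert` only works while every value is unique, which holds for these four entries but would need checking for a longer list.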

saerdnaer commented 8 years ago

For the sake of completeness, here is the complete list from Amara:

{
    "languages": {
        "aa": "Afar",
        "ab": "Abkhazian",
        "ae": "Avestan",
        "aeb": "Tunisian Arabic",
        "af": "Afrikaans",
        "aka": "Akan",
        "amh": "Amharic",
        "an": "Aragonese",
        "ar": "Arabic",
        "arc": "Aramaic",
        "arq": "Algerian Arabic",
        "as": "Assamese",
        "ase": "American Sign Language",
        "ast": "Asturian",
        "av": "Avaric",
        "ay": "Aymara",
        "az": "Azerbaijani",
        "ba": "Bashkir",
        "bam": "Bambara",
        "be": "Belarusian",
        "ber": "Berber",
        "bg": "Bulgarian",
        "bh": "Bihari",
        "bi": "Bislama",
        "bn": "Bengali",
        "bnt": "Ibibio",
        "bo": "Tibetan",
        "br": "Breton",
        "bs": "Bosnian",
        "bug": "Buginese",
        "ca": "Catalan",
        "cak": "Cakchiquel, Central",
        "ce": "Chechen",
        "ceb": "Cebuano",
        "ch": "Chamorro",
        "cho": "Choctaw",
        "cku": "Koasati",
        "cnh": "Hakha Chin",
        "co": "Corsican",
        "cr": "Cree",
        "cs": "Czech",
        "ctd": "Chin, Tedim",
        "ctu": "Chol, Tumbal\u00e1",
        "cu": "Church Slavic",
        "cv": "Chuvash",
        "cy": "Welsh",
        "da": "Danish",
        "de": "German",
        "de-ch": "German (Switzerland)",
        "dsb": "Lower Sorbian",
        "dv": "Divehi",
        "dz": "Dzongkha",
        "ee": "Ewe",
        "efi": "Efik",
        "el": "Greek",
        "en": "English",
        "en-gb": "English, British",
        "eo": "Esperanto",
        "es": "Spanish",
        "es-ar": "Spanish, Argentinian",
        "es-mx": "Spanish, Mexican",
        "es-ni": "Spanish, Nicaraguan",
        "et": "Estonian",
        "eu": "Basque",
        "fa": "Persian",
        "ff": "Fulah",
        "fi": "Finnish",
        "fil": "Filipino",
        "fj": "Fijian",
        "fo": "Faroese",
        "fr": "French",
        "fr-ca": "French (Canada)",
        "ful": "Fula",
        "fy-nl": "Frisian",
        "ga": "Irish",
        "gd": "Scottish Gaelic",
        "gl": "Galician",
        "gn": "Guaran",
        "got": "Gothic",
        "gsw": "Swiss German",
        "gu": "Gujarati",
        "gv": "Manx",
        "hai": "Haida",
        "hau": "Hausa",
        "haw": "Hawaiian",
        "haz": "Hazaragi",
        "hb": "HamariBoli (Roman Hindi-Urdu)",
        "hch": "Huichol",
        "he": "Hebrew",
        "hi": "Hindi",
        "ho": "Hiri Motu",
        "hr": "Croatian",
        "hsb": "Upper Sorbian",
        "ht": "Creole, Haitian",
        "hu": "Hungarian",
        "hup": "Hupa",
        "hus": "Huastec, Veracruz",
        "hy": "Armenian",
        "hz": "Herero",
        "ia": "Interlingua",
        "ibo": "Igbo",
        "id": "Indonesian",
        "ie": "Interlingue",
        "ii": "Sichuan Yi",
        "ik": "Inupia",
        "ilo": "Ilocano",
        "inh": "Ingush",
        "io": "Ido",
        "iro": "Iroquoian languages",
        "is": "Icelandic",
        "it": "Italian",
        "iu": "Inuktitut",
        "ja": "Japanese",
        "jv": "Javanese",
        "ka": "Georgian",
        "kaa": "Karakalpak",
        "kar": "Karen",
        "kau": "Kanuri",
        "kik": "Gikuyu",
        "kin": "Rwandi",
        "kj": "Kuanyama, Kwanyama",
        "kk": "Kazakh",
        "kl": "Greenlandic",
        "km": "Khmer",
        "kn": "Kannada",
        "ko": "Korean",
        "kon": "Kongo",
        "ks": "Kashmiri",
        "ksh": "Colognian",
        "ku": "Kurdish",
        "kv": "Komi",
        "kw": "Cornish",
        "ky": "Kyrgyz",
        "la": "Latin",
        "lb": "Luxembourgish",
        "lg": "Ganda",
        "li": "Limburgish",
        "lin": "Lingala",
        "lkt": "Lakota",
        "lld": "Ladin",
        "lo": "Lao",
        "lt": "Lithuanian",
        "ltg": "Latgalian",
        "lu": "Luba-Katagana",
        "lua": "Luba-Kasai",
        "luo": "Luo",
        "lus": "Mizo",
        "lut": "Lushootseed",
        "luy": "Luhya",
        "lv": "Latvian",
        "mad": "Madurese",
        "meta-audio": "Metadata: Audio Description",
        "meta-geo": "Metadata: Geo",
        "meta-tw": "Metadata: Twitter",
        "meta-video": "Metadata: Video Description",
        "meta-wiki": "Metadata: Wikipedia",
        "mfe": "Mauritian Creole",
        "mh": "Marshallese",
        "mi": "Maori",
        "mk": "Macedonian",
        "ml": "Malayalam",
        "mlg": "Malagasy",
        "mn": "Mongolian",
        "mni": "Manipuri",
        "mnk": "Mandinka",
        "mo": "Moldavian, Moldovan",
        "moh": "Mohawk",
        "mos": "Mossi",
        "mr": "Marathi",
        "ms": "Malay",
        "mt": "Maltese",
        "mus": "Muscogee",
        "my": "Burmese",
        "na": "Naurunan",
        "nan": "Hokkien",
        "nb": "Norwegian Bokmal",
        "nci": "Nahuatl, Classical",
        "nd": "North Ndebele",
        "ne": "Nepali",
        "ng": "Ndonga",
        "nl": "Dutch",
        "nn": "Norwegian Nynorsk",
        "nr": "Southern Ndebele",
        "nso": "Northern Sotho",
        "nv": "Navajo",
        "nya": "Chewa",
        "oc": "Occitan",
        "oji": "Ojibwe",
        "or": "Oriya",
        "orm": "Oromo",
        "os": "Ossetian, Ossetic",
        "pam": "Kapampangan",
        "pan": "Punjabi",
        "pap": "Papiamento",
        "pi": "Pali",
        "pl": "Polish",
        "pnb": "Western Punjabi",
        "prs": "Dari",
        "ps": "Pashto",
        "pt": "Portuguese",
        "pt-br": "Portuguese, Brazilian",
        "que": "Quechua",
        "qvi": "Quichua, Imbabura Highland",
        "raj": "Rajasthani",
        "rm": "Romansh",
        "ro": "Romanian",
        "ru": "Russian",
        "run": "Rundi",
        "rup": "Macedo",
        "ry": "Rusyn",
        "sa": "Sanskrit",
        "sc": "Sardinian",
        "scn": "Sicilian",
        "sco": "Scots",
        "sd": "Sindhi",
        "se": "Northern Sami",
        "sg": "Sango",
        "sgn": "Sign Languages",
        "sh": "Serbo-Croatian",
        "si": "Sinhala",
        "sk": "Slovak",
        "skx": "Seko Padang",
        "sl": "Slovenian",
        "sm": "Samoan",
        "sna": "Shona",
        "som": "Somali",
        "sot": "Sotho",
        "sq": "Albanian",
        "sr": "Serbian",
        "sr-latn": "Serbian, Latin",
        "srp": "Montenegrin",
        "ss": "Swati",
        "su": "Sundanese",
        "sv": "Swedish",
        "swa": "Swahili",
        "szl": "Silesian",
        "ta": "Tamil",
        "tar": "Tarahumara, Central",
        "te": "Telugu",
        "tet": "Tetum",
        "tg": "Tajik",
        "th": "Thai",
        "tir": "Tigrinya",
        "tk": "Turkmen",
        "tl": "Tagalog",
        "tlh": "Klingon",
        "to": "Tonga",
        "toj": "Tojolabal",
        "tr": "Turkish",
        "ts": "Tsonga",
        "tsn": "Tswana",
        "tsz": "Purepecha",
        "tt": "Tatar",
        "tw": "Twi",
        "ty": "Tahitian",
        "tzh": "Tzeltal, Oxchuc",
        "tzo": "Tzotzil, Venustiano Carranza",
        "ug": "Uyghur",
        "uk": "Ukrainian",
        "umb": "Umbundu",
        "ur": "Urdu",
        "uz": "Uzbek",
        "ve": "Venda",
        "vi": "Vietnamese",
        "vls": "Flemish",
        "vo": "Volapuk",
        "wa": "Walloon",
        "wbl": "Wakhi",
        "wol": "Wolof",
        "xho": "Xhosa",
        "yaq": "Yaqui",
        "yi": "Yiddish",
        "yor": "Yoruba",
        "yua": "Maya, Yucat\u00e1n",
        "za": "Zhuang, Chuang",
        "zam": "Zapotec, Miahuatl\u00e1n",
        "zh": "Chinese, Yue",
        "zh-cn": "Chinese, Simplified",
        "zh-hk": "Chinese, Traditional (Hong Kong)",
        "zh-sg": "Chinese, Simplified (Singaporean)",
        "zh-tw": "Chinese, Traditional",
        "zul": "Zulu"
    }
}

https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes also has a table that shows the different codes side by side. If I see this correctly, our three-letter codes (TLCs) are ISO 639-2/T, right? //cc @a-tze

@manno What's your opinion: should we import all languages right away, or add them one by one whenever a subtitle in the respective language comes in?

a-tze commented 8 years ago

Yep, we now use 639-2, because those are also the codes used in MP4.

percidae commented 8 years ago

Okay, then I'll build a translation table on my side.

ISO 639-2 includes "mul" for multiple languages, which would be a good name for "original" (i.e. the Klingon workaround).

For now I would keep the file naming in the cdn on the Amara codes. I also just noticed that Amara's list seems to have shrunk; mine is longer.

If a language has no code in that column, the subtitle file is not announced to media but still ends up in the cdn. That is, if this presumably rare case occurs at all.
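The rule described here could be sketched roughly as follows. This is only an illustration under assumptions: the table excerpt, the constant `AMARA_TO_ISO_639_2`, the method `publication_plan` and the shape of its return value are all hypothetical and not part of c3subtitles or voctoweb.

```ruby
# Hypothetical excerpt of a translation table: Amara code => ISO 639-2 code.
# A nil value stands for "no ISO 639-2 code in that column".
AMARA_TO_ISO_639_2 = {
  'de'       => 'deu',
  'en'       => 'eng',
  'fr'       => 'fra',
  'de-ch'    => 'gsw',
  'meta-geo' => nil
}.freeze

# Decides what happens to a subtitle file in a given Amara language.
def publication_plan(amara_code)
  iso_code = AMARA_TO_ISO_639_2[amara_code]
  {
    cdn_language: amara_code,          # file naming on the cdn stays on the Amara code
    media_language: iso_code,          # nil when there is no ISO 639-2 code
    announce_to_media: !iso_code.nil?  # only announce when a code exists
  }
end
```

With this sketch, `publication_plan('meta-geo')` still yields a cdn language but sets `announce_to_media` to `false`, which mirrors the rare case mentioned above.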

a-tze commented 8 years ago

Sounds good!

percidae commented 8 years ago

Unfortunately, ISO 639-2 is not entirely unambiguous; see e.g. "ger" and "deu". This is the most recent list: ISO 639-2, also available as a UTF-8 text file: ISO-639-2_utf-8.txt

Since media uses "deu", I would stick to the "terminology code" everywhere (see the first link, at the top of the description).
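To make the "ger" vs. "deu" point concrete: ISO 639-2 assigns two codes, a bibliographic (B) and a terminology (T) one, to a small set of languages. A minimal sketch of normalizing everything to the terminology code could look like this. The B/T pairs are the ones defined by ISO 639-2; the constant and method names are hypothetical.

```ruby
# Languages for which ISO 639-2 defines both a bibliographic (B)
# and a terminology (T) code, as B => T.
BIBLIOGRAPHIC_TO_TERMINOLOGY = {
  'alb' => 'sqi', 'arm' => 'hye', 'baq' => 'eus', 'bur' => 'mya',
  'chi' => 'zho', 'cze' => 'ces', 'dut' => 'nld', 'fre' => 'fra',
  'geo' => 'kat', 'ger' => 'deu', 'gre' => 'ell', 'ice' => 'isl',
  'mac' => 'mkd', 'mao' => 'mri', 'may' => 'msa', 'per' => 'fas',
  'rum' => 'ron', 'slo' => 'slk', 'tib' => 'bod', 'wel' => 'cym'
}.freeze

# Returns the terminology code for a three-letter code.
# Codes without a separate B variant are returned unchanged (lowercased).
def terminology_code(code)
  normalized = code.downcase
  BIBLIOGRAPHIC_TO_TERMINOLOGY.fetch(normalized, normalized)
end
```

Both `terminology_code('ger')` and `terminology_code('deu')` then resolve to `'deu'`, so lookups no longer depend on which variant a source happens to use.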

percidae commented 8 years ago

I have brought everything up to date on my side. One column now contains the 639-2 codes, where available. In addition, there is a column with the 639-1 codes (where they exist) and the name of the language written out in German and English.
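One possible way to picture a row of such a table is sketched below. The column layout follows the description above, but the struct, its field names and the three sample rows are assumptions made for illustration; the real table in c3subtitles may be organized differently.

```ruby
# Hypothetical row layout: Amara code, ISO 639-2 code, ISO 639-1 code,
# and the language name written out in English and German.
LanguageRow = Struct.new(:amara, :iso_639_2, :iso_639_1, :name_en, :name_de,
                         keyword_init: true)

# Illustrative sample rows; nil marks a column without a value.
LANGUAGE_ROWS = [
  LanguageRow.new(amara: 'de', iso_639_2: 'deu', iso_639_1: 'de',
                  name_en: 'German', name_de: 'Deutsch'),
  LanguageRow.new(amara: 'tlh', iso_639_2: 'tlh', iso_639_1: nil,
                  name_en: 'Klingon', name_de: 'Klingonisch'),
  LanguageRow.new(amara: 'meta-geo', iso_639_2: nil, iso_639_1: nil,
                  name_en: 'Metadata: Geo', name_de: nil)
].freeze

# Index by Amara code for direct lookups.
ROWS_BY_AMARA = LANGUAGE_ROWS.map { |row| [row.amara, row] }.to_h.freeze

# Returns the row for an Amara code, or nil if the code is not in the table.
def row_for(amara_code)
  ROWS_BY_AMARA[amara_code]
end
```

Keeping the Amara code as the lookup key matches the idea of leaving the cdn file names on Amara codes while exposing the ISO columns to other consumers.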

Can I send you some of this, or where does it need to go, and which parts of it? @saerdnaer @a-tze @manno @MaZderMind ?

(It is not clear to me what exactly languages.rb does in this context, since "de-ch" is not an ISO 639-1 code as far as I can see.)