A language detection algorithm for lang:de should cover German dialects and regional varieties, e.g. lang:nds (Low German), lang:gsw (Swiss German), lang:de-AT (Austrian German).
The corpus consists entirely of lang:nds text examples from regions in north-western Germany and the north-eastern Netherlands. We therefore add Standard German and Standard Dutch sentence examples with label=0 to the dataset, each making up 25% of the final dataset. The result is a balanced 50/50 binary classification dataset.
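The mixing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the dataset: the function name build_binary_dataset, the three input lists, the seed, and the use of label=1 for lang:nds sentences are all hypothetical (the text above only fixes label=0 for the added Standard German and Standard Dutch sentences).

```python
import random


def build_binary_dataset(nds, de, nl, seed=0):
    """Mix sentences so that about 50% are nds and 25% each are de and nl.

    Assumption: nds sentences get label 1. The added Standard German and
    Standard Dutch sentences get label 0, as described in the text.
    """
    rng = random.Random(seed)
    # Each negative source contributes half as many sentences as the nds corpus.
    n_neg = len(nds) // 2
    if len(de) < n_neg or len(nl) < n_neg:
        raise ValueError("not enough Standard German or Standard Dutch sentences")
    negatives = rng.sample(de, n_neg) + rng.sample(nl, n_neg)
    # If the nds corpus has an odd size, drop one sentence so both classes match.
    positives = nds[: 2 * n_neg]
    data = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    rng.shuffle(data)
    return data
```

Sampling the negatives at random (rather than taking the first sentences of each list) avoids inheriting any ordering of the Standard German and Standard Dutch sources; the fixed seed keeps the split reproducible.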
Class labels
Achterhoek (ACH),
Drenthe (DRE),
Groningen (GRO) (not included, presumably due to copyright issues)
Lower Prussia (NPR) (dead dialect)
Add random examples for two further class labels