A language detection algorithm for lang:de should cover German dialects and regional varieties, e.g. lang:nds (Low German), lang:gsw (Swiss German), lang:de-AT (Austrian German).
The corpus consists entirely of lang:gsw text examples from the German-speaking part of Switzerland. We therefore add Standard German sentence examples with label=0 until they make up 50% of the dataset. The result is a 50/50 balanced binary classification dataset.
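The balancing step described above could look roughly like the following. This is a minimal sketch, not the actual pipeline: the function name `build_balanced_dataset`, the toy sentences, and the use of label=1 for lang:gsw are assumptions made for illustration (the text only states that Standard German gets label=0).

```python
import random


def build_balanced_dataset(gsw_sentences, de_sentences, seed=0):
    """Combine lang:gsw sentences with an equal number of Standard German ones."""
    if len(de_sentences) < len(gsw_sentences):
        raise ValueError("need at least as many Standard German as lang:gsw sentences")
    rng = random.Random(seed)
    # Draw as many Standard German sentences as there are lang:gsw sentences.
    de_sample = rng.sample(de_sentences, k=len(gsw_sentences))
    # label=1 for lang:gsw is an assumption; the description only fixes label=0.
    examples = [(s, 1) for s in gsw_sentences] + [(s, 0) for s in de_sample]
    rng.shuffle(examples)
    return examples


# Made-up toy sentences, for illustration only.
gsw = ["Grüezi mitenand", "Ich ha di gärn", "Das isch guet"]
de = ["Guten Tag zusammen", "Ich habe dich gern", "Das ist gut", "Wie geht es dir?"]
dataset = build_balanced_dataset(gsw, de)
```

Sampling without replacement keeps the Standard German half free of duplicates, and the fixed seed makes the split reproducible.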
Setup:
Class labels
Random examples with extra class labels