LSDC: The Low Saxon Dialect Classification dataset (VarDial 2020)

Info

Setup:

A language detection algorithm for lang:de should also cover German dialects and regional varieties, e.g. lang:nds (Low German), lang:gsw (Swiss German), lang:de-AT (Austrian German).
The corpus consists entirely (100%) of lang:nds text examples from regions in North-West Germany and the North-East Netherlands. We therefore add Standard German and Standard Dutch sentence examples with label=0, each making up 25% of the final dataset. The result is a balanced 50/50 binary classification dataset.
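The mixing step described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `build_balanced_binary`, the toy sentences, the fixed seed, and the choice of label=1 for the lang:nds examples are all assumptions (the text only states label=0 for the added Standard German and Standard Dutch sentences).

```python
import random


def build_balanced_binary(nds, german, dutch, seed=42):
    """Mix Low Saxon sentences with Standard German/Dutch negatives.

    Hypothetical helper: label 1 for lang:nds is an assumption,
    label 0 for the added sentences follows the description.
    """
    rng = random.Random(seed)
    n_pos = len(nds)
    # Split the negatives evenly, so German and Dutch each make up
    # about 25% of the final dataset.
    n_de = n_pos // 2
    n_nl = n_pos - n_de
    if len(german) < n_de or len(dutch) < n_nl:
        raise ValueError("not enough Standard German/Dutch sentences")
    negatives = rng.sample(german, n_de) + rng.sample(dutch, n_nl)
    data = [(s, 1) for s in nds] + [(s, 0) for s in negatives]
    rng.shuffle(data)
    return data


# Toy placeholder sentences, not real corpus data.
nds = [f"nds sentence {i}" for i in range(4)]
german = [f"de sentence {i}" for i in range(10)]
dutch = [f"nl sentence {i}" for i in range(10)]

dataset = build_balanced_binary(nds, german, dutch)
```

With equally many positives and negatives, a classifier that always predicts one class reaches 50% accuracy, which gives a simple baseline for the binary task.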

Class labels

Add random examples for two further class labels

ulf1 / sentence-embedding-evaluation-german