oscar-project/corpus
corpus issues.
Apache License 2.0
5 stars, 0 forks
issues
Size mismatch
#24
DuyguA
closed
7 months ago
4
harmful pp
#23
jiangix-paper
closed
1 year ago
1
Missing pages in Common Crawl
#22
hadiasghari
opened
1 year ago
0
OSCAR 22.XX scope
#21
Uinelj
opened
2 years ago
3
How to load oscar data for specific language on hugging-face
#20
IsraelAbebe
closed
2 years ago
5
How much data is common between the two OSCAR versions?
#19
ibraheem-moosa
opened
2 years ago
2
OSCAR-2109 huggingface datasets are misaligned and truncated
#18
adrianeboyd
opened
2 years ago
14
Dataset name Issue in model card at Huggingface
#17
ibraheem-moosa
opened
2 years ago
0
Low size of Swahili Oscar
#16
hadyelsahar
opened
2 years ago
0
Vietnamese language: text and meta/warc-target-uri mismatched
#15
Luvata
opened
2 years ago
2
Scots language corpus is non linguistic?
#14
Uinelj
opened
2 years ago
0
Quality warning: Neapolitan
#13
Uinelj
opened
2 years ago
0
Quality warning: Somali
#12
Uinelj
opened
2 years ago
0
Quality warning: Northern Frisian
#11
Uinelj
opened
2 years ago
0
Quality warning: Chavacano
#10
Uinelj
opened
2 years ago
0
Quality warning: Central Bikol
#9
Uinelj
closed
2 years ago
1
ConnectionError: Couldn't reach https://huggingface.co/datasets/oscar-corpus/OSCAR-2109/resolve/main/OSCAR-2109.py
#8
TDehaene
closed
2 years ago
7
West Flemish contains only two words
#7
Uinelj
opened
2 years ago
1
Wu Chinese dataset is of bad quality.
#5
Uinelj
opened
2 years ago
0
3835 records full of backslashes
#4
stas00
opened
2 years ago
1
Tajik language contains large chunks of Uzbek sentences in Cyrillic script.
#6
Muhtasham
opened
3 years ago
0
[BUG] Encoding errors in OSCAR 21.09
#2
stefan-it
opened
3 years ago
3
strange datasets for Yue Chinese corpus
#1
cosmeowpawlitan
opened
3 years ago
2
Support for Tigrinya
#3
tadeze
opened
4 years ago
0