huggingface
/
datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0
1.97k
stars
139
forks
issues
Newest
Fix languages listify bug
#294
BramVanroy
opened
15 hours ago
1
Add several open-source text extraction libraries
#293
garrethlee
opened
3 days ago
0
SLURM cannot achieve cross-node parallelism
#292
ShayDuane
opened
4 days ago
0
SLURM Executor cannot achieve cross-node parallelism
#291
ShayDuane
closed
4 days ago
0
Multi-node parallelism on slurm clusters
#290
shizhediao
opened
1 week ago
4
Naming Gopher's "max_non_alpha_words_ratio"
#289
BramVanroy
opened
1 week ago
0
Question about language filter
#288
shizhediao
closed
1 week ago
1
Error in Stats
#287
crisgarrillo
closed
1 week ago
0
Fixed a bug in the reader pipeline where the document count is always less than the actual number of documents by the number of files.
#286
lyuwen
closed
2 weeks ago
6
Better multilingual support
#285
guipenedo
opened
3 weeks ago
0
Use spaCy tokenizer for Dutch
#284
BramVanroy
opened
3 weeks ago
3
[FEATURE] CC Filter
#283
BramVanroy
opened
3 weeks ago
0
Add `job_id_position` Parameter to `launch_slurm_job` Method
#282
StephenRebel
opened
3 weeks ago
0
RuntimeError: An attempt has been made to start a new process before the
#281
shizhediao
closed
3 weeks ago
1
Readme nits
#280
hynky1999
closed
3 weeks ago
0
Hangs Due to `std::length_error` in Data Processing Pipeline
#279
justHungryMan
closed
6 days ago
10
JsonlReader fails with `gzip.BadGzipFile: CRC check failed`, but gzip doesn't seem to be bad
#278
nelson-liu
opened
1 month ago
2
UnicodeDecodeError When Using Korean Tokenizer
#277
justHungryMan
closed
6 days ago
6
Update README.md
#276
shizhediao
closed
1 month ago
1
TypeError: datasets.load.load_dataset() argument after ** must be a mapping, not NoneType
#275
shizhediao
closed
1 month ago
1
Update filter_hf_dataset.py
#274
shizhediao
closed
1 month ago
1
blank lines treated as duplicates in the GopherFilter
#273
lfoppiano
closed
1 month ago
4
Can you make a release with the latest batch version?
#272
nldxtd
closed
1 month ago
1
Video support for datatrove
#271
guipenedo
opened
1 month ago
0
Filter very slow
#270
hiennm15
opened
1 month ago
5
example for decontamination
#269
jordane95
opened
1 month ago
0
Add expand_metadata Option to JsonlWriter
#268
justHungryMan
closed
1 month ago
1
Clarification Needed on 'Tasks' Terminology in Datatrove
#267
eltonjohnfanboy
closed
1 month ago
2
How to look into the processed data?
#266
shizhediao
opened
1 month ago
3
Incorrect Job ID Extraction on Clusters with Custom Slurm Output
#265
StephenRebelSSC
opened
1 month ago
2
Fix `test_basic_article_trafilatura` test failure
#264
tylerjthomas9
closed
1 month ago
1
Optimize parquet output for remote reading
#263
H-Plus-Time
opened
1 month ago
2
Is there a plan to process data in sft format
#262
simplew2011
opened
2 months ago
1
Sentence deduplication output
#261
solene-evain
closed
1 month ago
6
Reading tokenized datasets
#260
davidbrandfonbrener
closed
2 months ago
2
Option to Not Upload to Hugging Face Hub in HuggingFaceDatasetWriter
#259
justHungryMan
closed
3 weeks ago
4
no qos leading to invalid qos specification
#258
solene-evain
closed
1 month ago
6
fix correct type inference for cached filesystems
#257
hynky1999
closed
2 months ago
0
boto timeout when I read CC with Warc
#256
marcopasqua
closed
1 month ago
1
Add Dataset type parameters
#255
aiqwe
closed
3 weeks ago
2
How about adding custom word_tokenizers?
#254
aiqwe
opened
2 months ago
0
Simple enhancement for readability
#253
aiqwe
closed
1 month ago
1
Update MinhashConfig with detailed settings and add default language …
#252
justHungryMan
closed
1 month ago
1
Add token and char count to histogram stats
#251
guipenedo
closed
2 months ago
0
Fix SENTINEL cluster
#250
jordane95
opened
2 months ago
1
Potential issue of dedup in index
#249
jordane95
opened
2 months ago
0
solved: how to launch a slurm executor from an interactive slurm job
#248
stas00
opened
2 months ago
0
GopherQualityFilter.__init__() got an unexpected keyword argument 'language' on kaggle
#247
williamlin0518
closed
2 months ago
2
adding an automatic log file `tail` under slurm executor
#246
stas00
opened
2 months ago
1
Slurm Cluster Size for FineWeb
#245
axelmagn
opened
2 months ago
0