togethercomputer RedPajama-Data issues

togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Apache License 2.0

4.53k stars 346 forks

issues

ArXiv cleaning issue

#68 hicotton02 closed 11 months ago
1
Failed building wheel for cc-net

#67 hicotton02 closed 11 months ago
2
Unlock open science for dataset generation

#66 AbcSxyZ opened 1 year ago
0
I got an issue when I use fasttext doing arxiv cleaning.

#65 tangtianyi1998 opened 1 year ago
1
We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet

#64 shawn0wang closed 1 month ago
1
How do I prepare the data for visualisation?

#63 dittops opened 1 year ago
2
Understanding the quality filter

#62 yonatanbitton closed 1 month ago
5
Fine tuning RedPajama Model

#61 adjhawar closed 1 month ago
1
Single machine download script and downloaded files check

#60 MIracleyin opened 1 year ago
0
Add arXiv download script and downloaded file check

#59 MIracleyin closed 1 year ago
0
Result does not contain raw content

#58 newbietuan closed 1 year ago
0
What is the cutoff.csv file mentioned in data_prep/cc/cc_net/cc_net/mine.py?

#57 julienliang2740 closed 1 year ago
2
Fixed data preparation for stack_exchange

#56 Taishi-N324 opened 1 year ago
0
Memory requirement for book deduplication?

#55 HYLcool closed 1 year ago
2
error while download from url

#54 guozhiyao closed 1 month ago
1
No file named github-prepare-local-dedup.sh

#53 feverdreamy closed 1 year ago
1
Drive space to store

#52 tstandley closed 1 year ago
1
how to process arXiv tex files without downloading?

#51 irene622 closed 1 year ago
1
The left portion of the dataset after each process

#50 kimcando opened 1 year ago
0
Any forecast for the release of v2 of the dataset?

#49 vince62s closed 1 month ago
0
Overlap between Common Crawl and C4

#48 codesoap closed 1 year ago
6
how much disk memory will be used?

#47 newbietuan opened 1 year ago
3
Language diversity

#46 averkij closed 1 year ago
3
Partially downloaded datasets

#45 soboleva-daria closed 1 month ago
1
The training data for Quality Classifier

#44 Anery closed 1 year ago
4
about download a small portion of cc

#43 newbietuan closed 1 year ago
2
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 2556: invalid start byte

#42 Anery closed 1 year ago
1
fix bugs in hash using getpy

#41 feifeibear closed 1 year ago
0
Question about Size Difference of arXiv Data in RedPajama and AWS S3

#40 Mr-Philo closed 1 year ago
2
Common Crawl metadata

#39 mauriceweber closed 1 month ago
0
add missing files to run cc_net with a given config

#38 feifeibear opened 1 year ago
0
Questions about the quality classifier in common crawl

#37 ladit closed 1 month ago
8
fixed some errors in Makefile for lm preparation

#36 feifeibear opened 1 year ago
2
Expected finish time for processing one single index of commoncrawl?

#35 kimcando closed 1 month ago
4
EOFError: Compressed file ended before the end-of-stream marker was reached

#34 kimcando closed 1 year ago
2
Memory and space requirements

#33 zetian1025 closed 1 year ago
2
Improve CC readme and add URL to download classifier weights

#32 Ivan-Zhou closed 1 year ago
0
How can you map the common crawl source back to metadata?

#31 craigschmidt closed 1 year ago
2
If the program exit with the outside cause

#30 1787648106 closed 1 year ago
1
fix clean copyright

#29 hust-nj opened 1 year ago
2
[Errno 2] No such file or directory: 'cutoff.csv'

#28 Anery closed 1 year ago
5
Add missing script or update README.md

#27 geoffreydstewart closed 1 year ago
3
How are the 5 dumps of Common Crawl selected?

#26 Stanislas0 closed 1 year ago
2
Script fixes in data_prep/github

#25 geoffreydstewart closed 1 year ago
1
where is the FastText pretrained model to classify each CommonCrawl webpage

#24 yuhai-china closed 1 year ago
4
Got error while running `python -m cc_net -l my -l gu`

#23 tiendung closed 1 year ago
9
Please consider adding a source of natural dialogue data

#22 wassname closed 1 year ago
2
`No module named 'datasets'` in `data_prep/book/`

#21 danielpclark closed 1 year ago
5
Fix typo in github/README.md

#20 eltociear closed 1 year ago
1
will there be a trained model?

#19 rozek closed 1 year ago
2
