Hi @humzaiqbal - sorry for the late response!
I can't seem to reproduce this with these versions of ray and pyarrow - could you share a pip freeze of your environment? Thanks!
So I actually did a bit more digging, and the issue seems to arise when some of the documents are empty. Here's an excerpt of a processed document from Common Crawl that I saw:
{"text": ["\u0427\u0442\u043e \u0437\u043d\u0430\u0447\u0438\u0442 \u043d\u043e\u0441\u0438\u0442\u0435\u043b\u044c \u0432\u0438\u0440\u0443\u0441\u0430 \u0433\u0435\u043f\u0430\u0442\u0438\u0442\u0430 \u0441?","", "", "", "", ""]}
I grabbed the tokens and they looked like this:
{'tokens': array([list([[140, 102, 6205, 8793, 39178, 7620, 4418, 13043, 43250, 4407, 1389, 16828, 1674, 1152, 13192, 31220, 7065, 7620, 1152, 4189, 32], [], [], [], [], []])])}
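Each empty string in the text list tokenizes to an empty token list. A quick illustration with tiktoken (purely illustrative - this isn't necessarily the tokenizer the pipeline uses):

```python
import tiktoken

# Illustrative only: any standard tokenizer maps an empty string to an
# empty list of token ids, which is what shows up in the output above.
enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode("some non-empty line"))  # a non-empty list of token ids
print(enc.encode(""))                     # [] -- empty string, empty token list
```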
Here's the pip freeze as well:
absl-py==1.4.0
accelerate==0.25.0
aiobotocore==2.5.2
aiohttp==3.9.5
aiohttp-cors==0.7.0
aioitertools==0.11.0
aioredis==1.3.1
aiosignal==1.3.1
alembic==1.13.2
aniso8601==9.0.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.4.0
apache-libcloud==3.8.0
argcomplete==3.4.0
arrow==1.3.0
asttokens==2.4.1
astunparse==1.6.3
async-timeout==4.0.2
attrs==23.2.0
azure-core==1.30.2
azure-identity==1.17.1
azure-storage-blob==12.20.0
azure-storage-file-datalake==12.15.0
backoff==2.2.1
bcrypt==4.1.3
beautifulsoup4==4.12.3
blessed==1.20.0
blingfire==0.1.8
bokeh @ file:///home/conda/feedstock_root/build_artifacts/bokeh_1652969581850/work
boto3==1.26.145
botocore==1.29.161
braceexpand==0.1.7
Brotli==1.0.9
brotlipy @ file:///home/conda/feedstock_root/build_artifacts/brotlipy_1648854164373/work
cached-property==1.5.2
cachetools @ file:///home/conda/feedstock_root/build_artifacts/cachetools_1640686991047/work
catalogue==2.0.10
certifi==2024.7.4
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1656782830073/work
cfgv==3.4.0
charset-normalizer==3.3.2
chex==0.1.5
circuitbreaker==1.4.0
click==8.1.7
clip @ git+https://github.com/openai/CLIP.git@d50d76daa670286dd6cacf3bcd80b5e4823fc8e1
cloudpathlib==0.18.1
cloudpickle @ file:///home/conda/feedstock_root/build_artifacts/cloudpickle_1653061851209/work
clu==0.0.8
cmake==3.27.5
coloredlogs==15.0.1
colorful==0.5.6
contextlib2==21.6.0
contourpy==1.2.1
coolname==2.2.0
cramjam==2.8.3
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography_1657173995269/work
cuda-python @ file:///opt/conda/conda-bld/cuda-python_1650876346274/work
cudf==22.6.1
cuml==22.6.1
cupy @ file:///home/conda/feedstock_root/build_artifacts/cupy_1656823137380/work
cycler==0.11.0
Cython==0.29.33
cytoolz @ file:///home/conda/feedstock_root/build_artifacts/cytoolz_1657553452326/work
dash==2.5.1
dash-bootstrap-components==1.1.0
dash-core-components==2.0.0
dash-daq==0.5.0
dash-html-components==2.0.0
dash-table==5.0.0
dask @ file:///home/conda/feedstock_root/build_artifacts/dask-core_1653603260862/work
dask-cuda==22.6.0
dask-cudf==22.6.1
databricks-sdk==0.28.0
datasets==2.19.2
decorator==5.1.1
Deprecated==1.2.14
dill==0.3.8
distlib==0.3.6
distributed @ file:///home/conda/feedstock_root/build_artifacts/distributed_1653607935301/work
dm-tree==0.1.8
dnspython==2.6.1
docker==7.1.0
docker-pycreds==0.4.0
einops==0.7.0
email_validator==2.2.0
entrypoints==0.4
etils==1.0.0
exceptiongroup==1.2.1
executing==2.0.1
faiss-gpu==1.7.2
Farama-Notifications==0.0.4
fastapi==0.111.0
fastapi-cli==0.0.4
fastavro @ file:///home/conda/feedstock_root/build_artifacts/fastavro_1658266249148/work
fastrlock==0.8
fasttext==0.9.3
filelock==3.15.4
fire==0.6.0
Flask==2.1.3
Flask-Compress==1.12
flatbuffers==23.1.21
flax==0.6.4
fonttools==4.34.4
frozenlist==1.4.1
fsspec==2023.6.0
ftfy==6.2.0
gast==0.4.0
gitdb==4.0.11
GitPython==3.1.43
google-api-core==2.8.2
google-auth==2.16.0
google-auth-oauthlib==0.4.6
google-cloud-core==2.4.1
google-cloud-storage==2.10.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.1
googleapis-common-protos==1.58.0
gpustat==1.0.0
gql==3.5.0
graphene==3.3
graphql-core==3.2.3
graphql-relay==3.2.0
greenlet==3.0.3
grpcio==1.43.0
gunicorn==22.0.0
gymnasium==0.28.1
h11==0.14.0
h5py==3.8.0
HeapDict==1.0.1
hiredis==2.2.3
horovod==0.28.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.22.0
humanfriendly==10.0
identify==2.6.0
idna==3.7
imageio==2.25.0
immutabledict==2.2.3
importlib-metadata==6.11.0
importlib-resources==5.10.2
iniconfig==2.0.0
ipython==8.18.1
isodate==0.6.1
itsdangerous==2.1.2
jax==0.4.2
jax-jumpy==1.0.0
jaxlib==0.4.2+cuda11.cudnn86
jedi==0.19.1
Jinja2==3.1.3
jmespath==1.0.1
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1633637554808/work
jsonlines==4.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jusText==3.0.1
kenlm @ https://github.com/kpu/kenlm/archive/master.zip#sha256=9aca61fb9df045ad86203e04b750e787403dfe4d7b86b3e99173a29f5d12d3c6
keras==2.11.0
kiwisolver==1.4.4
langdetect==1.0.9
libclang==15.0.6.1
lightning-utilities==0.11.3.post0
linkify-it-py==2.0.3
lit==16.0.6
llm-foundry==0.10.0
llvmlite==0.38.1
locket @ file:///home/conda/feedstock_root/build_artifacts/locket_1650660393415/work
loguru==0.7.2
lvis==0.5.3
lxml==5.2.2
lxml_html_clean==0.1.1
lz4 @ file:///home/conda/feedstock_root/build_artifacts/lz4_1652795536065/work
Mako==1.3.5
Markdown==3.4.1
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.1
matplotlib-inline==0.1.3
mdit-py-plugins==0.4.1
mdurl==0.1.2
memray==1.13.3
ml-collections==0.1.1
mlflow==2.14.2
mosaicml==0.23.5
mosaicml-cli==0.6.37
mosaicml-streaming==0.7.6
moto==5.0.11
mpmath==1.3.0
msal==1.29.0
msal-extensions==1.2.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
networkx==3.2.1
nltk==3.8.1
nodeenv==1.9.1
numba @ file:///home/conda/feedstock_root/build_artifacts/numba_1655473307261/work
numpy==1.24.4
nvidia-cublas-cu11==11.10.3.66
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu11==8.5.0.96
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu11==10.2.10.91
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu11==11.7.4.91
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==11.495.46
nvidia-nccl-cu11==2.14.3
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu11==11.7.91
nvidia-nvtx-cu12==12.1.105
nvtx @ file:///home/conda/feedstock_root/build_artifacts/nvtx_1637264773680/work
oauthlib==3.2.2
oci==2.129.1
omegaconf==2.3.0
onnx==1.14.0
onnxruntime==1.15.1
open-clip-torch==2.24.0
open_lm @ git+https://github.com/mlfoundations/open_lm.git@b84ddbec9b058531177ef05966437ac81c0b40d5
opencensus==0.11.4
opencensus-context==0.1.3
opencv-python==4.7.0.68
opentelemetry-api==1.25.0
opentelemetry-exporter-otlp==1.25.0
opentelemetry-exporter-otlp-proto-common==1.25.0
opentelemetry-exporter-otlp-proto-grpc==1.25.0
opentelemetry-exporter-otlp-proto-http==1.25.0
opentelemetry-proto==1.25.0
opentelemetry-sdk==1.25.0
opentelemetry-semantic-conventions==0.46b0
opt-einsum==3.3.0
optax @ git+https://github.com/deepmind/optax.git@d9157132b8661e1453d156f591784ff27d6d1141
orbax==0.1.1
orjson==3.10.6
ott-jax==0.3.1
packaging==24.1
pandas==2.1.4
paramiko==3.4.0
parso==0.8.4
partd @ file:///home/conda/feedstock_root/build_artifacts/partd_1617910651905/work
pexpect==4.9.0
pillow==10.4.0
platformdirs==3.5.1
plotly==5.6.0
pluggy==1.5.0
portalocker==2.10.0
pre-commit==3.7.1
prometheus_client==0.20.0
promise==2.3
prompt_toolkit==3.0.47
protobuf==4.25.3
psutil==5.9.5
ptxcompiler @ file:///datasets/bzaitlen/miniconda3/conda-bld/ptxcompiler_1643206592709/work
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
py-spy==0.3.14
pyarrow==15.0.2
pyarrow-hotfix==0.6
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.13.1
pycocotools==2.0
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.14.0
PyJWT==2.8.0
PyNaCl==1.5.0
pynvml==11.5.0
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1643496850550/work
pyparsing @ file:///home/conda/feedstock_root/build_artifacts/pyparsing_1652235407899/work
pyrsistent==0.19.3
pysimdjson==6.0.2
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1648857263093/work
pytest==8.2.2
pytest-timeout==2.3.1
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
python-dotenv==1.0.1
python-multipart==0.0.9
python-snappy==0.7.2
pytorch-ranger==0.1.1
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1647961439546/work
PyWavelets==1.4.1
PyYAML==6.0.1
querystring-parser==1.2.4
questionary==1.10.0
raft==22.6.0
ray==2.31.0
ray-cpp==2.31.0
referencing==0.35.1
regex==2023.12.25
requests==2.32.3
requests-oauthlib==1.3.1
responses==0.25.3
retrie==0.3.1
rich==13.7.1
rmm==21.12.0
rpds-py==0.19.0
rsa==4.9
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
s3fs==2023.6.0
s3transfer==0.6.1
safetensors==0.4.2
scenic @ file:///home/ubuntu/scenic
scikit-image==0.19.3
scikit-learn==1.1.1
scipy @ file:///home/conda/feedstock_root/build_artifacts/scipy_1658810968466/work
sentencepiece==0.1.97
sentry-sdk==2.8.0
setproctitle==1.3.3
shellingham==1.5.4
six==1.16.0
slack_sdk==3.31.0
smart-open==7.0.4
smmap==5.0.1
sniffio==1.3.1
sortedcontainers @ file:///home/conda/feedstock_root/build_artifacts/sortedcontainers_1621217038088/work
soupsieve==2.5
SQLAlchemy==2.0.31
sqlparse==0.5.0
stack-data==0.6.3
starlette==0.37.2
sympy==1.12
tabulate==0.9.0
tblib @ file:///home/conda/feedstock_root/build_artifacts/tblib_1616261298899/work
tenacity==8.5.0
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorboardX==2.6.2.2
tensorflow==2.11.0
tensorflow-addons==0.19.0
tensorflow-datasets==4.8.2
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.30.0
tensorflow-metadata==1.12.0
tensorstore==0.1.31
termcolor==2.2.0
textual==0.71.0
threadpoolctl==3.1.0
tifffile==2023.2.2
tiktoken==0.7.0
timm==0.9.16
tokenizers==0.19.1
toml==0.10.2
tomli==2.0.1
toolz @ file:///home/conda/feedstock_root/build_artifacts/toolz_1657485559105/work
torch==2.3.0
torch-optimizer==0.3.0
torchaudio==2.0.2
torchmetrics==1.3.2
torchvision==0.18.0
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1648827245914/work
tqdm==4.66.2
traitlets==5.3.0
transformers==4.40.2
treelite==2.4.0
treelite-runtime==2.4.0
triton==2.3.0
typeguard==2.13.3
typer==0.12.3
types-python-dateutil==2.9.0.20240316
typing_extensions==4.12.2
tzdata==2024.1
uc-micro-py==1.0.3
ucx-py @ file:///opt/conda/envs/rapids/conda-bld/work
ujson==5.10.0
Unidecode==1.3.8
uniseg==0.8.0
urllib3==1.26.19
uvicorn==0.30.1
uvloop==0.19.0
validators==0.31.0
virtualenv==20.21.0
wandb==0.17.4
watchfiles==0.22.0
wcwidth==0.2.13
webdataset==0.2.48
websockets==11.0.3
Werkzeug==3.0.3
wikipedia==1.4.0
wrapt==1.14.1
xformers==0.0.26.post1
xmltodict==0.13.0
xxhash==3.4.1
yarl==1.9.4
zict @ file:///home/conda/feedstock_root/build_artifacts/zict_1651156074437/work
zipp==3.8.1
zstandard==0.22.0
zstd==1.5.5.1
Interesting - given the above, it's probably an issue with the data rather than the environment.
In the example you provide, it seems that the text field points to a list of documents - is that right? Each line in the processed files should be a JSON object corresponding to a single document, where the text field is a single string.
Hey @GeorgiosSmyrnis, apologies for the delay here, but I managed to root-cause this. It looks like this happened when I ran with the C4 yaml: there is a function called split_line_modifier which splits the text on newlines. When we have text like the following (taken from a raw doc in the dataset):
"view cart menu separator categories menu separator faq\nadvanced search\ncategories \u00a0>\u00a0Handmade Shea Butter Soap (11)\nHandmade Cool Peppermint Shea Butter Soap 4oz.\n\u00a0\n\nHandmade Cool Peppermint Shea Butter Soap 4oz.\n\nPrice: $4.00 add to cart"
when we split on newlines we get:
['view cart menu separator categories menu separator faq', 'advanced search', 'categories \xa0>\xa0Handmade Shea Butter Soap (11)', 'Handmade Cool Peppermint Shea Butter Soap 4oz.', '\xa0', '', 'Handmade Cool Peppermint Shea Butter Soap 4oz.', '', 'Price: $4.00 add to cart']
You'll notice some of the entries are the empty string ''. As a result we save those as part of the processed JSON, and when we then call tokenize on it we get empty lists, which triggers the output above. Wondering what makes sense here - my guess would be that before we save the processed JSON it would be good to get rid of any blank strings. Do you think that makes sense to add to the code as a general post-processing tool? If so I'm happy to put up a PR, or if not I may just do this on my own end.
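For concreteness, here's a minimal sketch of the kind of post-processing I have in mind (the function name and the record shape are my own, not from the repo):

```python
import json

def drop_blank_strings(record: dict) -> dict:
    # Hypothetical post-processing step (not from the repo): remove empty or
    # whitespace-only entries from the "text" list before the record is saved.
    record["text"] = [line for line in record["text"] if line.strip()]
    return record

# Usage sketch: filter each processed record before writing it out.
record = {"text": ["advanced search", "", "\xa0", "Price: $4.00 add to cart"]}
print(json.dumps(drop_blank_strings(record)))
# {"text": ["advanced search", "Price: $4.00 add to cart"]}
```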
Hi @humzaiqbal,
The C4 reproduction yaml that we provide contains a join_lines_modifier function at a later stage, which combines the text back into one string, and a page length filter later on which should filter out remaining empty strings. Does the C4 processing complete without any issues? If so, you should not have the above output in your final JSON.
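Roughly, the intent of those later stages is as follows (simplified sketches with assumed signatures and thresholds, not the actual repo code):

```python
def join_lines(record: dict, sep: str = "\n") -> dict:
    # Sketch of what a join_lines_modifier-style step does: collapse the list
    # of lines back into a single string, so "text" ends up as one str again.
    record["text"] = sep.join(record["text"])
    return record

def passes_page_length_filter(record: dict, min_chars: int = 100) -> bool:
    # Sketch of a page length filter (threshold is made up here): drop pages
    # whose final text is too short, which also removes empty pages.
    return len(record["text"]) >= min_chars
```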
In any case, I think that if the text field in the final JSON object is a list of strings and not a single string, then that would cause issues - so given the output you shared above, that might be the case.
I notice that it doesn't actually go the whole way through. What happens is that it stops around the first exact_dedup step here - the reason being that the code checks whether the step's function is in something called GLOBAL_FUNCTIONS, which only contains exact_dedup, and when that check is met we break. That means the other steps you mention above aren't being triggered, which may be the problem.
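A simplified sketch of the control flow I'm describing (names mirror the ones above, but the actual loop in the repo is more involved):

```python
GLOBAL_FUNCTIONS = {"exact_dedup"}  # per the code, this only contains exact_dedup

def run_local_steps(steps, record):
    for step in steps:
        if step["func"] in GLOBAL_FUNCTIONS:
            # The current code breaks here, so every step listed after the
            # first global step is silently skipped for this record.
            break
        record = step["apply"](record)  # hypothetical way of invoking a step
    return record
```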
I did a test where I changed the break to a continue, and my text output definitely looks better:
{"text": "The same neocons who persuaded George W. Bush and crew to, in Ron Paul\u2019s inimitable words, \u201clie their way into invading Iraq\u201d in 2003, are beating the drums of war more loudly these days to attack Iran. It is remarkable how many of these war-mongers are former draft dodgers who wanted other Americans to fight the war in Vietnam.\n\nWith the exception of Ron Paul, who actually knows the history of US-Iranian relations, the Republican presidential contenders have declared their belligerency toward Iranian officials who they accuse of moving toward nuclear weapons.\n\nWhile many western and some Arab countries in the Gulf region have condemned Iran\u2019s alleged nuclear arms quest, Israel maintains some 200 ready nuclear weapons and has refused to sign the non-proliferation treaty, thereby avoiding the IAEA inspectors.\n\nIsraelis in the know have much to say. Defense minister, Ehud Barak
Hi @humzaiqbal,
My apologies for the late response - I am looking into this issue and will update you when I have a solution. It seems the underlying problem is the exact dedup step not being run properly.
Can you point me to the file that causes this?
Thanks!
@humzaiqbal I think I actually identified the issue and pushed a small fix in #23 - if you are using the HF provided data then this should make the entire process work without changes.
Please let me know if there are still issues!
Gotcha, thanks! One question: while I was waiting for this to be resolved, I wound up making a fix myself - changing the break to a continue - which seemed to get the processed files into the right format and finish processing. What kind of impact might this have? I'm wondering if I need to rerun processing.
I think that changing that break to a continue essentially skips the exact_dedup steps defined in the yaml file (so the exact duplicate filters are not applied). I believe that this does affect results, unfortunately - sorry for the inconvenience!
Gotcha, makes sense - thanks! Will close this issue since the original problem is resolved.