mlfoundations / dclm

DataComp for Language Models
MIT License

ArrowConversionError when running tokenization #20

Closed: humzaiqbal closed this issue 3 months ago

humzaiqbal commented 4 months ago

When running tokenization I get this:


Traceback (most recent call last):
  File "/home/ubuntu/DCLM/ray_processing/tokenize_shuffle.py", line 114, in <module>
    main(args, DCNLP_ARGS)
  File "/home/ubuntu/DCLM/ray_processing/tokenize_shuffle.py", line 100, in main
    tokenize_shuffle.main(tokenize_shuffle_args)
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/open_lm/datapreprocess/ray/tokenize_shuffle.py", line 693, in main
    write_status = ds.map_batches(
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/dataset.py", line 2425, in take_all
    for row in self.iter_rows():
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/iterator.py", line 244, in _wrapped_iterator
    for batch in batch_iterable:
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/iterator.py", line 161, in _create_iterator
    block_iterator, stats, blocks_owned_by_consumer = self._to_block_iterator()
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/_internal/iterator/iterator_impl.py", line 33, in _to_block_iterator
    block_iterator, stats, executor = ds._plan.execute_to_iterator()
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/exceptions.py", line 86, in handle_trace
    raise e.with_traceback(None) from SystemException()
ray.exceptions.RayTaskError(ArrowConversionError): ray::FlatMap(<lambda>)->Map(add_hash)() (pid=8490, ip=10.1.16.183)
  File "pyarrow/table.pxi", line 1813, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5347, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 373, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 339, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: cannot mix list and non-list, non-null values

The above exception was the direct cause of the following exception:

ray::FlatMap(<lambda>)->Map(add_hash)() (pid=8490, ip=10.1.16.183)
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 438, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 393, in __call__
    add_fn(data)
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/_internal/output_buffer.py", line 43, in add
    self._buffer.add(item)
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/_internal/delegating_block_builder.py", line 38, in add
    self._builder.add(item)
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/_internal/table_block.py", line 86, in add
    self._compact_if_needed()
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/_internal/table_block.py", line 155, in _compact_if_needed
    block = self._table_from_pydict(columns)
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/data/_internal/arrow_block.py", line 146, in _table_from_pydict
    return pyarrow_table_from_pydict(columns)
  File "/home/ubuntu/miniconda3/envs/rapids-22.06/lib/python3.9/site-packages/ray/air/util/tensor_extensions/arrow.py", line 84, in pyarrow_table_from_pydict
    raise ArrowConversionError(str(pydict)) from e
ray.air.util.tensor_extensions.arrow.ArrowConversionError: Error converting data to Arrow: {'tokens': array([list([[11586, 434, 6096, 4096, 310, 346, 27642, 272, 6491, 15, 914, 2996, 253, 3645, 449, 844, 403, 625, 7106, 685, 2455, 327, 776, 11847, 281, 6383, 949, 247, 2962, 273, 32771, 285,...
/home/ubuntu/research_nfs/humza/processed_data/c4/processed_data/CC_shard_00001367_processed.jsonl.zst: : 53828it [00:54, 991.68it/s]

If it helps, here are some relevant package versions:

ray==2.31.0
pyarrow==15.0.2

GeorgiosSmyrnis commented 4 months ago

Hi @humzaiqbal - sorry for the late response!

I can't seem to reproduce this with these versions of ray and pyarrow - could you share a pip freeze of your environment? Thanks!

humzaiqbal commented 4 months ago

So I actually did a bit more digging, and the issue seems to arise when some of the documents are empty. Here's an excerpt of a processed Common Crawl document I saw:

{"text": ["\u0427\u0442\u043e \u0437\u043d\u0430\u0447\u0438\u0442 \u043d\u043e\u0441\u0438\u0442\u0435\u043b\u044c \u0432\u0438\u0440\u0443\u0441\u0430 \u0433\u0435\u043f\u0430\u0442\u0438\u0442\u0430 \u0441?","", "", "", "", ""}

I grabbed the tokens and they looked like:

{'tokens': array([list([[140, 102, 6205, 8793, 39178, 7620, 4418, 13043, 43250, 4407, 1389, 16828, 1674, 1152, 13192, 31220, 7065, 7620, 1152, 4189, 32], [], [], [], [], []}
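
For reference, here is a hypothetical minimal reproduction of the same pyarrow error (made-up data, not the actual DCLM output). The error shows up whenever list and non-list values end up at the same nesting level, which seems to be what happens when some documents tokenize to nested lists and others don't:

import pyarrow as pa

# Made-up data: one value is a plain int, the other a list. In the real run the
# mix presumably happens one level deeper (flat token lists vs. lists of
# per-line token lists), but the failure mode is the same.
try:
    pa.array([11586, [140, 102]])
except pa.ArrowInvalid as e:
    print(e)  # should print something like: cannot mix list and non-list, non-null values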

Here's a pip freeze, though:

absl-py==1.4.0
accelerate==0.25.0
aiobotocore==2.5.2
aiohttp==3.9.5
aiohttp-cors==0.7.0
aioitertools==0.11.0
aioredis==1.3.1
aiosignal==1.3.1
alembic==1.13.2
aniso8601==9.0.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.4.0
apache-libcloud==3.8.0
argcomplete==3.4.0
arrow==1.3.0
asttokens==2.4.1
astunparse==1.6.3
async-timeout==4.0.2
attrs==23.2.0
azure-core==1.30.2
azure-identity==1.17.1
azure-storage-blob==12.20.0
azure-storage-file-datalake==12.15.0
backoff==2.2.1
bcrypt==4.1.3
beautifulsoup4==4.12.3
blessed==1.20.0
blingfire==0.1.8
bokeh @ file:///home/conda/feedstock_root/build_artifacts/bokeh_1652969581850/work
boto3==1.26.145
botocore==1.29.161
braceexpand==0.1.7
Brotli==1.0.9
brotlipy @ file:///home/conda/feedstock_root/build_artifacts/brotlipy_1648854164373/work
cached-property==1.5.2
cachetools @ file:///home/conda/feedstock_root/build_artifacts/cachetools_1640686991047/work
catalogue==2.0.10
certifi==2024.7.4
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1656782830073/work
cfgv==3.4.0
charset-normalizer==3.3.2
chex==0.1.5
circuitbreaker==1.4.0
click==8.1.7
clip @ git+https://github.com/openai/CLIP.git@d50d76daa670286dd6cacf3bcd80b5e4823fc8e1
cloudpathlib==0.18.1
cloudpickle @ file:///home/conda/feedstock_root/build_artifacts/cloudpickle_1653061851209/work
clu==0.0.8
cmake==3.27.5
coloredlogs==15.0.1
colorful==0.5.6
contextlib2==21.6.0
contourpy==1.2.1
coolname==2.2.0
cramjam==2.8.3
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography_1657173995269/work
cuda-python @ file:///opt/conda/conda-bld/cuda-python_1650876346274/work
cudf==22.6.1
cuml==22.6.1
cupy @ file:///home/conda/feedstock_root/build_artifacts/cupy_1656823137380/work
cycler==0.11.0
Cython==0.29.33
cytoolz @ file:///home/conda/feedstock_root/build_artifacts/cytoolz_1657553452326/work
dash==2.5.1
dash-bootstrap-components==1.1.0
dash-core-components==2.0.0
dash-daq==0.5.0
dash-html-components==2.0.0
dash-table==5.0.0
dask @ file:///home/conda/feedstock_root/build_artifacts/dask-core_1653603260862/work
dask-cuda==22.6.0
dask-cudf==22.6.1
databricks-sdk==0.28.0
datasets==2.19.2
decorator==5.1.1
Deprecated==1.2.14
dill==0.3.8
distlib==0.3.6
distributed @ file:///home/conda/feedstock_root/build_artifacts/distributed_1653607935301/work
dm-tree==0.1.8
dnspython==2.6.1
docker==7.1.0
docker-pycreds==0.4.0
einops==0.7.0
email_validator==2.2.0
entrypoints==0.4
etils==1.0.0
exceptiongroup==1.2.1
executing==2.0.1
faiss-gpu==1.7.2
Farama-Notifications==0.0.4
fastapi==0.111.0
fastapi-cli==0.0.4
fastavro @ file:///home/conda/feedstock_root/build_artifacts/fastavro_1658266249148/work
fastrlock==0.8
fasttext==0.9.3
filelock==3.15.4
fire==0.6.0
Flask==2.1.3
Flask-Compress==1.12
flatbuffers==23.1.21
flax==0.6.4
fonttools==4.34.4
frozenlist==1.4.1
fsspec==2023.6.0
ftfy==6.2.0
gast==0.4.0
gitdb==4.0.11
GitPython==3.1.43
google-api-core==2.8.2
google-auth==2.16.0
google-auth-oauthlib==0.4.6
google-cloud-core==2.4.1
google-cloud-storage==2.10.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.1
googleapis-common-protos==1.58.0
gpustat==1.0.0
gql==3.5.0
graphene==3.3
graphql-core==3.2.3
graphql-relay==3.2.0
greenlet==3.0.3
grpcio==1.43.0
gunicorn==22.0.0
gymnasium==0.28.1
h11==0.14.0
h5py==3.8.0
HeapDict==1.0.1
hiredis==2.2.3
horovod==0.28.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.22.0
humanfriendly==10.0
identify==2.6.0
idna==3.7
imageio==2.25.0
immutabledict==2.2.3
importlib-metadata==6.11.0
importlib-resources==5.10.2
iniconfig==2.0.0
ipython==8.18.1
isodate==0.6.1
itsdangerous==2.1.2
jax==0.4.2
jax-jumpy==1.0.0
jaxlib==0.4.2+cuda11.cudnn86
jedi==0.19.1
Jinja2==3.1.3
jmespath==1.0.1
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1633637554808/work
jsonlines==4.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jusText==3.0.1
kenlm @ https://github.com/kpu/kenlm/archive/master.zip#sha256=9aca61fb9df045ad86203e04b750e787403dfe4d7b86b3e99173a29f5d12d3c6
keras==2.11.0
kiwisolver==1.4.4
langdetect==1.0.9
libclang==15.0.6.1
lightning-utilities==0.11.3.post0
linkify-it-py==2.0.3
lit==16.0.6
llm-foundry==0.10.0
llvmlite==0.38.1
locket @ file:///home/conda/feedstock_root/build_artifacts/locket_1650660393415/work
loguru==0.7.2
lvis==0.5.3
lxml==5.2.2
lxml_html_clean==0.1.1
lz4 @ file:///home/conda/feedstock_root/build_artifacts/lz4_1652795536065/work
Mako==1.3.5
Markdown==3.4.1
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.1
matplotlib-inline==0.1.3
mdit-py-plugins==0.4.1
mdurl==0.1.2
memray==1.13.3
ml-collections==0.1.1
mlflow==2.14.2
mosaicml==0.23.5
mosaicml-cli==0.6.37
mosaicml-streaming==0.7.6
moto==5.0.11
mpmath==1.3.0
msal==1.29.0
msal-extensions==1.2.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
networkx==3.2.1
nltk==3.8.1
nodeenv==1.9.1
numba @ file:///home/conda/feedstock_root/build_artifacts/numba_1655473307261/work
numpy==1.24.4
nvidia-cublas-cu11==11.10.3.66
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu11==8.5.0.96
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu11==10.2.10.91
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu11==11.7.4.91
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==11.495.46
nvidia-nccl-cu11==2.14.3
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu11==11.7.91
nvidia-nvtx-cu12==12.1.105
nvtx @ file:///home/conda/feedstock_root/build_artifacts/nvtx_1637264773680/work
oauthlib==3.2.2
oci==2.129.1
omegaconf==2.3.0
onnx==1.14.0
onnxruntime==1.15.1
open-clip-torch==2.24.0
open_lm @ git+https://github.com/mlfoundations/open_lm.git@b84ddbec9b058531177ef05966437ac81c0b40d5
opencensus==0.11.4
opencensus-context==0.1.3
opencv-python==4.7.0.68
opentelemetry-api==1.25.0
opentelemetry-exporter-otlp==1.25.0
opentelemetry-exporter-otlp-proto-common==1.25.0
opentelemetry-exporter-otlp-proto-grpc==1.25.0
opentelemetry-exporter-otlp-proto-http==1.25.0
opentelemetry-proto==1.25.0
opentelemetry-sdk==1.25.0
opentelemetry-semantic-conventions==0.46b0
opt-einsum==3.3.0
optax @ git+https://github.com/deepmind/optax.git@d9157132b8661e1453d156f591784ff27d6d1141
orbax==0.1.1
orjson==3.10.6
ott-jax==0.3.1
packaging==24.1
pandas==2.1.4
paramiko==3.4.0
parso==0.8.4
partd @ file:///home/conda/feedstock_root/build_artifacts/partd_1617910651905/work
pexpect==4.9.0
pillow==10.4.0
platformdirs==3.5.1
plotly==5.6.0
pluggy==1.5.0
portalocker==2.10.0
pre-commit==3.7.1
prometheus_client==0.20.0
promise==2.3
prompt_toolkit==3.0.47
protobuf==4.25.3
psutil==5.9.5
ptxcompiler @ file:///datasets/bzaitlen/miniconda3/conda-bld/ptxcompiler_1643206592709/work
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
py-spy==0.3.14
pyarrow==15.0.2
pyarrow-hotfix==0.6
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.13.1
pycocotools==2.0
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.14.0
PyJWT==2.8.0
PyNaCl==1.5.0
pynvml==11.5.0
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1643496850550/work
pyparsing @ file:///home/conda/feedstock_root/build_artifacts/pyparsing_1652235407899/work
pyrsistent==0.19.3
pysimdjson==6.0.2
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1648857263093/work
pytest==8.2.2
pytest-timeout==2.3.1
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
python-dotenv==1.0.1
python-multipart==0.0.9
python-snappy==0.7.2
pytorch-ranger==0.1.1
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1647961439546/work
PyWavelets==1.4.1
PyYAML==6.0.1
querystring-parser==1.2.4
questionary==1.10.0
raft==22.6.0
ray==2.31.0
ray-cpp==2.31.0
referencing==0.35.1
regex==2023.12.25
requests==2.32.3
requests-oauthlib==1.3.1
responses==0.25.3
retrie==0.3.1
rich==13.7.1
rmm==21.12.0
rpds-py==0.19.0
rsa==4.9
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
s3fs==2023.6.0
s3transfer==0.6.1
safetensors==0.4.2
scenic @ file:///home/ubuntu/scenic
scikit-image==0.19.3
scikit-learn==1.1.1
scipy @ file:///home/conda/feedstock_root/build_artifacts/scipy_1658810968466/work
sentencepiece==0.1.97
sentry-sdk==2.8.0
setproctitle==1.3.3
shellingham==1.5.4
six==1.16.0
slack_sdk==3.31.0
smart-open==7.0.4
smmap==5.0.1
sniffio==1.3.1
sortedcontainers @ file:///home/conda/feedstock_root/build_artifacts/sortedcontainers_1621217038088/work
soupsieve==2.5
SQLAlchemy==2.0.31
sqlparse==0.5.0
stack-data==0.6.3
starlette==0.37.2
sympy==1.12
tabulate==0.9.0
tblib @ file:///home/conda/feedstock_root/build_artifacts/tblib_1616261298899/work
tenacity==8.5.0
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorboardX==2.6.2.2
tensorflow==2.11.0
tensorflow-addons==0.19.0
tensorflow-datasets==4.8.2
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.30.0
tensorflow-metadata==1.12.0
tensorstore==0.1.31
termcolor==2.2.0
textual==0.71.0
threadpoolctl==3.1.0
tifffile==2023.2.2
tiktoken==0.7.0
timm==0.9.16
tokenizers==0.19.1
toml==0.10.2
tomli==2.0.1
toolz @ file:///home/conda/feedstock_root/build_artifacts/toolz_1657485559105/work
torch==2.3.0
torch-optimizer==0.3.0
torchaudio==2.0.2
torchmetrics==1.3.2
torchvision==0.18.0
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1648827245914/work
tqdm==4.66.2
traitlets==5.3.0
transformers==4.40.2
treelite==2.4.0
treelite-runtime==2.4.0
triton==2.3.0
typeguard==2.13.3
typer==0.12.3
types-python-dateutil==2.9.0.20240316
typing_extensions==4.12.2
tzdata==2024.1
uc-micro-py==1.0.3
ucx-py @ file:///opt/conda/envs/rapids/conda-bld/work
ujson==5.10.0
Unidecode==1.3.8
uniseg==0.8.0
urllib3==1.26.19
uvicorn==0.30.1
uvloop==0.19.0
validators==0.31.0
virtualenv==20.21.0
wandb==0.17.4
watchfiles==0.22.0
wcwidth==0.2.13
webdataset==0.2.48
websockets==11.0.3
Werkzeug==3.0.3
wikipedia==1.4.0
wrapt==1.14.1
xformers==0.0.26.post1
xmltodict==0.13.0
xxhash==3.4.1
yarl==1.9.4
zict @ file:///home/conda/feedstock_root/build_artifacts/zict_1651156074437/work
zipp==3.8.1
zstandard==0.22.0
zstd==1.5.5.1

GeorgiosSmyrnis commented 4 months ago

Interesting - given the above, it's probably an issue with the data rather than the environment.

In the example you provide, it seems that the text field points to a list of documents - is that right? Each line in the processed files should be a JSON object corresponding to a single document, where the text field is a single string.
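
For concreteness, each line of a processed shard should look something like this (illustrative values, any other fields omitted):

{"text": "First document, as one string."}
{"text": "Second document, as one string."}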

humzaiqbal commented 4 months ago

Hey @GeorgiosSmyrnis, apologies for the delay here, but I managed to root-cause this. It looks like this happens when I run with the C4 yaml: there is a function called split_line_modifier which splits the text on newlines. When we have text like the following (taken from a raw doc in the dataset)

"view cart menu separator categories menu separator faq\nadvanced search\ncategories \u00a0>\u00a0Handmade Shea Butter Soap (11)\nHandmade Cool Peppermint Shea Butter Soap 4oz.\n\u00a0\n\nHandmade Cool Peppermint Shea Butter Soap 4oz.\n\nPrice: $4.00 add to cart"

when we split on newlines we get:

['view cart menu separator categories menu separator faq', 'advanced search', 'categories \xa0>\xa0Handmade Shea Butter Soap (11)', 'Handmade Cool Peppermint Shea Butter Soap 4oz.', '\xa0', '', 'Handmade Cool Peppermint Shea Butter Soap 4oz.', '', 'Price: $4.00 add to cart']

If you'll notice, some of the entries are empty strings (''), and as a result they get saved into the processed JSON; when we then call tokenize on them we get empty lists, which triggers the error above. Wondering what makes sense here. My guess would be that before we save the processed JSON it would be good to get rid of any blank strings (see the sketch below). Do you think that makes sense to add to the code as a general post-processing step? If so I'm happy to put up a PR, or if not I may just do this on my own end.
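
Concretely, the kind of post-processing I have in mind is something like this (just a sketch; the function name and record layout are my own, not DCLM code):

# Sketch only - function name and record layout are assumptions, not DCLM code.
def drop_blank_lines(record: dict) -> dict:
    text = record.get("text")
    if isinstance(text, list):
        # Remove empty / whitespace-only lines so tokenization never sees "".
        record["text"] = [line for line in text if line.strip()]
    return record

print(drop_blank_lines({"text": ["some text", "", "  ", ""]}))
# -> {'text': ['some text']}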

GeorgiosSmyrnis commented 4 months ago

Hi @humzaiqbal,

The C4 reproduction yaml that we provide contains a join_lines_modifier function at a later stage, which combines the text back into one string, as well as a page length filter further on which should filter out remaining empty strings. Does the C4 processing complete without any issues? If so, then you should not have the above output in your final JSON.
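
Very roughly, those two later stages behave like this (an approximation for illustration, not the actual implementation):

# Rough approximations for illustration - not the actual DCLM implementations.
def join_lines(record: dict, separator: str = "\n") -> dict:
    # join_lines_modifier, roughly: collapse the list of lines back into one string.
    if isinstance(record.get("text"), list):
        record["text"] = separator.join(record["text"])
    return record

def passes_page_length_filter(record: dict, min_chars: int = 1) -> bool:
    # Page length filter, roughly: drop pages whose text ends up too short or empty
    # (the threshold here is illustrative, not the real value).
    return len(record.get("text", "")) >= min_chars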

In any case, I think that if the text field in the final JSON object is a list of strings and not a single string, then that would cause issues - so given the output you shared above, that might be the case.

humzaiqbal commented 4 months ago

I notice that it doesn't actually go the whole way through; it stops around the first exact_dedup step. The reason is that the code checks whether the function for the step is in something called GLOBAL_FUNCTIONS, which only contains exact_dedup, and when that check is met we break out of the loop. That means the other steps you mention above aren't being triggered, which may be the problem.
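
Roughly, my reading of the step loop is something like this (a simplified sketch, not the actual source):

# Simplified sketch of my reading of the loop - not the actual DCLM source.
GLOBAL_FUNCTIONS = {"exact_dedup"}

def run_steps(steps, dataset):
    for step in steps:
        if step["func"] in GLOBAL_FUNCTIONS:
            # The code breaks here, so every step after the first exact_dedup
            # (join_lines, length filters, ...) never runs. Changing this to
            # `continue` is the workaround I tested below.
            break
        dataset = step["apply"](dataset)
    return dataset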

humzaiqbal commented 4 months ago

I did a test where I changed the break to a continue, and my text output definitely looks better:

{"text": "The same neocons who persuaded George W. Bush and crew to, in Ron Paul\u2019s inimitable words, \u201clie their way into invading Iraq\u201d in 2003, are beating the drums of war more loudly these days to attack Iran. It is remarkable how many of these war-mongers are former draft dodgers who wanted other Americans to fight the war in Vietnam.\n\nWith the exception of Ron Paul, who actually knows the history of US-Iranian relations, the Republican presidential contenders have declared their belligerency toward Iranian officials who they accuse of moving toward nuclear weapons.\n\nWhile many western and some Arab countries in the Gulf region have condemned Iran\u2019s alleged nuclear arms quest, Israel maintains some 200 ready nuclear weapons and has refused to sign the non-proliferation treaty, thereby avoiding the IAEA inspectors.\n\nIsraelis in the know have much to say. Defense minister, Ehud Barak

GeorgiosSmyrnis commented 4 months ago

Hi @humzaiqbal,

My apologies for the late response - I am looking into this issue and will update you when I have a solution. It seems the underlying problem is the exact dedup step not being run properly.

Can you point me to the file that causes this?

Thanks!

GeorgiosSmyrnis commented 4 months ago

@humzaiqbal I think I actually identified the issue and pushed a small fix in #23 - if you are using the HF provided data then this should make the entire process work without changes.

Please let me know if there are still issues!

humzaiqbal commented 3 months ago

Gotcha, thanks! One question: while I was waiting for this to be resolved, I wound up making a fix myself by changing the break to a continue, which seemed to get the processed files into the right format and finish processing. What kind of impact might this have? I'm wondering if I need to rerun processing.

GeorgiosSmyrnis commented 3 months ago

I think that changing that break to continue essentially skips the exact_dedup steps defined in the yaml file (so the exact duplicate filters are not applied). I believe that this does affect results, unfortunately - sorry for the inconvenience!

humzaiqbal commented 3 months ago

Gotcha, makes sense - thanks! Will close this issue since the original problem is resolved.