ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Data] Add a dict to dataset with `add_column` and update it with `map`, but get wrong result. #42190

Closed zhijianma closed 8 months ago

zhijianma commented 8 months ago

What happened + What you expected to happen

  1. I use add_column to add an empty dict column, meta, to the dataset and then use map to update the meta dict, but the resulting dataset is wrong.
    def test1():
        ds = ray.data.range(40)
        ds = ds.add_column('meta', lambda df: [{}] * len(df))

        def fn(sample):
            sample['meta']['id'] = sample['id']
            print(sample)
            return sample
        ds = ds.map(fn)
  2. Expected behaviour
    sample[0]['meta']['id'] == 0
    sample[1]['meta']['id'] == 1
    ...
    sample[38]['meta']['id'] == 38
    sample[39]['meta']['id'] == 39

    But I get:

    sample[0]['meta']['id'] == 1
    sample[1]['meta']['id'] == 1
    ...
    sample[38]['meta']['id'] == 39
    sample[39]['meta']['id'] == 39
  3. Log

RAY_DEDUP_LOGS=0 python test.py

2024-01-05 09:43:24,850 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379...
2024-01-05 09:43:24,868 INFO worker.py:1642 -- Connected to Ray cluster.
2024-01-05 09:43:26,440 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange->MapBatches(process_batch)->Map(fn)]
2024-01-05 09:43:26,441 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-01-05 09:43:26,441 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 0, 'meta': {'id': 0}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 1, 'meta': {'id': 1}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 20, 'meta': {'id': 20}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 21, 'meta': {'id': 21}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 22, 'meta': {'id': 22}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 23, 'meta': {'id': 23}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 24, 'meta': {'id': 24}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 25, 'meta': {'id': 25}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 26, 'meta': {'id': 26}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 27, 'meta': {'id': 27}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 28, 'meta': {'id': 28}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 29, 'meta': {'id': 29}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 30, 'meta': {'id': 30}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 31, 'meta': {'id': 31}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 32, 'meta': {'id': 32}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 33, 'meta': {'id': 33}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 34, 'meta': {'id': 34}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 35, 'meta': {'id': 35}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 36, 'meta': {'id': 36}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 37, 'meta': {'id': 37}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 38, 'meta': {'id': 38}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 39, 'meta': {'id': 39}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73955) {'id': 2, 'meta': {'id': 2}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73955) {'id': 3, 'meta': {'id': 3}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73963) {'id': 12, 'meta': {'id': 12}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73963) {'id': 13, 'meta': {'id': 13}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73966) {'id': 18, 'meta': {'id': 18}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73966) {'id': 19, 'meta': {'id': 19}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73962) {'id': 10, 'meta': {'id': 10}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73962) {'id': 11, 'meta': {'id': 11}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73959) {'id': 4, 'meta': {'id': 4}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73959) {'id': 5, 'meta': {'id': 5}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73961) {'id': 8, 'meta': {'id': 8}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73961) {'id': 9, 'meta': {'id': 9}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73960) {'id': 6, 'meta': {'id': 6}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73960) {'id': 7, 'meta': {'id': 7}}                                                                    
test1 ds =  [{'id': 0, 'meta': {'id': 1}}, {'id': 1, 'meta': {'id': 1}}, {'id': 20, 'meta': {'id': 21}}, {'id': 21, 'meta': {'id': 21}}, {'id': 22, 'meta': {'id': 23}}, {'id': 23, 'meta': {'id': 23}}, {'id': 24, 'meta': {'id': 25}}, {'id': 25, 'meta': {'id': 25}}, {'id': 26, 'meta': {'id': 27}}, {'id': 27, 'meta': {'id': 27}}, {'id': 28, 'meta': {'id': 29}}, {'id': 29, 'meta': {'id': 29}}, {'id': 30, 'meta': {'id': 31}}, {'id': 31, 'meta': {'id': 31}}, {'id': 32, 'meta': {'id': 33}}, {'id': 33, 'meta': {'id': 33}}, {'id': 34, 'meta': {'id': 35}}, {'id': 35, 'meta': {'id': 35}}, {'id': 36, 'meta': {'id': 37}}, {'id': 37, 'meta': {'id': 37}}, {'id': 2, 'meta': {'id': 3}}, {'id': 3, 'meta': {'id': 3}}, {'id': 38, 'meta': {'id': 39}}, {'id': 39, 'meta': {'id': 39}}, {'id': 10, 'meta': {'id': 11}}, {'id': 11, 'meta': {'id': 11}}, {'id': 12, 'meta': {'id': 13}}, {'id': 13, 'meta': {'id': 13}}, {'id': 18, 'meta': {'id': 19}}, {'id': 19, 'meta': {'id': 19}}, {'id': 4, 'meta': {'id': 5}}, {'id': 5, 'meta': {'id': 5}}, {'id': 6, 'meta': {'id': 7}}, {'id': 7, 'meta': {'id': 7}}, {'id': 8, 'meta': {'id': 9}}, {'id': 9, 'meta': {'id': 9}}, {'id': 14, 'meta': {'id': 15}}, {'id': 15, 'meta': {'id': 15}}, {'id': 16, 'meta': {'id': 17}}, {'id': 17, 'meta': {'id': 17}}]
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73965) {'id': 16, 'meta': {'id': 16}}
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73965) {'id': 17, 'meta': {'id': 17}}
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73964) {'id': 14, 'meta': {'id': 14}}
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73964) {'id': 15, 'meta': {'id': 15}}
2024-01-05 09:43:29,506 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange->Map(fn)]
2024-01-05 09:43:29,506 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-01-05 09:43:29,506 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
test2 ds =  [{'id': 0, 'meta': {'id': 0}}, {'id': 1, 'meta': {'id': 1}}, {'id': 2, 'meta': {'id': 2}}, {'id': 3, 'meta': {'id': 3}}, {'id': 4, 'meta': {'id': 4}}, {'id': 5, 'meta': {'id': 5}}, {'id': 6, 'meta': {'id': 6}}, {'id': 7, 'meta': {'id': 7}}, {'id': 8, 'meta': {'id': 8}}, {'id': 9, 'meta': {'id': 9}}, {'id': 12, 'meta': {'id': 12}}, {'id': 13, 'meta': {'id': 13}}, {'id': 14, 'meta': {'id': 14}}, {'id': 15, 'meta': {'id': 15}}, {'id': 16, 'meta': {'id': 16}}, {'id': 17, 'meta': {'id': 17}}, {'id': 18, 'meta': {'id': 18}}, {'id': 19, 'meta': {'id': 19}}, {'id': 20, 'meta': {'id': 20}}, {'id': 21, 'meta': {'id': 21}}, {'id': 22, 'meta': {'id': 22}}, {'id': 23, 'meta': {'id': 23}}, {'id': 24, 'meta': {'id': 24}}, {'id': 25, 'meta': {'id': 25}}, {'id': 26, 'meta': {'id': 26}}, {'id': 27, 'meta': {'id': 27}}, {'id': 28, 'meta': {'id': 28}}, {'id': 29, 'meta': {'id': 29}}, {'id': 30, 'meta': {'id': 30}}, {'id': 31, 'meta': {'id': 31}}, {'id': 32, 'meta': {'id': 32}}, {'id': 33, 'meta': {'id': 33}}, {'id': 34, 'meta': {'id': 34}}, {'id': 35, 'meta': {'id': 35}}, {'id': 36, 'meta': {'id': 36}}, {'id': 37, 'meta': {'id': 37}}, {'id': 38, 'meta': {'id': 38}}, {'id': 39, 'meta': {'id': 39}}, {'id': 10, 'meta': {'id': 10}}, {'id': 11, 'meta': {'id': 11}}]

Versions / Dependencies

Package Version Editable project location


absl-py 1.3.0 accelerate 0.20.3 addict 2.4.0 aie-ipyleaflet 0.15.1 aiofiles 23.1.0 aiohttp 3.8.4 aiosignal 1.3.1 alabaster 0.7.12 albumentations 1.3.0 alembic 1.11.1 aliyun-python-sdk-core 2.13.36 aliyun-python-sdk-kms 2.16.0 altair 4.2.2 anaconda-client 1.11.0 anaconda-navigator 2.3.1 anaconda-project 0.11.1 anyio 3.6.2 appdirs 1.4.4 applaunchservices 0.3.0 appnope 0.1.2 appscript 1.1.2 APScheduler 3.10.1 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 array-record 0.4.0 arrow 1.2.2 astor 0.8.1 astroid 2.11.7 astropy 5.1 asttokens 2.4.1 astunparse 1.6.3 async-timeout 4.0.2 atomicwrites 1.4.0 attrs 23.1.0 audioread 3.0.0 Automat 20.2.0 autopage 0.5.1 autopep8 1.6.0 av 10.0.0 Babel 2.9.1 backcall 0.2.0 backports.functools-lru-cache 1.6.4 backports.tempfile 1.0 backports.weakref 1.0.post1 base58 2.1.1 bcrypt 3.2.0 beautifulsoup4 4.11.1 binaryornot 0.4.4 bitarray 2.5.1 bitsandbytes 0.38.0 bkcharts 0.2 black 22.3.0 bleach 4.1.0 blinker 1.6.2 blis 0.7.9 bokeh 2.4.3 boltons 23.0.0 boto3 1.16.49 botocore 1.19.63 Bottleneck 1.3.5 Brotli 1.0.9 brotlipy 0.7.0 cachetools 5.2.0 catalogue 2.0.8 certifi 2022.12.7 cffi 1.15.1 cfgv 3.3.1 chardet 4.0.0 charset-normalizer 3.1.0 chex 0.1.7 click 8.1.3 cliff 4.3.0 cloudpickle 2.0.0 clu 0.0.9 clyent 1.2.2 cmaes 0.9.1 cmd2 2.4.3 codecarbon 2.2.3 colorama 0.4.5 colorcet 3.0.0 coloredlogs 15.0.1 colorlog 6.7.0 commonmark 0.9.1 conda 23.3.1 conda-build 3.22.0 conda-content-trust 0.1.3 conda-pack 0.6.0 conda-package-handling 1.9.0 conda-repo-cli 1.0.20 conda-token 0.4.0 conda-verify 3.4.2 confection 0.0.3 constantly 15.1.0 contextlib2 21.6.0 contourpy 1.0.7 cookiecutter 1.7.3 courlan 0.9.3 crcmod 1.7 cryptography 38.0.1 cssselect 1.1.0 cycler 0.11.0 cymem 2.0.7 Cython 0.29.32 cytoolz 0.11.0 daal4py 2021.6.0 dask 2022.7.0 data-juicer 0.1.0 dataclasses 0.6 datasets 2.11.0 datashader 0.14.1 datashape 0.5.4 datasketch 1.5.9 dateparser 1.1.8 debugpy 1.5.1 decorator 4.4.2 defusedxml 0.7.1 descartes 1.1.0 diff-match-patch 20200713 diffusers 0.16.1 
dill 0.3.4 distlib 0.3.6 distributed 2022.7.0 dlib 19.24.2 dm-tree 0.1.8 docopt 0.6.2 docstring-parser 0.15 docutils 0.18.1 easydict 1.10 editdistance 0.6.2 einops 0.6.1 embeddings 0.0.8 emoji 2.2.0 en-core-web-md 3.5.0 entrypoints 0.4 et-xmlfile 1.1.0 etils 1.3.0 evaluate 0.3.0 exceptiongroup 1.1.2 executing 2.0.1 fairscale 0.4.12 Faker 18.9.0 fastapi 0.95.1 fastcore 1.5.27 fastdownload 0.0.7 fastjsonschema 2.16.2 fastprogress 1.0.3 fasttext 0.9.2 ffmpeg 1.4 ffmpeg-python 0.2.0 ffmpy 0.3.0 filelock 3.11.0 fire 0.4.0 flake8 4.0.1 Flask 1.1.2 flatbuffers 2.0.7 flax 0.6.11 fonttools 4.39.3 frozendict 2.3.8 frozenlist 1.3.3 fsspec 2023.3.0 ftfy 6.1.1 future 0.18.2 fuzzywuzzy 0.18.0 gast 0.4.0 gdown 4.7.1 gensim 4.1.2 gin-config 0.5.0 gitdb 4.0.10 GitPython 3.1.31 glob2 0.7 gmpy2 2.1.2 google-auth 2.21.0 google-auth-oauthlib 1.0.0 google-pasta 0.2.0 googleapis-common-protos 1.59.1 gradio 3.35.2 gradio_client 0.2.7 graphviz 0.20.1 greenlet 1.1.1 grpcio 1.50.0 h11 0.14.0 h5py 3.7.0 harvesttext 0.8.1.8 HeapDict 1.0.1 hjson 3.1.0 holoviews 1.15.0 htmldate 1.4.3 httpcore 0.17.0 httpx 0.24.0 huggingface-hub 0.15.1 humanfriendly 10.0 hvplot 0.8.0 hyperlink 21.0.0 hypothesis 6.80.0 identify 2.5.5 idna 3.4 imagecodecs 2021.8.26 imagededup 0.3.2 imageio 2.9.0 imageio-ffmpeg 0.4.7 imagesize 1.4.1 imgaug 0.4.0 immutabledict 2.2.4 importlib 1.0.4 importlib-metadata 4.11.3 importlib-resources 5.12.0 incremental 21.3.0 inflate64 0.3.1 inflection 0.5.1 iniconfig 1.1.1 intake 0.6.5 internetarchive 3.5.0 intervaltree 3.1.0 ipadic 1.0.0 ipykernel 6.15.2 ipython 8.18.1 ipython-genutils 0.2.0 ipywidgets 7.6.5 isodate 0.6.1 isort 4.3.21 itemadapter 0.3.0 itemloaders 1.0.4 itsdangerous 2.0.1 jax 0.3.25 jaxlib 0.3.25 jdcal 1.4.1 jedi 0.18.1 jellyfish 0.9.0 jieba 0.42.1 Jinja2 3.1.2 jinja2-time 0.2.0 jiwer 2.2.0 jmespath 0.10.0 joblib 1.2.0 json-tricks 3.16.1 json5 0.9.6 jsonargparse 4.21.1 jsonlines 3.1.0 jsonpatch 1.32 jsonplus 0.8.0 jsonpointer 2.1 jsonschema 4.17.3 jupyter 1.0.0 
jupyter_client 7.3.4 jupyter-console 6.4.3 jupyter_core 4.11.1 jupyter-server 1.18.1 jupyterlab 3.4.4 jupyterlab-pygments 0.1.2 jupyterlab-server 2.10.3 jupyterlab-widgets 1.0.0 just-testsimhash-pybind 0.0.1 jusText 3.0.0 kaleido 0.2.1 kenlm 0.0.0 keras 2.12.0 keyring 23.4.0 kiwisolver 1.4.4 kornia 0.6.8 langcodes 3.3.0 langid 1.1.6 lazy-object-proxy 1.6.0 Levenshtein 0.21.1 libarchive-c 2.9 libclang 16.0.0 librosa 0.8.0 linkify-it-py 2.0.0 livereload 2.6.3 llvmlite 0.39.1 lmdb 1.3.0 locket 1.0.0 loguru 0.5.3 lpips 0.1.4 ltp 4.2.13 ltp-core 0.1.4 ltp-extension 0.1.10 lxml 4.9.2 lz4 3.1.3 Mako 1.2.4 Markdown 3.3.4 markdown-it-py 2.2.0 MarkupSafe 2.1.2 matplotlib 3.7.1 matplotlib-inline 0.1.6 mccabe 0.6.1 mdit-py-plugins 0.3.3 mdurl 0.1.2 megatron-util 1.3.2 mesh-tensorflow 0.1.21 mistune 0.8.4 mkl-fft 1.3.1 mkl-random 1.2.2 mkl-service 2.4.0 ml-collections 0.1.1 ml-datasets 0.2.0 ml-dtypes 0.2.0 mmcls 0.24.1 mmdet 2.25.3 mock 2.0.0 modelscope 1.9.5 moviepy 1.0.3 mpmath 1.2.1 msgpack 1.0.3 multidict 6.0.4 multipledispatch 0.6.0 multiprocess 0.70.12 multivolumefile 0.2.3 munkres 1.1.4 murmurhash 1.0.9 mypy 1.0.1 mypy-extensions 0.4.3 navigator-updater 0.3.0 nbclassic 0.3.5 nbclient 0.5.13 nbconvert 6.4.4 nbformat 5.5.0 nest-asyncio 1.5.5 networkx 2.8.4 nh3 0.2.15 ninja 1.11.1 nlpaug 1.1.11 nltk 3.5 nodeenv 1.7.0 nose 1.3.7 notebook 6.4.12 numba 0.56.4 numexpr 2.8.3 numpy 1.23.5 numpydoc 1.4.0 nuscenes-devkit 1.1.9 oauthlib 3.2.2 olefile 0.46 onnxruntime 1.13.1 OpenCC 1.1.6 opencc-python-reimplemented 0.1.7 opencv-python 4.6.0.66 opencv-python-headless 4.6.0.66 openpyxl 3.0.10 opt-einsum 3.3.0 optax 0.1.5 optuna 2.10.0 orjson 3.8.10 oss2 2.16.0 packaging 23.2 pai-easycv 0.7.0 pandas 2.0.0 pandocfilters 1.5.0 panel 0.13.1 param 1.12.0 parsel 1.6.0 parso 0.8.3 partd 1.2.0 pathlib 1.0.1 pathspec 0.9.0 pathy 0.10.2 patsy 0.5.2 pbr 5.11.1 pdfminer 20191125 pdfminer.six 20221105 pdfplumber 0.9.0 pep8 1.7.1 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.5.0 pip 23.1.2 pkginfo 1.8.2 
platformdirs 2.5.2 plotly 5.14.1 pluggy 1.0.0 ply 3.11 pooch 1.7.0 portalocker 2.7.0 poyo 0.5.0 pre-commit 3.2.1 preshed 3.0.8 prettytable 3.5.0 proglog 0.1.10 prometheus-client 0.14.1 promise 2.3 prompt-toolkit 3.0.41 Protego 0.1.16 protobuf 3.20.3 psutil 5.9.0 psycopg2 2.8.6 ptyprocess 0.7.0 pure-eval 0.2.2 py 1.11.0 py-cpuinfo 9.0.0 py-data-juicer 0.1.2 /Users/mazhijian/Documents/Project_2023/P01_LLM/C02_Solutions/data-juicer py4j 0.10.9.7 py7zr 0.20.5 pyarrow 12.0.0 pyasn1 0.4.8 pyasn1-modules 0.2.8 pybcj 1.0.1 pybind11 2.10.4 pyclipper 1.3.0.post4 pycocotools 2.0.6 pycodestyle 2.8.0 pycosat 0.6.3 pycparser 2.21 pycryptodome 3.15.0 pycryptodomex 3.18.0 pyct 0.4.8 pycurl 7.45.1 pydantic 1.7.4 pydeck 0.8.1b0 PyDispatcher 2.0.5 pydocstyle 6.1.1 pydub 0.25.1 pyerfa 2.0.0 pyflakes 2.4.0 pyglove 0.3.0 Pygments 2.15.1 PyHamcrest 2.0.2 PyJWT 2.6.0 pylint 2.14.5 pyls-spyder 0.4.0 pyltp 0.4.0 Pympler 1.0.1 pynvml 11.5.0 pyobjc-core 8.5 pyobjc-framework-Cocoa 8.5 pyobjc-framework-CoreServices 8.5 pyobjc-framework-FSEvents 8.5 pyodbc 4.0.34 pyOpenSSL 22.0.0 pyparsing 3.0.9 pyperclip 1.8.2 pypinyin 0.49.0 pyplumber 0.1.9 pyppmd 1.0.0 PyQt5-sip 12.11.0 pyquaternion 0.9.9 pyrsistent 0.19.3 PySocks 1.7.1 pyspark 3.4.0 pytest 7.1.2 pytest-timeout 1.4.2 pythainlp 4.0.2 python-crfsuite 0.9.9 python-dateutil 2.8.2 python-docx 0.8.11 python-Levenshtein 0.21.1 python-louvain 0.16 python-lsp-black 1.2.1 python-lsp-jsonrpc 1.0.0 python-lsp-server 1.5.0 python-multipart 0.0.6 python-pptx 0.6.21 python-slugify 8.0.1 python-snappy 0.6.0 pytorch-metric-learning 1.6.3 pytz 2023.3 pytz-deprecation-shim 0.1.0.post0 pyvi 0.1.1 pyviz-comms 2.0.2 PyWavelets 1.3.0 PyYAML 5.4.1 pyzmq 23.2.0 pyzstd 0.15.9 QDarkStyle 3.0.2 qstylizer 0.1.10 QtAwesome 1.0.3 qtconsole 5.3.2 QtPy 2.2.0 qudida 0.0.4 queuelib 1.5.0 rapidfuzz 2.13.2 ray 2.7.1 rdflib 6.3.2 readme-renderer 42.0 recommonmark 0.7.1 redis 4.5.5 regex 2022.7.9 requests 2.28.2 requests-file 1.5.1 requests-oauthlib 1.3.1 requests-toolbelt 1.0.0 
resampy 0.4.2 responses 0.18.0 rfc3986 2.0.0 rich 13.3.5 rope 0.22.0 rouge 1.0.1 rouge-score 0.1.2 rsa 4.9 Rtree 0.9.7 ruamel.yaml 0.17.21 ruamel.yaml.clib 0.2.6 ruamel-yaml-conda 0.15.100 s3transfer 0.3.7 sacrebleu 2.0.0 sacremoses 0.0.53 safetensors 0.4.0 schema 0.7.5 scikit-image 0.19.3 scikit-learn 1.2.2 scikit-learn-intelex 2021.20221004.121333 scipy 1.11.3 Scrapy 2.6.2 seaborn 0.11.2 selectolax 0.3.13 semantic-version 2.10.0 Send2Trash 1.8.0 sentencepiece 0.1.95 seqeval 1.2.2 seqio 0.0.16 seqio-nightly 0.0.15.dev20230702 service-identity 18.1.0 setuptools 68.0.0 Shapely 1.8.5.post1 shotdetect-scenedetect-lgss 0.0.3 simhash-py 0.4.2 simhash-pybind 0.0.2 simplejson 3.18.0 sip 6.6.2 six 1.16.0 sklearn 0.0.post1 sklearn-crfsuite 0.3.6 smart-open 5.2.1 smmap 5.0.0 sniffio 1.3.0 snowballstemmer 2.2.0 sortedcollections 2.1.0 sortedcontainers 2.4.0 soundfile 0.12.1 soupsieve 2.3.1 spacy 3.5.0 spacy-legacy 3.0.12 spacy-loggers 1.0.4 spacy-pkuseg 0.0.32 Sphinx 5.0.2 sphinx-autobuild 2021.3.14 sphinx-rtd-theme 1.2.2 sphinxcontrib-applehelp 1.0.2 sphinxcontrib-devhelp 1.0.2 sphinxcontrib-htmlhelp 2.0.0 sphinxcontrib-jquery 4.1 sphinxcontrib-jsmath 1.0.1 sphinxcontrib-qthelp 1.0.3 sphinxcontrib-serializinghtml 1.1.5 spyder 5.3.3 spyder-kernels 2.3.3 SQLAlchemy 1.4.39 srsly 2.4.5 stack-data 0.6.3 stanza 1.7.0 starlette 0.26.1 statsmodels 0.13.2 stevedore 5.1.0 streamlit 1.25.0 subword-nmt 0.3.8 sympy 1.10.1 t5 0.9.4 tables 3.6.1 tabulate 0.8.10 TBB 0.2 tblib 1.7.0 tenacity 8.2.2 tensorboard 2.12.3 tensorboard-data-server 0.7.1 tensorboard-plugin-wit 1.8.1 tensorflow-datasets 4.9.2 tensorflow-estimator 2.12.0 tensorflow-hub 0.13.0 tensorflow-io-gcs-filesystem 0.32.0 tensorflow-metadata 1.13.1 tensorflow-text 2.12.1 tensorstore 0.1.40 termcolor 2.1.0 terminado 0.13.1 terminaltables 3.1.10 testpath 0.6.0 text-unidecode 1.3 textdistance 4.2.1 texttable 1.6.7 tf-slim 1.1.0 tfds-nightly 4.9.2.dev202307030045 thinc 8.1.10 thinc-apple-ops 0.1.3 thop 0.1.1.post2209072238 
threadpoolctl 2.2.0 three-merge 0.1.1 tifffile 2021.7.2 timm 0.6.11 tinycss 0.4 tld 0.13 tldextract 3.2.0 tokenizers 0.13.3 toml 0.10.2 tomli 1.2.3 tomlkit 0.11.1 toolz 0.12.0 torch 2.1.1 torch-struct 0.5 torchmetrics 0.10.3 torchvision 0.16.1 tornado 6.1 tqdm 4.66.1 trafilatura 1.6.0 traitlets 5.1.1 traittypes 0.2.1 trankit 1.1.1 transformers 4.31.0 twine 4.0.2 Twisted 22.2.0 typer 0.7.0 types-mock 5.0.0.7 types-requests 2.31.0.1 types-setuptools 68.0.0.0 types-urllib3 1.26.25.13 typeshed-client 2.3.0 typing 3.7.4.3 typing_extensions 4.5.0 tzdata 2023.3 tzlocal 4.3 uc-micro-py 1.0.1 ujson 5.4.0 ukkonen 1.0.1 Unidecode 1.2.0 urllib3 1.26.15 uvicorn 0.21.1 validators 0.20.0 virtualenv 20.17.1 w3lib 1.21.0 Wand 0.6.11 wasabi 0.10.1 watchdog 2.1.6 wcwidth 0.2.5 webencodings 0.5.1 websocket-client 0.58.0 websockets 11.0.1 Werkzeug 2.0.3 wget 3.2 whatthepatch 1.0.2 wheel 0.40.0 widgetsnbextension 3.5.2 wrapt 1.14.1 wurlitzer 3.0.2 xarray 0.20.1 xgboost 1.5.2 xlrd 2.0.1 XlsxWriter 3.0.3 xlwings 0.27.15 xtcocotools 1.12 xxhash 3.1.0 xyzservices 2022.9.0 yacs 0.1.8 yapf 0.31.0 yarl 1.8.2 zh-core-web-md 3.5.0 zhconv 1.4.3 zhon 1.1.5 zict 2.1.0 zipp 3.8.0 zope.interface 5.4.0 zstandard 0.21.0

Reproduction script

Source code in test.py

import ray
ray.init()

# The Result is Wrong.
def test1():
    ds = ray.data.range(40)
    ds = ds.add_column('meta',lambda df: [{}] * len(df))

    def fn(sample):
        sample['meta']['id'] = sample['id']
        print(sample)
        return sample
    ds = ds.map(fn)
    print('test1 ds = ', ds.take_all())

# The Result is Correct.
def test2():
    ds = ray.data.range(40)

    def fn(sample):
        if 'meta' not in sample:
            sample['meta'] = {}
        sample['meta']['id'] = sample['id']
        return sample
    ds = ds.map(fn)
    print('test2 ds = ', ds.take_all())

test1()
test2()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

scottjlee commented 8 months ago

This is because in this line:

ds = ds.add_column('meta',lambda df: [{}] * len(df))

the same dict object is used for each element in the resulting list, so every row in a batch shares one `meta` dict and each row's update overwrites the previous one. After updating this to:

ds = ds.add_column('meta',lambda df: [{} for _ in range(len(df))])

I get the following expected result:

test1 ds =  [{'id': 0, 'meta': {'id': 0}}, {'id': 1, 'meta': {'id': 1}}, {'id': 20, 'meta': {'id': 20}}, {'id': 21, 'meta': {'id': 21}}, {'id': 2, 'meta': {'id': 2}}, {'id': 3, 'meta': {'id': 3}}, {'id': 4, 'meta': {'id': 4}}, {'id': 5, 'meta': {'id': 5}}, {'id': 6, 'meta': {'id': 6}}, {'id': 7, 'meta': {'id': 7}}, {'id': 8, 'meta': {'id': 8}}, {'id': 9, 'meta': {'id': 9}}, {'id': 10, 'meta': {'id': 10}}, {'id': 11, 'meta': {'id': 11}}, {'id': 12, 'meta': {'id': 12}}, {'id': 13, 'meta': {'id': 13}}, {'id': 14, 'meta': {'id': 14}}, {'id': 15, 'meta': {'id': 15}}, {'id': 16, 'meta': {'id': 16}}, {'id': 17, 'meta': {'id': 17}}, {'id': 18, 'meta': {'id': 18}}, {'id': 19, 'meta': {'id': 19}}, {'id': 22, 'meta': {'id': 22}}, {'id': 23, 'meta': {'id': 23}}, {'id': 24, 'meta': {'id': 24}}, {'id': 25, 'meta': {'id': 25}}, {'id': 26, 'meta': {'id': 26}}, {'id': 27, 'meta': {'id': 27}}, {'id': 28, 'meta': {'id': 28}}, {'id': 29, 'meta': {'id': 29}}, {'id': 30, 'meta': {'id': 30}}, {'id': 31, 'meta': {'id': 31}}, {'id': 32, 'meta': {'id': 32}}, {'id': 33, 'meta': {'id': 33}}, {'id': 34, 'meta': {'id': 34}}, {'id': 35, 'meta': {'id': 35}}, {'id': 36, 'meta': {'id': 36}}, {'id': 37, 'meta': {'id': 37}}, {'id': 38, 'meta': {'id': 38}}, {'id': 39, 'meta': {'id': 39}}]
2024-01-08 14:53:37,353 INFO set_read_parallelism.py:115 -- Using autodetected parallelism=20 for stage ReadRange to satisfy parallelism at least twice the available number of CPUs (10).
2024-01-08 14:53:37,353 INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange->Map(fn)]
2024-01-08 14:53:37,353 INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-01-08 14:53:37,353 INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
test2 ds =  [{'id': 0, 'meta': {'id': 0}}, {'id': 1, 'meta': {'id': 1}}, {'id': 2, 'meta': {'id': 2}}, {'id': 3, 'meta': {'id': 3}}, {'id': 4, 'meta': {'id': 4}}, {'id': 5, 'meta': {'id': 5}}, {'id': 6, 'meta': {'id': 6}}, {'id': 7, 'meta': {'id': 7}}, {'id': 8, 'meta': {'id': 8}}, {'id': 9, 'meta': {'id': 9}}, {'id': 10, 'meta': {'id': 10}}, {'id': 11, 'meta': {'id': 11}}, {'id': 12, 'meta': {'id': 12}}, {'id': 13, 'meta': {'id': 13}}, {'id': 14, 'meta': {'id': 14}}, {'id': 15, 'meta': {'id': 15}}, {'id': 16, 'meta': {'id': 16}}, {'id': 17, 'meta': {'id': 17}}, {'id': 18, 'meta': {'id': 18}}, {'id': 19, 'meta': {'id': 19}}, {'id': 20, 'meta': {'id': 20}}, {'id': 21, 'meta': {'id': 21}}, {'id': 22, 'meta': {'id': 22}}, {'id': 23, 'meta': {'id': 23}}, {'id': 24, 'meta': {'id': 24}}, {'id': 25, 'meta': {'id': 25}}, {'id': 26, 'meta': {'id': 26}}, {'id': 27, 'meta': {'id': 27}}, {'id': 28, 'meta': {'id': 28}}, {'id': 29, 'meta': {'id': 29}}, {'id': 30, 'meta': {'id': 30}}, {'id': 31, 'meta': {'id': 31}}, {'id': 32, 'meta': {'id': 32}}, {'id': 33, 'meta': {'id': 33}}, {'id': 34, 'meta': {'id': 34}}, {'id': 35, 'meta': {'id': 35}}, {'id': 36, 'meta': {'id': 36}}, {'id': 37, 'meta': {'id': 37}}, {'id': 38, 'meta': {'id': 38}}, {'id': 39, 'meta': {'id': 39}}]
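
The aliasing can be reproduced without Ray. The following is a minimal plain-Python sketch; the helper `update_rows` is hypothetical and only mimics what the `map` step does to each row's meta dict (it is not how Ray Data is implemented):

```python
def update_rows(metas):
    """Mimic the map step: write each row's index into that row's meta dict."""
    for i, meta in enumerate(metas):
        meta["id"] = i
    return metas


n = 4

# [{}] * n builds a list holding n references to ONE dict object.
shared = update_rows([{}] * n)

# The comprehension evaluates {} once per element, giving n separate dicts.
distinct = update_rows([{} for _ in range(n)])

print(shared)    # [{'id': 3}, {'id': 3}, {'id': 3}, {'id': 3}]
print(distinct)  # [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}]

# Count the distinct underlying objects in each list.
print(len({id(m) for m in shared}), len({id(m) for m in distinct}))  # 1 4
```

With the shared dict, every write lands in the same object, so all rows end up showing the last value written. That matches the original output, where both rows of each two-row block report the second row's id.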

Please feel free to re-open the issue if I missed anything.

zhijianma commented 8 months ago

@scottjlee Thank you so much. It works for me.