stanford-crfm / levanter

Legible, Scalable, Reproducible Foundation Models with Named Tensors and Jax
https://levanter.readthedocs.io/en/latest/
Apache License 2.0

Ray Cluster Error at tokenizing documents #667

Open · Ivan-Zhou opened 1 month ago

Ivan-Zhou commented 1 month ago

Below is the log from the unit test:

(ChunkCacheBuilder pid=69299) 2024-07-21 17:11:04,538 - levanter.data.shard_cache.builder::tmpaadrd2z7/cache - INFO - Starting cache build for 10 shards
(ChunkCacheBroker pid=69263) 2024-07-21 17:11:01,152 - levanter.data.shard_cache - INFO - Finalizing cache /var/folders/jm/jpc4s6kn3t98gt3rtmrtjmjm0000gn/T/tmpbwjrn1_1/cache...
(ChunkCacheBuilder pid=69270) 2024-07-21 17:11:01,146 - levanter.data.shard_cache - INFO - Shard 0 finished
2024-07-21 17:11:11,335 ERROR worker.py:406 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::ChunkCacheBroker.finished_sentinel() (pid=69291, ip=127.0.0.1, actor_id=11422f45e28b6d442798cc5b01000000, repr=<levanter.data.shard_cache.ChunkCacheBroker object at 0x13726c490>)
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/Users/ivan/dev/levanter/src/levanter/data/shard_cache.py", line 1076, in finished_sentinel
    await self._finished_sentinel
ray.exceptions.OwnerDiedError: Failed to retrieve object 3bc604631e19bb66ffffffffffffffffffffffff0100000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs`) for more information about the Python worker failure.
2024-07-21 17:11:11,341 ERROR worker.py:406 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::ChunkCacheBroker.finished_sentinel() (pid=69291, ip=127.0.0.1, actor_id=11422f45e28b6d442798cc5b01000000, repr=<levanter.data.shard_cache.ChunkCacheBroker object at 0x13726c490>)
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/Users/ivan/dev/levanter/src/levanter/data/shard_cache.py", line 1076, in finished_sentinel
    await self._finished_sentinel
ray.exceptions.OwnerDiedError: Failed to retrieve object 3bc604631e19bb66ffffffffffffffffffffffff0100000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs`) for more information about the Python worker failure.
============================================================================================================ warnings summary ============================================================================================================
tests/test_tokenized_document_cache.py::test_doc_cache_reproduces_data_multi_docs_per_batch_sharded[1]
tests/test_tokenized_document_cache.py::test_doc_cache_reproduces_data_multi_docs_per_batch_sharded[2]
tests/test_tokenized_document_cache.py::test_doc_cache_reproduces_data_multi_docs_per_batch_sharded[3]
tests/test_tokenized_document_cache.py::test_doc_cache_reproduces_data_multi_docs_per_batch_sharded[8]
tests/test_tokenized_document_cache.py::test_doc_cache_sharding
  /Users/ivan/dev/miniconda3/envs/levanter/lib/python3.10/site-packages/dataclasses_json/core.py:189: RuntimeWarning: 'NoneType' object value of non-optional type metadata detected when decoding CacheLedger.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================================== short test summary info =========================================================================================================
FAILED test_tokenized_document_cache.py::test_doc_cache_reproduces_data_one_batch_per_shard - ray.exceptions.RayTaskError(OwnerDiedError): ray::ChunkCacheBroker.finished_sentinel() (pid=69291, ip=127.0.0.1, actor_id=11422f45e28b6d442798cc5b01000000, repr=<levanter.data.shard_cache.ChunkCacheBroker object at 0x13726c490>)
=========================================================================================== 1 failed, 8 passed, 5 warnings in 73.16s (0:01:13) ============================================================
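
For what it's worth, the error message above suggests recording ObjectRef creation sites to find the owner that died. A minimal sketch of how that could be enabled when reproducing locally (the env var name is taken verbatim from the error message; the rest is just generic Ray setup, not levanter-specific code):

```python
import os

# Must be set before Ray starts, per the OwnerDiedError message above.
os.environ["RAY_record_ref_creation_sites"] = "1"

import ray

# With the flag set, object-loss errors should include where the lost
# ObjectRef was created, which helps identify which worker/actor owned it.
ray.init()
```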
abhinavg4 commented 1 month ago

A few other notes:

  1. Branches to try this issue on: fineweb_data and dclm
  2. The issue might be external to levanter, as I had processed a dataset (fineweb_md) earlier with the same code.
dlwh commented 1 month ago

I'm really confused. Tests pass locally and I was able to run a job to completion. Can you download /tmp/ray/session_latest/logs/?

abhinavg4 commented 1 month ago

I am using dlwh/fineweb_llama_txt and it fails when the number of files is very high (for fineweb).

Also, I'm using a shuffle buffer of 100000. The error I'm getting is in the log below.

This link might be helpful:

  1. https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html

Maybe we can just spill to GCS or some other store instead of the TPU VMs' local disk? (See the sketch after the log below.)

[2024-07-26 19:42:17,659 I 2062 2062] (raylet) local_object_manager.cc:245: :info_message:Spilled 1530 MiB, 1999 objects, write throughput 586 MiB/s.
[2024-07-26 19:42:17,664 I 2062 2062] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-07-26 19:42:17,764 I 2062 2062] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-07-26 19:42:17,783 I 2062 2062] (raylet) node_manager.cc:525: [state-dump] NodeManager:
[state-dump] Node ID: 896e4bb4773427857862284c837792ebc366061252134ea14bf51abe
[state-dump] Node name: 10.130.3.18
[state-dump] InitialConfigResources: {accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:__internal_head__: 10000, memory: 3816690937850000,
node:10.130.3.18: 10000, TPU-v4-256-head: 10000, object_store_memory: 326417475580000, TPU: 40000, CPU: 2400000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 896e4bb4773427857862284c837792ebc366061252134ea14bf51abe =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: 7369852251879098477 Local resources: {"total":{memory: [3816690937850000], object_store_memory: [326417475580000], node:10.130.3.18: [10000], node:__internal_head__: [10000], abhinavg-fw-txt-preempt-v4-256: [10000], accelerator_type:TPU-V4: [10000], TPU: [10000, 10000, 10000, 10000], TPU-v4-256-head: [10000], CPU: [2400000]}}, "available": {memory: [3816690937850000], object_store_memory: [80228517160000], node:10.130.3.18: [10000], node:__internal_head__: [10000], abhinavg-fw-txt-preempt-v4-256: [10000], accelerator_type:TPU-V4: [10000], TPU: [10000, 10000, 10000, 10000], TPU-v4-256-head: [10000], CPU: [1915000]}}, "labels":{"ray.io/node_id":"896e4bb4773427857862284c837792ebc366061252134ea14bf51abe",} is_draining: 0 is_idle: 0 Cluster resources:
node id: 2657070934577218762{"total":{TPU: 40000, CPU: 2400000, object_store_memory: 326417475580000, abhinavg-fw-txt-preempt-v4-256: 10000, memory: 3925625952250000, accelerator_type:TPU-V4: 10000, node:10.130.3.34: 10000}}, "available": {accelerator_type:TPU-V4: 10000, memory: 3925625952250000, object_store_memory: 293687982380000, CPU: 1795000, TPU: 40000, node:10.130.3.34: 10000, abhinavg-fw-txt-preempt-v4-256: 10000}}, "labels":{"ray.io/node_id":"d6c1845c76582eee006857ae81b0ef895deabaaf0b3e2b122207d24e",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: 305753936840358228{"total":{accelerator_type:TPU-V4: 10000, node:10.130.3.26: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, object_store_memory: 326417475580000, CPU: 2400000, TPU: 40000, memory: 3925650733050000}}, "available": {node:10.130.3.26: 10000, CPU: 2275000, TPU: 40000, object_store_memory: 97949744960000, memory: 3925650733050000, accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000}}, "labels":{"ray.io/node_id":"3eec0bb7d611e7aad2861a1efdf26a37a9c3d6e960425f200761470d",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: 1294823379791528053{"total":{TPU: 40000, object_store_memory: 326417475580000, node:10.130.2.207: 10000, memory: 3925623003130000, abhinavg-fw-txt-preempt-v4-256: 10000, accelerator_type:TPU-V4: 10000, CPU: 2400000}}, "available": {memory: 3925623003130000, accelerator_type:TPU-V4: 10000, object_store_memory: 298065983560000, node:10.130.2.207: 10000, TPU: 40000, CPU: 1795000, abhinavg-fw-txt-preempt-v4-256: 10000}}, "labels":{"ray.io/node_id":"2f831786398540a40003d66664592345bd68e714da3e718e1cef1df5",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: 8300664650656660378{"total":{node:10.130.3.30: 10000, memory: 3925644548090000, abhinavg-fw-txt-preempt-v4-256: 10000, object_store_memory: 326417475580000, CPU: 2400000, TPU: 40000, accelerator_type:TPU-V4: 10000}}, "available": {CPU: 2275000, TPU: 40000, memory: 3925644548090000, accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:10.130.3.30: 10000, object_store_memory: 78596266190000}}, "labels":{"ray.io/node_id":"6389a432307b5cb2dd0deca0dbac5b6695695486218e47b86d678dce",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: 7369852251879098477{"total":{memory: 3816690937850000, node:10.130.3.18: 10000, object_store_memory: 326417475580000, TPU-v4-256-head: 10000, CPU: 2400000, TPU: 40000, accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:__internal_head__: 10000}}, "available": {object_store_memory: 80228517160000, TPU-v4-256-head: 10000, CPU: 1915000, node:10.130.3.18: 10000, accelerator_type:TPU-V4: 10000, TPU: 40000, abhinavg-fw-txt-preempt-v4-256: 10000, memory: 3816690937850000, node:__internal_head__: 10000}}, "labels":{"ray.io/node_id":"896e4bb4773427857862284c837792ebc366061252134ea14bf51abe",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: 6093590469938625960{"total":{memory: 3925636888570000, node:10.130.3.42: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, object_store_memory: 326417475580000, CPU: 2400000, TPU: 40000, accelerator_type:TPU-V4: 10000}}, "available": {accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:10.130.3.42: 10000, TPU: 40000, memory: 3925636888570000, object_store_memory: 192151657400000, CPU: 2035000}}, "labels":{"ray.io/node_id":"d890a7e9a2827c5a73f42f3cbf5c463c528912f9a00d7b2b2790dcda",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}
node id: -5257004634468912003{"total":{TPU: 40000, object_store_memory: 326417475580000, node:10.130.3.46: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, memory: 3925643810810000, accelerator_type:TPU-V4: 10000, CPU: 2400000}}, "available": {TPU: 40000, CPU: 2395000, node:10.130.3.46: 10000, abhinavg-fw-txt-preempt-v4-256: 1000
...skipping...
[2024-07-26 19:42:19,652 E 2062 2062] (raylet) local_object_manager.cc:243: :info_message:Spilled 7641 MiB, 9998 objects, write throughput 1660 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
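
To make the suggestion above concrete: Ray's spill target is configurable at `ray.init` time, per the object-spilling doc linked above. A rough sketch of what redirecting spill might look like (the bucket URI is a placeholder, spilling to GCS via `smart_open` would need `smart_open[gcs]` installed, and I haven't tested this here):

```python
import json
import ray

# Sketch only: redirect Ray object spilling away from the default local path.

# Option 1: spill to a specific local/attached filesystem path.
local_spill = {"type": "filesystem", "params": {"directory_path": "/tmp/ray_spill"}}

# Option 2 (per the Ray object-spilling docs): spill to remote storage via smart_open.
# "gs://my-bucket/ray-spill" is a placeholder URI, not a real bucket.
remote_spill = {"type": "smart_open", "params": {"uri": "gs://my-bucket/ray-spill"}}

ray.init(
    _system_config={
        "max_io_workers": 4,  # more IO workers for parallel spilling
        "object_spilling_config": json.dumps(remote_spill),
    }
)
```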
dlwh commented 1 month ago

Sure, spilling to GCS sounds good. Want to try that?

dlwh commented 1 month ago

Kiloshard probably would have fixed this, actually.

dlwh commented 1 month ago

(Really, we just need better backpressure.)
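
For context, one common way to get backpressure in Ray is to cap the number of in-flight tasks with `ray.wait`, so unconsumed results can't pile up in the object store and trigger heavy spilling. This is only a generic sketch (the task name and the cap are made up for illustration), not how levanter's shard_cache is actually structured:

```python
import ray

ray.init()

@ray.remote
def process_shard(shard_id: int) -> int:
    # Placeholder for the real per-shard tokenization work.
    return shard_id

MAX_IN_FLIGHT = 8  # illustrative cap on concurrently pending tasks
in_flight = []

for shard_id in range(1000):
    if len(in_flight) >= MAX_IN_FLIGHT:
        # Wait for at least one task to finish (and consume its result)
        # before submitting more, so results don't accumulate in plasma.
        done, in_flight = ray.wait(in_flight, num_returns=1)
        ray.get(done)
    in_flight.append(process_shard.remote(shard_id))

ray.get(in_flight)  # drain the remainder
```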