Ivan-Zhou opened this issue 1 month ago
A few other notes: this happens with fineweb_data and dclm.
I'm really confused; tests pass locally, and I was able to run a job to completion. Can you download /tmp/ray/session_latest/logs/?
I am using dlwh/fineweb_llama_txt, and it fails when the number of files is very high (for fineweb). Also, I'm using a shuffle buffer of 100000. This is the error I'm getting:
These links might be helpful:
Maybe we can just spill to GCS or some other store instead of using the TPU hosts?
```
[2024-07-26 19:42:17,659 I 2062 2062] (raylet) local_object_manager.cc:245: :info_message:Spilled 1530 MiB, 1999 objects, write throughput 586 MiB/s.
[2024-07-26 19:42:17,664 I 2062 2062] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-07-26 19:42:17,764 I 2062 2062] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-07-26 19:42:17,783 I 2062 2062] (raylet) node_manager.cc:525: [state-dump] NodeManager:
[state-dump] Node ID: 896e4bb4773427857862284c837792ebc366061252134ea14bf51abe
[state-dump] Node name: 10.130.3.18
[state-dump] InitialConfigResources: {accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:__internal_head__: 10000, memory: 3816690937850000, node:10.130.3.18: 10000, TPU-v4-256-head: 10000, object_store_memory: 326417475580000, TPU: 40000, CPU: 2400000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 896e4bb4773427857862284c837792ebc366061252134ea14bf51abe =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: 7369852251879098477 Local resources: {"total":{memory: [3816690937850000], object_store_memory: [326417475580000], node:10.130.3.18: [10000], node:__internal_head__: [10000], abhinavg-fw-txt-preempt-v4-256: [10000], accelerator_type:TPU-V4: [10000], TPU: [10000, 10000, 10000, 10000], TPU-v4-256-head: [10000], CPU: [2400000]}}, "available": {memory: [3816690937850000], object_store_memory: [80228517160000], node:10.130.3.18: [10000], node:__internal_head__: [10000], abhinavg-fw-txt-preempt-v4-256: [10000], accelerator_type:TPU-V4: [10000], TPU: [10000, 10000, 10000, 10000], TPU-v4-256-head: [10000], CPU: [1915000]}}, "labels":{"ray.io/node_id":"896e4bb4773427857862284c837792ebc366061252134ea14bf51abe",} is_draining: 0 is_idle: 0 Cluster resources: node id: 2657070934577218762{"total":{TPU: 40000, CPU: 2400000, object_store_memory: 326417475580000, abhinavg-fw-txt-preempt-v4-256: 10000, memory: 3925625952250000, accelerator_type:TPU-V4: 10000, node:10.130.3.34: 10000}}, "available": {accelerator_type:TPU-V4: 10000, memory: 3925625952250000, object_store_memory: 293687982380000, CPU: 1795000, TPU: 40000, node:10.130.3.34: 10000, abhinavg-fw-txt-preempt-v4-256: 10000}}, "labels":{"ray.io/node_id":"d6c1845c76582eee006857ae81b0ef895deabaaf0b3e2b122207d24e",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}node id: 305753936840358228{"total":{accelerator_type:TPU-V4: 10000, node:10.130.3.26: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, object_store_memory: 326417475580000, CPU: 2400000, TPU: 40000, memory: 3925650733050000}}, "available": {node:10.130.3.26: 10000, CPU: 2275000, TPU: 40000, object_store_memory: 97949744960000, memory: 3925650733050000, accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000}}, "labels":{"ray.io/node_id":"3eec0bb7d611e7aad2861a1efdf26a37a9c3d6e960425f200761470d",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}node id: 1294823379791528053{"total":{TPU: 40000, object_store_memory: 326417475580000, node:10.130.2.207: 10000, memory: 3925623003130000, abhinavg-fw-txt-preempt-v4-256: 10000, accelerator_type:TPU-V4: 10000, CPU: 2400000}}, "available": {memory: 3925623003130000, accelerator_type:TPU-V4: 10000, object_store_memory: 298065983560000, node:10.130.2.207: 10000, TPU: 40000, CPU: 1795000, abhinavg-fw-txt-preempt-v4-256: 10000}}, "labels":{"ray.io/node_id":"2f831786398540a40003d66664592345bd68e714da3e718e1cef1df5",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}node id: 8300664650656660378{"total":{node:10.130.3.30: 10000, memory: 3925644548090000, abhinavg-fw-txt-preempt-v4-256: 10000, object_store_memory: 326417475580000, CPU: 2400000, TPU: 40000, accelerator_type:TPU-V4: 10000}}, "available": {CPU: 2275000, TPU: 40000, memory: 3925644548090000, accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:10.130.3.30: 10000, object_store_memory: 78596266190000}}, "labels":{"ray.io/node_id":"6389a432307b5cb2dd0deca0dbac5b6695695486218e47b86d678dce",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}node id: 7369852251879098477{"total":{memory: 3816690937850000, node:10.130.3.18: 10000, object_store_memory: 326417475580000, TPU-v4-256-head: 10000, CPU: 2400000, TPU: 40000, accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:__internal_head__: 10000}}, "available": {object_store_memory: 80228517160000, TPU-v4-256-head: 10000, CPU: 1915000, node:10.130.3.18: 10000, accelerator_type:TPU-V4: 10000, TPU: 40000, abhinavg-fw-txt-preempt-v4-256: 10000, memory: 3816690937850000, node:__internal_head__: 10000}}, "labels":{"ray.io/node_id":"896e4bb4773427857862284c837792ebc366061252134ea14bf51abe",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}node id: 6093590469938625960{"total":{memory: 3925636888570000, node:10.130.3.42: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, object_store_memory: 326417475580000, CPU: 2400000, TPU: 40000, accelerator_type:TPU-V4: 10000}}, "available": {accelerator_type:TPU-V4: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, node:10.130.3.42: 10000, TPU: 40000, memory: 3925636888570000, object_store_memory: 192151657400000, CPU: 2035000}}, "labels":{"ray.io/node_id":"d890a7e9a2827c5a73f42f3cbf5c463c528912f9a00d7b2b2790dcda",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}node id: -5257004634468912003{"total":{TPU: 40000, object_store_memory: 326417475580000, node:10.130.3.46: 10000, abhinavg-fw-txt-preempt-v4-256: 10000, memory: 3925643810810000, accelerator_type:TPU-V4: 10000, CPU: 2400000}}, "available": {TPU: 40000, CPU: 2395000, node:10.130.3.46: 10000, abhinavg-fw-txt-preempt-v4-256: 1000
...skipping...
[2024-07-26 19:42:19,652 E 2062 2062] (raylet) local_object_manager.cc:243: :info_message:Spilled 7641 MiB, 9998 objects, write throughput 1660 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
```
Sure, spilling to GCS sounds good. Want to try that?
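For reference, Ray's object spilling can be pointed at external storage via its smart_open backend (requires the smart_open package with the relevant cloud extras). A minimal sketch of the config; the bucket path is a hypothetical example, not one from this cluster:

```python
import json

# Ray spills Plasma objects to local disk by default. The
# object_spilling_config system setting can redirect spills to remote
# storage through smart_open. The bucket path below is hypothetical.
spilling_config = json.dumps({
    "type": "smart_open",
    "params": {"uri": "gs://my-spill-bucket/ray-spill"},
})

# This would be passed when starting Ray, e.g.:
# import ray
# ray.init(_system_config={
#     "max_io_workers": 4,
#     "object_spilling_config": spilling_config,
# })
print(spilling_config)
```

Note that spilling to GCS trades the TPU hosts' disk pressure for network I/O, so spill throughput will be bounded by bucket bandwidth.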
Kiloshard probably would have fixed this, actually. (Really, we just need better back pressure.)
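One generic way to get that back pressure is to cap the number of in-flight tasks so producers can't outrun consumers and fill the object store. A minimal thread-pool sketch (not levanter's actual implementation; with Ray tasks the same pattern is usually written with ray.wait()):

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

MAX_IN_FLIGHT = 4  # hypothetical cap on concurrent shard-processing tasks

def process(shard: int) -> int:
    # Stand-in for processing one shard (e.g. tokenizing it).
    return shard * 2

results = []
in_flight = set()
with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
    for shard in range(20):
        if len(in_flight) >= MAX_IN_FLIGHT:
            # Block until at least one task finishes before submitting more;
            # this is the back-pressure step.
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            results.extend(f.result() for f in done)
        in_flight.add(pool.submit(process, shard))
    # Drain the remaining in-flight tasks.
    done, _ = wait(in_flight)
    results.extend(f.result() for f in done)

print(sorted(results))
```

With a cap like this, peak memory is bounded by MAX_IN_FLIGHT results rather than by the full shard count, which is what blows up when the number of files is very high.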
tests/test_tokenized_document_cache.py fails at test_doc_cache_reproduces_data_one_batch_per_shard. Below is the log from the unit test: