pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Gemma finetuning on Kaggle TPU doesn't work #6610

Open windmaple opened 9 months ago

windmaple commented 9 months ago

🐛 Bug

Not sure if this is a feature request or a bug. I took the SPMD Gemma fine-tuning code from Hugging Face and tried to run it on Kaggle; it didn't work.

trl seems to have an issue there.

To Reproduce

See my Kaggle notebook.

Expected behavior

Ideally it should run.

Environment

Stock Kaggle env.

Additional context

windmaple commented 9 months ago

OK, seems that code is for Cloud TPU only as mentioned in this HF blog. Then this is a feature request.

JackCaoG commented 9 months ago

@alanwaketan

IsNoobgrammer commented 9 months ago

Kaggle is using an older version of torch-xla in which torch_xla.distributed.spmd was not implemented, so I would recommend upgrading torch-xla:

!pip install torch~=2.2.0 torch_xla[tpu]~=2.2.0 -f https://storage.googleapis.com/libtpu-releases/index.html

alanwaketan commented 9 months ago

@windmaple You need to install the nightly torch-xla and torch.

windmaple commented 9 months ago

Kaggle VM just silently dies after upgrading torch and torch-xla

IsNoobgrammer commented 9 months ago

For the Kaggle VM silently dying after upgrading torch and torch-xla, try uninstalling TensorFlow first:

!pip uninstall -y tensorflow
!pip install tensorflow-cpu #optional

windmaple commented 9 months ago

It helped me get a little further with 2.2.0, but I still get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[9], line 42
     34 fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": [
     35         "GemmaDecoderLayer"
     36     ],
     37     "xla": True,
     38     "xla_fsdp_v2": True,
     39     "xla_fsdp_grad_ckpt": True}
     41 # Finally, set up the trainer and train the model.
---> 42 trainer = SFTTrainer(
     43     model=model,
     44     train_dataset=data,
     45     args=TrainingArguments(
     46         per_device_train_batch_size=64,  # This is actually the global batch size for SPMD.
     47         num_train_epochs=100,
     48         max_steps=-1,
     49         output_dir="./output",
     50         optim="adafactor",
     51         logging_steps=1,
     52         dataloader_drop_last = True,  # Required for SPMD.
     53         fsdp="full_shard",
     54         fsdp_config=fsdp_config,
     55     ),
     56     peft_config=lora_config,
     57     dataset_text_field="quote",
     58     max_seq_length=max_seq_length,
     59     packing=True,
     60 )
     62 trainer.train()

File /usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:299, in SFTTrainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics, peft_config, dataset_text_field, packing, formatting_func, max_seq_length, infinite, num_of_sequences, chars_per_token, dataset_num_proc, dataset_batch_size, neftune_noise_alpha, model_init_kwargs, dataset_kwargs)
    293 if tokenizer.padding_side is not None and tokenizer.padding_side != "right":
    294     warnings.warn(
    295         "You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to "
    296         "overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code."
    297     )
--> 299 super().__init__(
    300     model=model,
    301     args=args,
    302     data_collator=data_collator,
    303     train_dataset=train_dataset,
    304     eval_dataset=eval_dataset,
    305     tokenizer=tokenizer,
    306     model_init=model_init,
    307     compute_metrics=compute_metrics,
    308     callbacks=callbacks,
    309     optimizers=optimizers,
    310     preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    311 )
    313 # Add tags for models that have been loaded with the correct transformers version
    314 if hasattr(self.model, "add_model_tags"):

File /usr/local/lib/python3.10/site-packages/transformers/trainer.py:653, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
    649 if self.is_fsdp_xla_v2_enabled:
    650     # Prepare the SPMD mesh that is going to be used by the data loader and the FSDPv2 wrapper.
    651     # Tensor axis is just a placeholder where it will not be used in FSDPv2.
    652     num_devices = xr.global_runtime_device_count()
--> 653     xs.set_global_mesh(xs.Mesh(np.array(range(num_devices)), (num_devices, 1), axis_names=("fsdp", "tensor")))

AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'set_global_mesh'

What's the right way to install nightly? I searched around but couldn't find it.
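
One way to tell whether an installed build is new enough is to probe for the attribute named in the AttributeError before constructing the trainer. The following is only a sketch: `has_attribute` is a hypothetical helper name, it relies on the standard library alone, and it reports `None` instead of raising when the module cannot be imported.

```python
import importlib
from typing import Optional


def has_attribute(module_name: str, attribute: str) -> Optional[bool]:
    """Return True/False if the module imports, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers both "not installed" and "installed but fails to load".
        return None
    return hasattr(module, attribute)


if __name__ == "__main__":
    # Module and attribute names are taken from the AttributeError above.
    available = has_attribute("torch_xla.distributed.spmd", "set_global_mesh")
    print("set_global_mesh available:", available)
```

A result of `False` would point to a torch_xla build that predates `set_global_mesh`; `None` would mean torch_xla itself did not import.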

alanwaketan commented 9 months ago

@windmaple Here are the instructions to install nightly: https://github.com/pytorch/xla#available-docker-images-and-wheels

PawKanarek commented 9 months ago

I had the same problem as @windmaple:

AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'set_global_mesh'

As @alanwaketan suggested, I installed the nightly build of torch_xla in a fresh conda env with the following packages:

conda create -n v_xla python=3.10
conda activate v_xla
pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly-cp310-cp310-linux_x86_64.whl
pip install datasets peft transformers trl
python train.py

Where train.py is this script https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py

Running this script results in the following error:

Traceback (most recent call last):
  File "/home/me/finetune/train.py", line 5, in <module>
    import torch_xla
  File "/home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch_xla/__init__.py", line 7, in <module>
    import _XLAC
ImportError: /home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl3cow23materialize_cow_storageERNS_11StorageImplE

I am looking for workarounds.
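
The missing symbol is easier to read once demangled. As a rough illustration (a partial sketch, not a full demangler; `qualified_name` is a hypothetical helper), the nested-name prefix of an Itanium C++ ABI symbol is a run of length-prefixed identifiers that can be decoded by hand:

```python
def qualified_name(mangled: str) -> str:
    """Decode the leading length-prefixed identifiers of a `_ZN...` symbol.

    Only the qualified name is recovered; argument types, templates and
    qualifiers that follow it are ignored.
    """
    if not mangled.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    pos = 3
    parts = []
    while pos < len(mangled) and mangled[pos].isdigit():
        # Read the decimal length prefix.
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        # Read the identifier of that length.
        parts.append(mangled[end:end + length])
        pos = end + length
    return "::".join(parts)


if __name__ == "__main__":
    symbol = "_ZN3c104impl3cow23materialize_cow_storageERNS_11StorageImplE"
    print(qualified_name(symbol))  # c10::impl::cow::materialize_cow_storage
```

`c10` is PyTorch's core library, so an undefined `c10::...` symbol suggests that the torch_xla wheel was built against a different `torch` than the one installed. The `c++filt` tool gives the full demangled signature, including argument types.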

windmaple commented 9 months ago

@PawKanarek I'm stuck here too.

PawKanarek commented 9 months ago

To resolve this error:

ImportError: /home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl3cow23materialize_cow_storageERNS_11StorageImplE

you have to update PyTorch to nightly as well:

conda install pytorch-nightly::pytorch

But after this I got a new problem:

File "/home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch_xla/runtime.py", line 124, in xla_device
    return torch.device(torch_xla._XLAC._xla_get_default_device())
RuntimeError: Bad StatusOr access: INTERNAL: Failed to get global TPU topology.

I found similar issues: https://github.com/google/gemma_pytorch/issues/25, https://github.com/Lightning-AI/pytorch-lightning/issues/18932

alanwaketan commented 9 months ago

@PawKanarek What's your libtpu version?

alanwaketan commented 9 months ago

@windmaple Yea, usually you just need nightly for both pytorch and pytorch/xla. pytorch/xla heavily depends on pytorch.
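
A small check along these lines can catch a mismatched pair before anything is imported. This is a sketch under assumptions: the helper names are hypothetical, the distribution names `torch` and `torch_xla` are assumed to be what pip recorded, and comparing major.minor is only a necessary condition (for nightlies the build dates should also be close).

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.2.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_release(torch_version: str, xla_version: str) -> bool:
    """True when both versions share the same major.minor release."""
    return major_minor(torch_version) == major_minor(xla_version)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    torch_v = installed_version("torch")      # assumed distribution name
    xla_v = installed_version("torch_xla")    # assumed distribution name
    print("torch:", torch_v, "torch_xla:", xla_v)
    if torch_v and xla_v:
        print("same release:", same_release(torch_v, xla_v))
```

Reading the metadata avoids importing either package, so it still works when `import torch_xla` itself fails.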

PawKanarek commented 9 months ago

@alanwaketan I think that my libtpu version is tpu-vm-pt-2.0. This is based on the command that I used to create my TPU v4-8:

gcloud compute tpus tpu-vm create my-tpu-name --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-vm-pt-2.0

Oh, I see in the documentation https://cloud.google.com/tpu/docs/supported-tpu-configurations#tpu_v4 that I should use tpu-vm-v4-pt-2.0. Thanks for the insight. ;)

alanwaketan commented 9 months ago

@PawKanarek libtpu is a pip pkg, you can grep it from pip list.

The latest version is:

pip list | grep libtpu
libtpu-nightly           0.1.dev20240213

If yours is older than this, you can update it via:

pip install torch-xla[tpuvm]
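
To compare an installed nightly against the one quoted above without eyeballing the string, the date can be parsed out of the `.devYYYYMMDD` suffix. A minimal sketch, assuming the version string has the same shape as in the `pip list` output; `nightly_build_date` is a hypothetical helper name.

```python
import re
from datetime import date
from typing import Optional


def nightly_build_date(version: str) -> Optional[date]:
    """Parse the build date from a version ending in '.devYYYYMMDD'."""
    match = re.search(r"\.dev(\d{4})(\d{2})(\d{2})$", version)
    if match is None:
        return None  # not a dated dev build
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


if __name__ == "__main__":
    # Version string copied from the pip list output above.
    reference = nightly_build_date("0.1.dev20240213")
    print(reference)  # 2024-02-13
```

Two parsed dates compare directly with `<`, so an installed build can be checked against the reference date.
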
PawKanarek commented 9 months ago

I've installed this package

libtpu-nightly           0.1.dev20240213

and I still get the same error:

  File "/home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch_xla/runtime.py", line 124, in xla_device
    return torch.device(torch_xla._XLAC._xla_get_default_device())
RuntimeError: Bad StatusOr access: INTERNAL: Failed to get global TPU topology.

alanwaketan commented 9 months ago

@PawKanarek Could be a hardware issue then... Can you try creating a new TPU VM?

JackCaoG commented 9 months ago

tpu-vm-v4-pt-2.0 is a fairly old image. Do you mind following https://cloud.google.com/tpu/docs/run-calculation-pytorch and using the VM version tpu-ubuntu2204-base? If the framework and libtpu versions match and it still doesn't work, it is usually a hardware or driver issue.

PawKanarek commented 9 months ago

I created a new machine with the command

 gcloud compute tpus tpu-vm create my-name --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-ubuntu2204-base

installed all the required packages, and now when I try to run this script https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py I get this error:

 (v_xla) me@tpu-1:~/finetune$ python train.py 
Aborted (core dumped)

I will look for more specific errors :)
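
When a process dies with only "Aborted (core dumped)", Python's standard `faulthandler` module can at least show which Python line was executing when the fatal signal arrived. A minimal sketch, assuming it is placed at the very top of train.py before the torch and torch_xla imports; `enable_crash_tracebacks` is a hypothetical helper name.

```python
import faulthandler
import sys


def enable_crash_tracebacks() -> bool:
    """Ask Python to dump all thread tracebacks on fatal signals (e.g. SIGABRT)."""
    try:
        faulthandler.enable(file=sys.stderr, all_threads=True)
    except (AttributeError, ValueError, OSError):
        # sys.stderr may be replaced by an object without a real file descriptor;
        # fall back to the interpreter's original stderr.
        faulthandler.enable(file=sys.__stderr__, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    print("faulthandler enabled:", enable_crash_tracebacks())
    # ... the rest of train.py would follow here.
```

The same effect is available without editing the script via `python -X faulthandler train.py`. It only reports the Python-level stack, so a native backtrace from `gdb` may still be needed.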

PawKanarek commented 9 months ago

This might be irrelevant.

I managed to read the core dump file with the `gdb` tool, but sadly I cannot find any specific errors. That's what the `gdb` tool is showing me:

- `bt`: Display the stack trace of the current thread

```
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140269997869056, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007f9327042476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007f93270287f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007f932765c38a in _Unwind_Resume (exc=0x5e5c200) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind.inc:245
#6 0x00007f93270298d5 in __pthread_cleanup_combined_routine (__frame=) at ../sysdeps/nptl/pthreadP.h:609
#7 __pthread_once_slow (once_control=, init_routine=0x7f9326cdac90 ) at ./nptl/pthread_once.c:114
#8 0x0000000000000000 in ?? ()
```

- `bt full`: Display the full stack trace

```
(gdb) bt full
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:44
        tid =
        ret = 0
        pd = 0x7f9327654800
        old_mask = {__val = {18446744073709551615, 140724683535936, 18446744073709551615, 18446744073709551615, 0, 10641313998539494912, 0, 140269997957756, 140269994049648, 140724683541272, 0, 0, 0, 0, 0, 0}}
        ret =
        pd =
        old_mask =
        ret =
        tid =
        ret =
        resultvar =
        resultvar =
        __arg3 =
        __arg2 =
        __arg1 =
        _a3 =
        _a2 =
        _a1 =
        __futex =
        resultvar =
        __arg3 =
        __arg2 =
        __arg1 =
        _a3 =
        _a2 =
        _a1 =
        __futex =
        __private =
        __oldval =
        result =
#1 __pthread_kill_internal (signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:78
No locals.
#2 __GI___pthread_kill (threadid=140269997869056, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
No locals.
#3 0x00007f9327042476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret =
#4 0x00007f93270287f3 in __GI_abort () at ./stdlib/abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 , 130843}}, sa_flags = 651013264, sa_restorer = 0x7ffd04c60d40}
        sigs = {__val = {32, 0 }}
#5 0x00007f932765c38a in _Unwind_Resume (exc=0x5e5c200) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libgcc/unwind.inc:245
        this_context = {reg = {0x7ffd04c60d08, 0x7ffd04c60d10, 0x0, 0x7ffd04c60d18, 0x0, 0x0, 0x7ffd04c60d40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7ffd04c60d20,
--Type for more, q to quit, c to continue without paging--c
        0x7ffd04c60d28, 0x7ffd04c60d30, 0x7ffd04c60d38, 0x7ffd04c60d48, 0x0}, cfa = 0x7ffd04c60d50, ra = 0x7f93270298d5 , lsda = 0x0, bases = {tbase = 0x0, dbase = 0x0, func = 0x7f932766aaf0 <_Unwind_Resume>}, flags = 4611686018427387904, version = 0, args_size = 0, by_value = '\000' }
        cur_context = {reg = {0x7ffd04c60d08, 0x7ffd04c60d10, 0x0, 0x7ffd04c60d90, 0x0, 0x0, 0x7ffd04c60d98, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7ffd04c60da0, 0x7ffd04c60d28, 0x7ffd04c60d30, 0x7ffd04c60d38, 0x7ffd04c60da8, 0x0}, cfa = 0x7ffd04c60db0, ra = 0x0, lsda = 0x0, bases = {tbase = 0x0, dbase = 0x0, func = 0x7f93270298ac <__pthread_once_slow.cold>}, flags = 4611686018427387904, version = 0, args_size = 0, by_value = '\000' }
        code =
        frames = 140724683541616
#6 0x00007f93270298d5 in __pthread_cleanup_combined_routine (__frame=) at ../sysdeps/nptl/pthreadP.h:609
No locals.
#7 __pthread_once_slow (once_control=, init_routine=0x7f9326cdac90 ) at ./nptl/pthread_once.c:114
        __cancel_routine = 0x7f9327099f40
        __clframe = {__cancel_routine = 0x7f9327099f40 , __cancel_arg = 0x7f926e5a6be8 , __do_it = 0, __buffer = {__routine = 0x0, __arg = 0x0, __canceltype = 0, __prev = 0x0}}
        val =
        newval =
#8 0x0000000000000000 in ?? ()
No symbol table info available.
```

- `info threads`: List all threads.
```
(gdb) info threads
  Id Target Id Frame
* 1 Thread 0x7f9327654800 (LWP 80077) __pthread_kill_implementation (no_tid=0, signo=6, threadid=140269997869056) at ./nptl/pthread_kill.c:44
  2 Thread 0x7f930dbfe640 (LWP 80079) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcae60 ) at ./nptl/futex-internal.c:57
  3 Thread 0x7f93093fd640 (LWP 80080) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcaee0 ) at ./nptl/futex-internal.c:57
  4 Thread 0x7f927cbc4640 (LWP 80137) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dccb60 ) at ./nptl/futex-internal.c:57
  5 Thread 0x7f930e3ff640 (LWP 80078) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcade0 ) at ./nptl/futex-internal.c:57
  6 Thread 0x7f93013f9640 (LWP 80084) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb0e0 ) at ./nptl/futex-internal.c:57
  7 Thread 0x7f92fc3f7640 (LWP 80086) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb1e0 ) at ./nptl/futex-internal.c:57
  8 Thread 0x7f92f9bf6640 (LWP 80087) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb260 ) at ./nptl/futex-internal.c:57
  9 Thread 0x7f92f4bf4640 (LWP 80089) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb360 ) at ./nptl/futex-internal.c:57
  10 Thread 0x7f92753c1640 (LWP 80140) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dccce0 ) at ./nptl/futex-internal.c:57
  11 Thread 0x7f92efbf2640 (LWP 80091) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb460 ) at ./nptl/futex-internal.c:57
  12 Thread 0x7f92febf8640 (LWP 80085) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb160 ) at ./nptl/futex-internal.c:57
  13 Thread 0x7f92e5bee640 (LWP 80095) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb660 ) at ./nptl/futex-internal.c:57
  14 Thread 0x7f92e0bec640 (LWP 80097) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb760 ) at ./nptl/futex-internal.c:57
  15 Thread 0x7f92dbbea640 (LWP 80099) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb860 ) at ./nptl/futex-internal.c:57
  16 Thread 0x7f92d6be8640 (LWP 80101) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb960 ) at ./nptl/futex-internal.c:57
  17 Thread 0x7f92d1be6640 (LWP 80103) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcba60 ) at ./nptl/futex-internal.c:57
  18 Thread 0x7f92ccbe4640 (LWP 80105) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbb60 ) at ./nptl/futex-internal.c:57
  19 Thread 0x7f92cf3e5640 (LWP 80104) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbae0 ) at ./nptl/futex-internal.c:57
  20 Thread 0x7f92ca3e3640 (LWP 80106) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbbe0 ) at ./nptl/futex-internal.c:57
  21 Thread 0x7f92c7be2640 (LWP 80107) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbc60 ) at ./nptl/futex-internal.c:57
  22 Thread 0x7f92c2be0640 (LWP 80109) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbd60 ) at ./nptl/futex-internal.c:57
  23 Thread 0x7f92c03df640 (LWP 80110) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
--Type for more, q to quit, c to continue without paging--c
    futex_word=0x7f9310dcbde0 ) at ./nptl/futex-internal.c:57
  24 Thread 0x7f92b8bdc640 (LWP 80113) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbf60 ) at ./nptl/futex-internal.c:57
  25 Thread 0x7f92bdbde640 (LWP 80111) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbe60 ) at ./nptl/futex-internal.c:57
  26 Thread 0x7f92bb3dd640 (LWP 80112) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbee0 ) at ./nptl/futex-internal.c:57
  27 Thread 0x7f92b63db640 (LWP 80114) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbfe0 ) at ./nptl/futex-internal.c:57
  28 Thread 0x7f92b3bda640 (LWP 80115) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc060 ) at ./nptl/futex-internal.c:57
  29 Thread 0x7f92ac3d7640 (LWP 80118) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc1e0 ) at ./nptl/futex-internal.c:57
  30 Thread 0x7f92a73d5640 (LWP 80120) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc2e0 ) at ./nptl/futex-internal.c:57
  31 Thread 0x7f92a4bd4640 (LWP 80121) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc360 ) at ./nptl/futex-internal.c:57
  32 Thread 0x7f92a9bd6640 (LWP 80119) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc260 ) at ./nptl/futex-internal.c:57
  33 Thread 0x7f929fbd2640 (LWP 80123) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc460 ) at ./nptl/futex-internal.c:57
  34 Thread 0x7f92aebd8640 (LWP 80117) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc160 ) at ./nptl/futex-internal.c:57
  35 Thread 0x7f92a23d3640 (LWP 80122) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc3e0 ) at ./nptl/futex-internal.c:57
  36 Thread 0x7f929abd0640 (LWP 80125) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc560 ) at ./nptl/futex-internal.c:57
  37 Thread 0x7f92983cf640 (LWP 80126) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc5e0 ) at ./nptl/futex-internal.c:57
  38 Thread 0x7f929d3d1640 (LWP 80124) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc4e0 ) at ./nptl/futex-internal.c:57
  39 Thread 0x7f9295bce640 (LWP 80127) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc660 ) at ./nptl/futex-internal.c:57
  40 Thread 0x7f92933cd640 (LWP 80128) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc6e0 ) at ./nptl/futex-internal.c:57
  41 Thread 0x7f9290bcc640 (LWP 80129) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc760 ) at ./nptl/futex-internal.c:57
  42 Thread 0x7f928e3cb640 (LWP 80130) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc7e0 ) at ./nptl/futex-internal.c:57
  43 Thread 0x7f928bbca640 (LWP 80131) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc860 ) at ./nptl/futex-internal.c:57
  44 Thread 0x7f9286bc8640 (LWP 80133) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc960 ) at ./nptl/futex-internal.c:57
  45 Thread 0x7f92843c7640 (LWP 80134) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc9e0 ) at ./nptl/futex-internal.c:57
  46 Thread 0x7f9281bc6640 (LWP 80135) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcca60 ) at ./nptl/futex-internal.c:57
  47 Thread 0x7f927a3c3640 (LWP 80138) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dccbe0 ) at ./nptl/futex-internal.c:57
  48 Thread 0x7f92893c9640 (LWP 80132) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc8e0 ) at ./nptl/futex-internal.c:57
  49 Thread 0x7f927f3c5640 (LWP 80136) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dccae0 ) at ./nptl/futex-internal.c:57
  50 Thread 0x7f9308bfc640 (LWP 80081) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcaf60 ) at ./nptl/futex-internal.c:57
  51 Thread 0x7f93063fb640 (LWP 80082) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcafe0 ) at ./nptl/futex-internal.c:57
  52 Thread 0x7f9277bc2640 (LWP 80139) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dccc60 ) at ./nptl/futex-internal.c:57
  53 Thread 0x7f9301bfa640 (LWP 80083) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb060 ) at ./nptl/futex-internal.c:57
  54 Thread 0x7f92f23f3640 (LWP 80090) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb3e0 ) at ./nptl/futex-internal.c:57
  55 Thread 0x7f92f73f5640 (LWP 80088) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb2e0 ) at ./nptl/futex-internal.c:57
  56 Thread 0x7f92ed3f1640 (LWP 80092) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb4e0 ) at ./nptl/futex-internal.c:57
  57 Thread 0x7f92eabf0640 (LWP 80093) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb560 ) at ./nptl/futex-internal.c:57
  58 Thread 0x7f92e33ed640 (LWP 80096) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb6e0 ) at ./nptl/futex-internal.c:57
  59 Thread 0x7f92e83ef640 (LWP 80094) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb5e0 ) at ./nptl/futex-internal.c:57
  60 Thread 0x7f92de3eb640 (LWP 80098) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb7e0 ) at ./nptl/futex-internal.c:57
  61 Thread 0x7f92d93e9640 (LWP 80100) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb8e0 ) at ./nptl/futex-internal.c:57
  62 Thread 0x7f92d43e7640 (LWP 80102) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcb9e0 ) at ./nptl/futex-internal.c:57
  63 Thread 0x7f92c53e1640 (LWP 80108) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcbce0 ) at ./nptl/futex-internal.c:57
  64 Thread 0x7f92b13d9640 (LWP 80116) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f9310dcc0e0 ) at ./nptl/futex-internal.c:57
```

- `list`: Show the source code (if available) around the current line.
```
(gdb) list
39 in ./nptl/pthread_kill.c
```

- `info sharedlibrary`: List shared libraries loaded by the program at the time of the crash.

```
(gdb) info sharedlibrary
From To Syms Read Shared Object Library
0x00007f9327413e00 0x00007f93274353c3 Yes (*) /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
0x00007f93277b2040 0x00007f93277b2105 Yes /lib/x86_64-linux-gnu/libpthread.so.0
0x00007f93277ad040 0x00007f93277ad105 Yes /lib/x86_64-linux-gnu/libdl.so.2
0x00007f93277a8040 0x00007f93277a8105 Yes /lib/x86_64-linux-gnu/libutil.so.1
0x00007f93276ce3a0 0x00007f93277498c8 Yes /lib/x86_64-linux-gnu/libm.so.6
0x00007f9327028700 0x00007f93271ba93d Yes /lib/x86_64-linux-gnu/libc.so.6
0x00007f93276a5280 0x00007f93276ae5bf Yes (*) /lib/x86_64-linux-gnu/libunwind.so.8
0x00007f9326ca5150 0x00007f9326d95b31 Yes /home/raix/miniconda3/envs/v_xla/bin/../lib/libstdc++.so.6
0x00007f93277c0090 0x00007f93277e9315 Yes /lib64/ld-linux-x86-64.so.2
0x00007f9327677050 0x00007f9327693c51 Yes (*) /home/raix/miniconda3/envs/v_xla/bin/../lib/liblzma.so.5
0x00007f932765c320 0x00007f932766d6e1 Yes /home/raix/miniconda3/envs/v_xla/bin/../lib/libgcc_s.so.1
0x00007f932763c050 0x00007f9327643411 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/math.cpython-310-x86_64-linux-gnu.so
0x00007f9327632050 0x00007f9327633081 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/fcntl.cpython-310-x86_64-linux-gnu.so
0x00007f932762b050 0x00007f932762cf71 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_posixsubprocess.cpython-310-x86_64-linux-gnu.so
0x00007f9327621050 0x00007f93276231c1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/select.cpython-310-x86_64-linux-gnu.so
0x00007f9327290050 0x00007f932729d7d1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so
0x00007f9327611000 0x00007f9327619791 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libffi.so.8
0x00007f932727e050 0x00007f9327282a01 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_struct.cpython-310-x86_64-linux-gnu.so
0x00007f93277b7050 0x00007f93277b7391 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_opcode.cpython-310-x86_64-linux-gnu.so
0x00007f9327604050 0x00007f9327607251 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/zlib.cpython-310-x86_64-linux-gnu.so
0x00007f932725f050 0x00007f9327270241 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libz.so.1
0x00007f9327256050 0x00007f9327257de1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_bz2.cpython-310-x86_64-linux-gnu.so
0x00007f9327242050 0x00007f932724f431 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libbz2.so.1.0
0x00007f9327237050 0x00007f932723a8f1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_lzma.cpython-310-x86_64-linux-gnu.so
0x00007f932722f050 0x00007f9327230031 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_bisect.cpython-310-x86_64-linux-gnu.so
0x00007f9326efb050 0x00007f9326efcbb1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_random.cpython-310-x86_64-linux-gnu.so
0x00007f9326ef1050 0x00007f9326ef5bf1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_sha512.cpython-310-x86_64-linux-gnu.so
0x00007f9326eeb050 0x00007f9326eeb105 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch_global_deps.so
0x00007f9325458390 0x00007f9325f531c0 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libmkl_intel_lp64.so
0x00007f9323602bf0 0x00007f9324da8feb Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libmkl_gnu_thread.so
0x00007f931f01ab00 0x00007f9322713b80 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libmkl_core.so
0x00007f9326eb1730 0x00007f9326edbec1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/../../../../libgomp.so.1
0x00007f9326ea2050 0x00007f9326ea2115 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so
0x00007f931de21b40 0x00007f931e9432b8 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch_python.so
0x00007f9326e9a440 0x00007f9326e9c5d3 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libshm.so
0x00007f9326e81890 0x00007f9326e8dc90 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch.so
0x00007f93125311c0 0x00007f931b914530 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
0x00007f9326523270 0x00007f93265accb4 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch/lib/libc10.so
0x00007f9326e66080 0x00007f9326e66275 Yes /lib/x86_64-linux-gnu/librt.so.1
0x00007f931102da70 0x00007f9311508663 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
0x00007f930ef18000 0x00007f9310b98c5c Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
--Type for more, q to quit, c to continue without paging--c
0x00007f930e81b870 0x00007f930ea46837 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/../../numpy.libs/libgfortran-040039e1.so.5.0.0
0x00007f930e4023e0 0x00007f930e425d2b Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/../../numpy.libs/libquadmath-96973f99.so.0.0.0
0x00007f9326e4b050 0x00007f9326e5b571 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_datetime.cpython-310-x86_64-linux-gnu.so
0x00007f9326e2a050 0x00007f9326e3bf61 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so
0x00007f9326e6c050 0x00007f9326e6c211 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_contextvars.cpython-310-x86_64-linux-gnu.so
0x00007f93250e0e70 0x00007f93250f6299 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/core/_multiarray_tests.cpython-310-x86_64-linux-gnu.so
0x00007f93250aec20 0x00007f93250cdda2 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/linalg/_umath_linalg.cpython-310-x86_64-linux-gnu.so
0x00007f9325091170 0x00007f93250a3adf Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/fft/_pocketfft_internal.cpython-310-x86_64-linux-gnu.so
0x00007f930ed528f0 0x00007f930edaafd4 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/mtrand.cpython-310-x86_64-linux-gnu.so
0x00007f931edcf8f0 0x00007f931edf12cf Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/bit_generator.cpython-310-x86_64-linux-gnu.so
0x00007f931ed90830 0x00007f931edc12ab Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_common.cpython-310-x86_64-linux-gnu.so
0x00007f9326e1b050 0x00007f9326e1eff1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/binascii.cpython-310-x86_64-linux-gnu.so
0x00007f9326af3050 0x00007f9326af8101 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_hashlib.cpython-310-x86_64-linux-gnu.so
0x00007f92706b9000 0x00007f927090412f Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libcrypto.so.3
0x00007f9325084050 0x00007f932508b551 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_blake2.cpython-310-x86_64-linux-gnu.so
0x00007f931dbaf840 0x00007f931dbf3fdf Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_bounded_integers.cpython-310-x86_64-linux-gnu.so
0x00007f93232e85e0 0x00007f93232f86d2 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_mt19937.cpython-310-x86_64-linux-gnu.so
0x00007f931ed77610 0x00007f931ed8502c Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_philox.cpython-310-x86_64-linux-gnu.so
0x00007f931db92600 0x00007f931dba35eb Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_pcg64.cpython-310-x86_64-linux-gnu.so
0x00007f93232d7590 0x00007f93232df1ba Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_sfc64.cpython-310-x86_64-linux-gnu.so
0x00007f930e727c90 0x00007f930e7a6f67 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/numpy/random/_generator.cpython-310-x86_64-linux-gnu.so
0x00007f93116f8050 0x00007f93116faa21 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_heapq.cpython-310-x86_64-linux-gnu.so
0x00007f9326aeb050 0x00007f9326aeba21 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/grp.cpython-310-x86_64-linux-gnu.so
0x00007f93116ed050 0x00007f93116f2eb1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_json.cpython-310-x86_64-linux-gnu.so
0x00007f93116d9050 0x00007f93116e2f81 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/cmath.cpython-310-x86_64-linux-gnu.so
0x00007f9310eec050 0x00007f9310ef41c1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_socket.cpython-310-x86_64-linux-gnu.so
0x00007f9310ed9050 0x00007f9310edfa81 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/array.cpython-310-x86_64-linux-gnu.so
0x00007f931db8b050 0x00007f931db8be21 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_multiprocessing.cpython-310-x86_64-linux-gnu.so
0x00007f9326e15050 0x00007f9326e15231 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_uuid.cpython-310-x86_64-linux-gnu.so
0x00007f93116d0050 0x00007f93116d35c1 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libuuid.so.1
0x00007f9310eb3050 0x00007f9310ebc351 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_ssl.cpython-310-x86_64-linux-gnu.so
0x00007f930ecca050 0x00007f930ed1f993 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/../../libssl.so.3
0x00007f926f8f1050 0x00007f926f8f4461 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/unicodedata.cpython-310-x86_64-linux-gnu.so
0x00007f9310e9c050 0x00007f9310e9ca01 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_queue.cpython-310-x86_64-linux-gnu.so
0x00007f9310e8e050 0x00007f9310e92581 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so
0x00007f92637c5480 0x00007f926c0cf9ef Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
0x00007f925ee5c050 0x00007f925f066fc1 Yes /home/raix/miniconda3/envs/v_xla/lib/libpython3.10.so.1.0
0x00007f930e6e8040 0x00007f930e6fb97b Yes (*) /lib/x86_64-linux-gnu/libcrypt.so.1
0x00007f9310e85090 0x00007f9310e851b5 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/charset_normalizer/md.cpython-310-x86_64-linux-gnu.so
0x00007f930e6bb280 0x00007f930e6d5e05 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/charset_normalizer/md__mypyc.cpython-310-x86_64-linux-gnu.so
0x00007f930eca0050 0x00007f930eca56f1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_multibytecodec.cpython-310-x86_64-linux-gnu.so
0x00007f930e67d050 0x00007f930e6a36e1 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/yaml/_yaml.cpython-310-x86_64-linux-gnu.so
0x00007f930e65a040 0x00007f930e66ff45 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/yaml/../../../libyaml-0.so.2
0x00007f926e674050 0x00007f926e6c5421 Yes
/home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/gmpy2.cpython-310-x86_64-linux-gnu.so 0x00007f925e804a30 0x00007f925e814b53 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/../../../libmpc.so.3 0x00007f9270a4f040 0x00007f9270aaba57 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/../../../libmpfr.so.6 0x00007f925eb6d080 0x00007f925ebe4326 Yes (*) /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/gmpy2/../../../libgmp.so.10 0x00007f925f1ba050 0x00007f925f1ebbf1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_decimal.cpython-310-x86_64-linux-gnu.so 0x00007f930ec97050 0x00007f930ec97ef1 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/termios.cpython-310-x86_64-linux-gnu.so 0x00007f930e652050 0x00007f930e653f41 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_lsprof.cpython-310-x86_64-linux-gnu.so 0x00007f925d146090 0x00007f925d1d12ef Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/site-packages/safetensors/_safetensors_rust.cpython-310-x86_64-linux-gnu.so 0x00007f930e646050 0x00007f930e64a181 Yes /home/raix/miniconda3/envs/v_xla/lib/python3.10/lib-dynload/_csv.cpython-310-x86_64-linux-gnu.so (*): Shared library is missing debugging information. ```
PawKanarek commented 9 months ago

@JackCaoG I have now created the v4-8 machine with this VM version: tpu-vm-v4-pt-2.0

gcloud compute tpus tpu-vm create myname --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-vm-v4-pt-2.0

Now I'm getting a different message, but at least it's readable :)

python server/server.py 
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1710013726.247688   30296 pjrt_api.cc:100] GetPjrtApi was found for tpu at /home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/torch_xla/lib/libtpu.so
I0000 00:00:1710013726.247769   30296 pjrt_api.cc:79] PJRT_Api is set for device type tpu
I0000 00:00:1710013726.247774   30296 pjrt_api.cc:146] The PJRT plugin has PJRT API version 0.40. The framework PJRT API version is 0.40.
/home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/torch_xla/runtime.py:247: UserWarning: Replicating tensors already initialized on non-virtual XLA device for SPMD to force SPMD mode. This is one-time overhead to setup, and to minimize such, please set SPMD mode before initializting tensors (i.e., call use_spmd() in the beginning of the program).
  warnings.warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.07it/s]
/home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/transformers/training_args.py:1815: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
/home/me/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/transformers/training_args.py:1827: FutureWarning: `--push_to_hub_model_id` and `--push_to_hub_organization` are deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_model_id` instead and pass the full repo name to this argument (in this case google/gemma-2-it).
  warnings.warn(
https://symbolize.stripped_domain/r/?trace=7f8ddd4d4953,7f8ea111e3bf,7f8de5b4364d,7f8dde56762d,7f8de5b46273,7f8dddde807a,7f8dddbdb4ea,7f8e90515509&map= 
*** SIGSEGV (@0x1d8), see go/stacktraces#s15 received by PID 30296 (TID 31841) on cpu 195; stack trace: ***
PC: @     0x7f8ddd4d4953  (unknown)  torch_xla::runtime::PjRtComputationClient::ExecuteReplicated()::{lambda()#1}::operator()()
    @     0x7f8d6c18c6a7        928  (unknown)
    @     0x7f8ea111e3c0       1984  (unknown)
    @     0x7f8de5b4364e         32  std::_Function_handler<>::_M_invoke()
    @     0x7f8dde56762e        288  Eigen::ThreadPoolDevice::parallelFor()
    @     0x7f8de5b46274        576  tsl::thread::ThreadPool::ParallelFor()
    @     0x7f8dddde807b       1168  torch_xla::runtime::PjRtComputationClient::ExecuteReplicated()
    @     0x7f8dddbdb4eb        624  torch_xla::XLAGraphExecutor::ScheduleSyncTensorsGraph()::{lambda()#1}::operator()()
    @     0x7f8e9051550a  (unknown)  torch::lazy::MultiWait::Complete()
    @ ... and at least 1 more frames
https://symbolize.stripped_domain/r/?trace=7f8ddd4d4953,7f8d6c18c6a6,7f8ea111e3bf,7f8de5b4364d,7f8dde56762d,7f8de5b46273,7f8dddde807a,7f8dddbdb4ea,7f8e90515509&map= 
E0309 19:48:53.091365   31841 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E0309 19:48:53.091373   31841 coredump_hook.cc:481] RAW: Skipping coredump since rlimit was 0 at process start.
E0309 19:48:53.091379   31841 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0309 19:48:53.091381   31841 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E0309 19:48:53.091395   31841 coredump_hook.cc:546] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0309 19:48:53.091399   31841 coredump_hook.cc:598] RAW: Dumping core locally.
E0309 19:48:53.337414   31841 process_state.cc:807] RAW: Raising signal 11 with default behavior
Segmentation fault (core dumped)
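The UserWarning in the log above asks for SPMD mode to be set before any tensors are initialized, by calling use_spmd() at the beginning of the program. A minimal sketch of that ordering follows. The helper name `enable_spmd_first` is hypothetical, and the sketch assumes `use_spmd` is exposed by `torch_xla.runtime`, as the path in the warning suggests. It is not a diagnosis of the segfault itself.

```python
def enable_spmd_first():
    """Try to enable SPMD mode before any XLA tensor is created.

    Returns True if use_spmd() was called, False if torch_xla is not
    installed or is too old to expose use_spmd (hypothetical helper).
    """
    try:
        import torch_xla.runtime as xr
    except ImportError:
        return False  # torch_xla is not available in this environment
    if not hasattr(xr, "use_spmd"):
        return False  # older torch_xla without the SPMD entry point
    xr.use_spmd()  # must run before any tensor is placed on the XLA device
    return True


if __name__ == "__main__":
    # Call this first, then load the model and build the trainer.
    print("SPMD enabled early:", enable_spmd_first())
```

Calling the helper as the very first statement of the training script, ahead of model loading, would avoid the one-time replication overhead that the warning describes.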
JackCaoG commented 8 months ago

Can you follow https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#sanity-check to run a resnet with fakedata? I am not sure if it is a env setup issue or gemma issue in your case.

PawKanarek commented 8 months ago

Thanks for the advice. The sanity check looks good on this TPU.

Imports:

python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_xla
>>> print(torch.__version__)
2.3.0.dev20240309
>>> print(torch_xla.__version__)
2.3.0+git6043185

simple calculation

python3
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_xla.core.xla_model as xm
>>> t1 = torch.tensor(100, device=xm.xla_device())
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1710270930.199793  326792 pjrt_api.cc:100] GetPjrtApi was found for tpu at /home/raix/miniconda3/envs/tpu_v4/lib/python3.10/site-packages/torch_xla/lib/libtpu.so
I0000 00:00:1710270930.199885  326792 pjrt_api.cc:79] PJRT_Api is set for device type tpu
I0000 00:00:1710270930.199890  326792 pjrt_api.cc:146] The PJRT plugin has PJRT API version 0.40. The framework PJRT API version is 0.40.
>>> t2 = torch.tensor(200, device=xm.xla_device())
>>> print(t1 + t2)
tensor(300, device='xla:0')
>>> 

imagenet

Epoch 18 train end 20:13:54
| Test Device=xla:0/0 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/3 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/2 Step=0 Epoch=18 Time=20:13:54
| Test Device=xla:0/3 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/2 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/0 Step=20 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/0 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/3 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/2 Step=40 Epoch=18 Time=20:13:54
| Test Device=xla:0/1 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/3 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/0 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/2 Step=60 Epoch=18 Time=20:13:55
| Test Device=xla:0/1 Step=80 Epoch=18 Time=20:13:55
| Test Device=xla:0/2 Step=80 Epoch=18 Time=20:13:55
| Test Device=xla:0/0 Step=80 Epoch=18 Time=20:13:55
| Test Device=xla:0/3 Step=80 Epoch=18 Time=20:13:55
Epoch 18 test end 20:13:55, Accuracy=100.00
Max Accuracy: 100.00%
alanwaketan commented 8 months ago

@PawKanarek For Gemma, have you set the following env: PJRT_DEVICE=TPU XLA_USE_SPMD=1 ?

PawKanarek commented 8 months ago

It seems that setting export PJRT_DEVICE=TPU and export XLA_USE_SPMD=1 resolved the issue. I was certain I had exported the variables... The training now works though it occasionally crashes during training on larger datasets. But no problems on smaller datasets. Thanks!
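Since the fix here came down to two environment variables that were believed to be exported but were not, a small pre-flight check can catch this before training starts. The sketch below is only an illustration: `REQUIRED_ENV` and `missing_env` are hypothetical names, and the expected values are the ones quoted in this thread.

```python
import os

# Values taken from this thread; adjust them if your setup differs.
REQUIRED_ENV = {"PJRT_DEVICE": "TPU", "XLA_USE_SPMD": "1"}


def missing_env(environ=None):
    """Return the required variables that are unset or hold another value."""
    environ = os.environ if environ is None else environ
    return {
        name: expected
        for name, expected in REQUIRED_ENV.items()
        if environ.get(name) != expected
    }


if __name__ == "__main__":
    problems = missing_env()
    if problems:
        # Print the export lines the shell still needs.
        for name, expected in problems.items():
            print(f"export {name}={expected}")
    else:
        print("Environment looks as expected.")
```

Running the check from the same shell that launches the training script shows whether the exports actually reached the process.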

alanwaketan commented 8 months ago

> It seems that setting export PJRT_DEVICE=TPU and export XLA_USE_SPMD=1 resolved the issue. I was certain I had exported the variables... The training now works though it occasionally crashes during training on larger datasets. But no problems on smaller datasets. Thanks!

I would love to learn more about the crash as well! Do you mind opening a new bug?

alanwaketan commented 8 months ago

@windmaple @PawKanarek Are we good to close this issue?

PawKanarek commented 8 months ago

The problem with AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'set_global_mesh' was resolved on my machine.
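For anyone hitting the same AttributeError, a quick way to tell whether the installed torch-xla is new enough is to probe for the attribute before running the fine-tuning script. This is a minimal sketch; `has_attribute` is a hypothetical helper, and only the module and attribute names come from the error message above.

```python
import importlib


def has_attribute(module_name, attr):
    """Return True if module_name imports and exposes attr."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # module missing, e.g. torch_xla not installed
    return hasattr(module, attr)


if __name__ == "__main__":
    ok = has_attribute("torch_xla.distributed.spmd", "set_global_mesh")
    print("set_global_mesh available:", ok)
```

If the check reports that the attribute is unavailable, the installed build most likely predates it, which matches the earlier advice in this thread to upgrade torch and torch-xla.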