googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0

Requested backend tpu_driver, but it failed to initialize: DEADLINE_EXCEEDED using 0.1 drivers since 10/02/2023 #3405

Closed henk717 closed 2 months ago

henk717 commented 1 year ago

Describe the current behavior
When running an older version of JAX, the TPU receives the following error:

Traceback (most recent call last):
  File "aiserver.py", line 10214, in <module>
    load_model(initial_load=True)
  File "aiserver.py", line 2806, in load_model
    tpu_mtj_backend.load_model(vars.custmodpth, hf_checkpoint=vars.model not in ("TPUMeshTransformerGPTJ", "TPUMeshTransformerGPTNeoX") and vars.use_colab_tpu, **vars.modelconfig)
  File "/content/KoboldAI-Client/tpu_mtj_backend.py", line 1194, in load_model
    devices = np.array(jax.devices()[:cores_per_replica]).reshape(mesh_shape)
  File "/usr/local/lib/python3.8/dist-packages/jax/_src/lib/xla_bridge.py", line 314, in devices
    return get_backend(backend).devices()
  File "/usr/local/lib/python3.8/dist-packages/jax/_src/lib/xla_bridge.py", line 258, in get_backend
    return _get_backend_uncached(platform)
  File "/usr/local/lib/python3.8/dist-packages/jax/_src/lib/xla_bridge.py", line 248, in _get_backend_uncached
    raise RuntimeError(f"Requested backend {platform}, but it failed "
RuntimeError: Requested backend tpu_driver, but it failed to initialize: DEADLINE_EXCEEDED: Failed to connect to remote server at address: grpc://10.106.231.74:8470. Error from gRPC: Deadline Exceeded. Details:

This happens for all users of the notebook on Colab, while Kaggle is still working as intended.

Describe the expected behavior
JAX correctly connects to the TPU and can then proceed with loading the user-defined model.

What web browser you are using
This issue does not depend on a browser, but for completeness, I am using an up-to-date Microsoft Edge.

Additional context
Here is an example of an affected notebook:

import os

# Create the Drive directories if Google Drive is not mounted.
if not os.path.exists("/content/drive"):
  os.mkdir("/content/drive")
if not os.path.exists("/content/drive/MyDrive/"):
  os.mkdir("/content/drive/MyDrive/")

# Download the KoboldAI Colab deployment script and run it for the chosen model.
!wget https://koboldai.org/ckds -O - | bash /dev/stdin --model EleutherAI/gpt-neox-20b

The relevant backend code can be found here: https://github.com/KoboldAI/KoboldAI-Client/blob/main/tpu_mtj_backend.py

This also makes use of a heavily modified MTJ with the following relevant dependencies:

jax == 0.2.21
jaxlib >= 0.1.69, <= 0.3.7
git+https://github.com/VE-FORBRYDERNE/mesh-transformer-jax@ck
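For reference, a minimal sketch of installing that dependency set in a fresh Colab cell (the pins are taken from the list above; the exact invocation is illustrative and is not the ckds installer itself):

# Pin the MTJ dependency set described in this issue.
!pip install "jax==0.2.21" "jaxlib>=0.1.69,<=0.3.7"
!pip install "git+https://github.com/VE-FORBRYDERNE/mesh-transformer-jax@ck"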

MTJ uses tpu_driver0.1_dev20210607

mosmos6 commented 1 year ago

I'm encountering the same issue when loading GPT-J. It was working fine until approximately 24 hours ago.

henk717 commented 1 year ago

> I'm encountering the same issue when loading GPT-J. It was working fine until approximately 24 hours ago.

Our code is based on MTJ, which the original GPT-J runs on top of, and the failure happens prior to loading the model; both use the older V1 implementation of the model. So it's probable this affects all MTJ users.

mosmos6 commented 1 year ago

Mine is trying to connect to grpc://10.63.28.250:8470 and errors out, so it's pretty much everywhere. Furthermore, it's also taking unusually long to collect pathy, uvicorn, etc.

henk717 commented 1 year ago

I have pinpointed the issue to the driver version the projects use. It looks like the older ones are no longer working. For example, tpu_driver0.1_dev20210607 is being used in our project; when paired with the following code, you get the error:

!pip install jax jaxlib

import requests
import os
import jax

from jax.config import config

print("Connecting to your Colab instance's TPU", flush=True)
if os.environ.get('COLAB_TPU_ADDR', '') != '':
    tpu_address = os.environ['COLAB_TPU_ADDR']  # Colab
else:
    tpu_address = os.environ['TPU_NAME']  # Kaggle
tpu_address = tpu_address.replace("grpc://", "")
tpu_address_without_port = tpu_address.split(':', 1)[0]

# Ask the TPU runtime (port 8475) to serve the requested driver version.
url = f'http://{tpu_address_without_port}:8475/requestversion/tpu_driver0.1_dev20210607'
requests.post(url)

# Point JAX's old-style config at the remote tpu_driver backend.
config.FLAGS.jax_xla_backend = "tpu_driver"
config.FLAGS.jax_backend_target = "grpc://" + tpu_address
print()

# This is where the DEADLINE_EXCEEDED error is raised.
jax.devices()

I can't find a list of all available drivers, but I collected three from bug reports and other Colab notebooks.

tpu_driver0.1_dev20210607 is used by us and produces the error. tpu_driver0.1-dev20211030 is newer and used by some examples where people recommend not to use the nightly; this also produces the error.

tpu_driver_20221011 is being used by some Stable Diffusion Colabs, and that one works in my example above, but unfortunately it does not work with our MTJ notebook.

If someone knows of a list of long-term supported drivers, I could test more of them and see if this fixes the issue for MTJ. Otherwise, I'd like to politely request that the commonly used older drivers are restored to functionality. GPT-J and MTJ are still widely used but rely on older driver versions.

Update: this seems to affect all the 0.1 drivers.
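For anyone who wants to test other candidate versions, here is a minimal sketch that parameterizes the version string in the snippet above (DRIVER_VERSION is a placeholder to substitute yourself; note that a failing driver can leave the TPU hung, so use a fresh session per attempt):

import os
import requests
import jax
from jax.config import config

# Candidate driver version to test; substitute one of the strings above.
DRIVER_VERSION = "tpu_driver0.1_dev20210607"

tpu_address = os.environ['COLAB_TPU_ADDR'].replace("grpc://", "")
host = tpu_address.split(':', 1)[0]

# Ask the Colab TPU runtime to serve the requested driver version.
requests.post(f'http://{host}:8475/requestversion/{DRIVER_VERSION}')

config.FLAGS.jax_xla_backend = "tpu_driver"
config.FLAGS.jax_backend_target = "grpc://" + tpu_address

# Raises RuntimeError (DEADLINE_EXCEEDED) if the driver fails to initialize.
print(jax.devices())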

mosmos6 commented 1 year ago

GPT-J doesn't work with tpu_driver_20221011 either.

henk717 commented 1 year ago

GPT-J indeed won't work with that, but it does make the difference between connecting to the TPU and getting the deadline errors. We will have to wait for the Google engineers to fix the 0.1 drivers we depend upon. For the time being, Kaggle still works, so if you have something urgent that can be done on Kaggle, I recommend checking there until they have had some time to fix this.

mosmos6 commented 1 year ago

Thank you @henk717. I use GPT-J heavily every day for work, so I'll need it running by Monday morning. I hope this is a temporary issue.

henk717 commented 1 year ago

I hope so too, but breaking the entire 0.1 driver ecosystem does not sound like something they did on purpose, or something they won't be interested in fixing before it gets installed on things like Kaggle and Google Compute.

My theory is that the TPUv2 firmware update causing this has either been rolled out everywhere with TPUv3 unaffected, or they used Colab as a testing ground to see if people would run into issues, and we are the first to notice because we rely on a dependency from the 2021 TPU era.

mosmos6 commented 1 year ago

Is there a way to run GPT-J inference from a Jupyter notebook on a GCP TPU machine?

henk717 commented 1 year ago

Tagging @ultrons since he is the project manager for the TPUs; he may be able to get this to the right person. Thousands depend on MTJ for inference, since it can be used to automatically load some Hugging Face PyTorch models on the TPU.

But especially since this is a failure to initialize the TPU at a very basic level, with the 0.1 driver resulting in a broken, unresponsive TPU, I expect this affects more Colab users than just the ones depending on MTJ. And if this same firmware bug spreads outside of Colab, more TPU customers could be affected across the entire Google Cloud.

mosmos6 commented 1 year ago

I'm subscribed to Pro for the TPU. If it stays uninitializable, it's no use.

Kipcreate commented 1 year ago

Same error here, trying to run Colab on TPU. GPU alternatives are practically unusable for the stuff I'm doing, so I really need that TPU up and running. Otherwise, my Pro sub ain't worth much of anything.

candymint23 commented 1 year ago

Can confirm this problem with the GPT models I use. I can't run them because of the same problem.

metrizable commented 1 year ago

@henk717 Thanks for reporting the issue and thanks for using Colab. I can confirm that specifying the 0.1dev does not work, but taking the default and specifying the 0.2 drivers does work. Tracking internally at b/269607171.

somsomers commented 1 year ago

> @henk717 Thanks for reporting the issue and thanks for using Colab. I can confirm that specifying the 0.1dev does not work, but taking the default and specifying the 0.2 drivers does work. Tracking internally at b/269607171.

You mean the full driver request URL would be:

colab_tpu_addr = os.environ['COLAB_TPU_ADDR'].split(':')[0]
url = f'http://{colab_tpu_addr}:8475/requestversion/tpu_driver0.2'

?

henk717 commented 1 year ago

> > @henk717 Thanks for reporting the issue and thanks for using Colab. I can confirm that specifying the 0.1dev does not work, but taking the default and specifying the 0.2 drivers does work. Tracking internally at b/269607171.
>
> You mean the full driver request URL would be:
>
> colab_tpu_addr = os.environ['COLAB_TPU_ADDR'].split(':')[0]
> url = f'http://{colab_tpu_addr}:8475/requestversion/tpu_driver0.2'
>
> ?

This is indeed correct; the 0.2 drivers and newer (including the ones that just use a 2022 version number without the other versioning) load fine. If your notebook is compatible with the newer drivers, this can solve the issue for you. Unfortunately, a lot of the notebooks that directly call for a 0.1 driver will break when this is attempted, because of incompatibilities.

You can find a sample notebook here: https://colab.research.google.com/drive/1YDcZJ4EMOd3f_kuk0RnD5AJBEpUhMl2I#revisionId=0B7OnP7aLuFgXMXFiZU9sNDZnWmNpVmVzaWc1YlhYaEF6ZnAwPQ
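Putting the pieces together, a minimal sketch of the working 0.2 connection under the old-style JAX config (same pattern as the snippet earlier in this thread; only the requested version string changes):

import os
import requests
from jax.config import config

tpu_address = os.environ['COLAB_TPU_ADDR'].replace("grpc://", "")
host = tpu_address.split(':', 1)[0]

# Request the 0.2 driver, which currently initializes fine.
requests.post(f'http://{host}:8475/requestversion/tpu_driver0.2')

config.FLAGS.jax_xla_backend = "tpu_driver"
config.FLAGS.jax_backend_target = "grpc://" + tpu_address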

mosmos6 commented 1 year ago

@metrizable Thank you for taking care of this issue. Initializing the default and the 0.2 drivers is possible, but it causes a crash when creating the network for GPT-J, and probably for its derivatives. So unfortunately I don't think these can serve as a temporary remedy.

DamascusGit commented 1 year ago

When will this be implemented on henk.tech/ckds? I'm looking to use it with the Colab Kobold deployment script, as I currently have no way of converting my Mesh Transformer JAX weights to HF.

henk717 commented 1 year ago

> When will this be implemented on henk.tech/ckds? I'm looking to use it with the Colab Kobold deployment script, as I currently have no way of converting my Mesh Transformer JAX weights to HF.

You are commenting on a Google issue, not a Kobold issue. If it were as simple as changing versions, I would have done so, but it is not the ckds script that decides this. Mesh Transformer JAX itself requests the broken driver, and does so because it is not compatible with anything newer.

Since it's a very low-level dependency issue, I am unable to resolve it myself, as that requires deep knowledge of the TPU itself.

wingk10 commented 1 year ago

Our operations team uses MTJ on a daily basis and hasn't been able to since the TPUs went down. Really hoping this gets resolved

mosmos6 commented 1 year ago

@wingk10 I don't want to make this post lengthy, but it's the same here for me. Kaggle is affected, and now the queue for a TPU has over 70 users, which means we are likely to wait for 6 hours. We are even losing the alternatives. That said, I see the situation evolving quietly: on Monday, 0.1 was initializable but didn't create a network; now the default is not initializable either (but 0.2 is; it seems quite random).

wingk10 commented 1 year ago

Yes, we're running into the same issue re: Kaggle. There are 64 users in the queue right now, and it seems to have gotten worse suddenly over the past few days. Obviously, I can't do anything but post and say "hey, it's important to me too", with no alternatives (extra useless here). I hope we hear more soon.

mosmos6 commented 1 year ago

I decided to pay $10/h and tried to connect Vertex AI to a Cloud TPU, but there was no available TPU in my bucket's region. So, there are really no alternatives.

candymint23 commented 1 year ago

Updates please?

donkawechico commented 1 year ago

I purchased Colab Pro+ last month and have barely been able to use it due to this issue 😭 It's been 2 weeks since the last update in this thread; it would be great to get some news, even if it's just "still can't seem to get the danged thing working".

somsomers commented 1 year ago

It looks like either it's very difficult to fix or nobody cares.

wingk10 commented 1 year ago

@metrizable Are there any updates on this? A lot of people rely on this TPU setup for work, and it's been some time since we heard anything.

henk717 commented 1 year ago

In my own community, it's also thousands of people wanting to use the TPUs for inference again. I am personally in the same boat as all the other Pro users here, where the only reason I had Pro is that it is an affordable high-RAM solution. For the GPU side I have different alternatives.

mosmos6 commented 1 year ago

At least at this moment, the issue has been resolved and it's now working for GPT-J. Good news.

donkawechico commented 1 year ago

Same here! Seems to be working for me.

henk717 commented 1 year ago

Our community has taken notice too, and multiple users confirmed the issue is resolved. I'll close this issue for now.

cperry-goog commented 1 year ago

Glad it's resolved. For reference, we have no ability to get the older 0.1 TPU drivers up and running again. The outdated drivers were removed from a recent build upstream, and we haven't found any way to get those back in.

You'll need to upgrade to more recent drivers. Sorry.

henk717 commented 1 year ago

@cperry-goog Our dependencies did not change. To my knowledge it is not literally using the 0.1 drivers, but it does initialize again when 0.1 is requested and operates in a compatible mode. Please keep this functional now that it works again; MTJ isn't actively maintained upstream, but thousands depend on the fix that was pushed today.

I assume one of the TPU engineers managed to get it running again in the correct backwards-compatible mode, even if it is not running a real 0.1 driver. It would help if that part were marked as important for the ecosystem so it does not get broken again in the next updates.

DamascusGit commented 1 year ago

seems to be breaking again...

AVKosterin commented 1 year ago

> seems to be breaking again...

I see the problem too...

henk717 commented 1 year ago

Can confirm, exact same issue is back.

mosmos6 commented 1 year ago

Yes, it's happening again. (Yesterday a JAX-related bug occurred in a couple of different forms. I thought it was because of last week's upgrade to Python 3.9, but it wasn't; that issue resolved itself in a few hours.)

henk717 commented 1 year ago

We also had a handful of these every week, but those were always a thing and could be resolved by getting a new notebook, so when people reported it they could normally get back in. But since last night it's constant.

mosmos6 commented 1 year ago

At this moment, none of the 0.1, 0.2, nightly, or default drivers can be initialized, so the magnitude is higher than last time.

mosmos6 commented 1 year ago

I know this post is about Colab, but let me add that I have confirmed Kaggle hits the same TPU initialization error this time. So there's no alternative now.

candymint23 commented 1 year ago

> At this moment, none of the 0.1, 0.2, nightly, or default drivers can be initialized, so the magnitude is higher than last time.

So apparently it is not just a 0.1 driver issue, because when they said we could not access 0.1 anymore, it still worked after their supposed changes several weeks ago.

mosmos6 commented 1 year ago

@candymint23 As of 4 hours ago, the nightly and other drivers worked, but not 0.1. Apparently they are fixing the issue starting from the other drivers.

Borrowdale commented 1 year ago

How long do these outages normally last?

henk717 commented 1 year ago

Last time it was weeks. Also keep in mind when testing that incorrect drivers can hang; you need an empty Colab session per driver attempt for reliable results. So far I have only had the issue on 0.1.

I've also been looking at potential alternative frameworks such as PyTorch/XLA, with no success (even 2.7B models didn't fit, but it's possible I did something wrong there). On those frameworks the TPU does initialize fine, so it's specifically JAX 0.1.
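For reference, a minimal sketch of checking TPU initialization under PyTorch/XLA (assuming torch_xla is installed in the Colab TPU runtime; this is illustrative, not the exact code used above):

import torch
import torch_xla.core.xla_model as xm

# Acquire an XLA (TPU) device; per the observation above, this succeeds
# even where the JAX 0.1 tpu_driver backend fails to initialize.
device = xm.xla_device()

# Run a trivial op to confirm the device actually executes work.
x = torch.ones(2, 2, device=device)
print(device, x.sum().item())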

Borrowdale commented 1 year ago

So I take it this is affecting everyone; it's not something only a handful of us are missing or doing wrong?

henk717 commented 1 year ago

It is affecting anyone using MTJ or code relying on the older TPU JAX drivers; from what I saw, JAX-related projects older than a year. So that is GPT-J, KoboldAI, and others who rely on MTJ, but also projects beyond that which I have no knowledge of.

Unfortunately MTJ can't run properly on the newer driver, so we are stuck on that front.

Borrowdale commented 1 year ago

I have no idea how this stuff works, but is there a way to "ping" the driver(?) to see if it's up/running/whatever the equivalent is, so we can track when it's back up? Or is there usually an announcement?

Again, I have no idea what I'm talking about; I'm just pulling this out of the void, assuming it works something like a server, with no basis to do so lol.

Sorry if I come across as a house brick.

mosmos6 commented 1 year ago

A new symptom appeared: though I need to use jaxlib 0.1.68, it now requires 0.1.74. (A quick version check is sketched after the list below.)

  1. 0.1.68 had no problem until earlier this morning under the same conditions.
  2. The same issue appeared two days ago, and that time it disappeared in a few hours.
  3. On Kaggle, 0.1.68 suddenly became unable to be installed on Python 3.7 two days ago.
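A trivial sketch of that version check, just printing which jax / jaxlib builds the session actually resolved:

# Show the installed jax / jaxlib package versions.
!pip show jax jaxlib

import jax
print("jax", jax.__version__)  # jax's own reported version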

> On those frameworks the TPU does initialize fine, so it's specifically JAX 0.1.

I'm guessing this whole situation is arising from a recent change in JAX. (Just guessing)

[Screenshot: 2023-03-15 100838]

mosmos6 commented 1 year ago

At this moment, TPU_driver0.1 can be initialized on Colab (but not Kaggle), but it can't create the network.

DevLance112 commented 1 year ago

Do we know if anyone is looking into this problem?