Closed: henk717 closed this issue 2 months ago.
I'm encountering the same issue when loading GPT-J. It was working fine until 24 hours ago approximately.
Our code is based on MTJ, which the original GPT-J runs on top of, and the failure happens prior to loading the model; both use the older V1 implementation of the model. So it's probable this affects all MTJ users.
Mine is trying to connect to grpc://10.63.28.250:8470 and errors, so it's pretty much everywhere. Furthermore, it's also taking unusually long to collect pathy, uvicorn, etc.
I have pinpointed the issue to the driver version the projects use. It looks like the older ones are no longer working. For example, tpu_driver0.1_dev20210607 is used in our project; when paired with the following code you get the error:
```python
!pip install jax jaxlib

import requests
import os
import jax
from jax.config import config

print("Connecting to your Colab instance's TPU", flush=True)
if os.environ.get('COLAB_TPU_ADDR', '') != '':
    tpu_address = os.environ['COLAB_TPU_ADDR']  # Colab
else:
    tpu_address = os.environ['TPU_NAME']  # Kaggle
tpu_address = tpu_address.replace("grpc://", "")
tpu_address_without_port = tpu_address.split(':', 1)[0]
url = f'http://{tpu_address_without_port}:8475/requestversion/tpu_driver0.1_dev20210607'
requests.post(url)
config.FLAGS.jax_xla_backend = "tpu_driver"
config.FLAGS.jax_backend_target = "grpc://" + tpu_address
print()
jax.devices()
```
I can't find a list of all available drivers, but I collected three from bug reports and other Colabs.
tpu_driver0.1_dev20210607 is the one we use and it produces the error. tpu_driver0.1-dev20211030 is newer and used by some examples where people recommend against the nightly; this also produces the error.
tpu_driver_20221011 is used by some Stable Diffusion Colabs, and that one works in my example above, but unfortunately it does not work with our MTJ notebook.
If someone knows a list of long-term-supported drivers, I could test more of them and see if this fixes the issue for MTJ. Otherwise, I'd like to politely request that the commonly used older drivers be restored to functionality. GPT-J and MTJ are still widely used but rely on older driver versions.
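Until such a list turns up, one low-risk way to survey candidate versions is to POST each one to the TPU's requestversion endpoint and record the HTTP status, without touching JAX at all. This is only a sketch: the version list is just the three collected above, and a 2xx status only means the endpoint accepted the request, not that `jax.devices()` will succeed afterwards.

```python
import os

import requests

# Candidate driver versions collected so far from bug reports and other
# Colabs; this list is illustrative, not exhaustive.
CANDIDATE_DRIVERS = [
    "tpu_driver0.1_dev20210607",
    "tpu_driver0.1-dev20211030",
    "tpu_driver_20221011",
]


def driver_request_url(tpu_address, driver_version):
    """Build the version-request URL on the TPU's HTTP port (8475)."""
    host = tpu_address.replace("grpc://", "").split(":", 1)[0]
    return f"http://{host}:8475/requestversion/{driver_version}"


def probe_drivers(tpu_address, drivers=CANDIDATE_DRIVERS, timeout=10):
    """POST a version request for each driver and record the HTTP status.

    A 2xx status only means the endpoint accepted the request;
    jax.devices() can still hang afterwards, so verify each driver in a
    fresh Colab session.
    """
    results = {}
    for version in drivers:
        url = driver_request_url(tpu_address, version)
        try:
            results[version] = requests.post(url, timeout=timeout).status_code
        except requests.RequestException as exc:
            results[version] = repr(exc)
    return results


if __name__ == "__main__":
    addr = os.environ.get("COLAB_TPU_ADDR") or os.environ.get("TPU_NAME", "")
    if addr:
        for version, status in probe_drivers(addr).items():
            print(version, status)
```

Because the actual initialization happens in JAX, a reachable endpoint here is necessary but not sufficient; treat the results as a shortlist for per-session testing.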
Update: Seems to affect all the 0.1 drivers.
GPT-J doesn't work with tpu_driver_20221011 either.
GPT-J won't work with that indeed, but it does make a difference between connecting to the TPU and getting the deadline errors. We will have to wait for the Google engineers to fix the 0.1 drivers we depend upon. For the time being Kaggle still works, so if you have something urgent that can be done on Kaggle, I recommend working there until they have some time to fix it.
Thank you @henk717. I heavily use GPT-J every day for work, so I'll need it running from Monday morning. I hope this is a temporary issue.
I hope so too, but breaking the entire 0.1 driver ecosystem does not sound like something they did on purpose, so I expect they will want to fix it before this gets installed on things like Kaggle and Google Compute.
My theory is that the TPUv2 firmware update causing this has either already spread everywhere while TPUv3 is unaffected, or they used Colab as a testing ground to see if people would run into issues, and we are the first to notice because we rely on a dependency from the 2021 TPU era.
Is there a way to run GPT-J inference from a Jupyter notebook on a GCP TPU machine?
Tagging @ultrons since he is the project manager for the TPUs. He may be able to get this to the right person. Thousands depend on MTJ for inference since it can be used to automatically load some Hugging Face PyTorch models on the TPU.
But especially since this is a failure to initialize the TPU at a very basic level, with the 0.1 driver resulting in a broken, unresponsive TPU, I expect this affects more Colab users than just those depending on MTJ. And if this same firmware bug spreads outside of Colab, more TPU customers across the entire Google Cloud could be affected.
I'm subscribed to Pro for the TPU. If it stays uninitializable, it's of no use.
Same error here, trying to run Colab on TPU. GPU alternatives are practically unusable for the stuff I'm doing, so I really need that TPU up and running. Otherwise, my Pro sub ain't worth much of anything.
Can confirm this problem with the GPT models I use. I can't run them because of the same problem.
@henk717 Thanks for reporting the issue and thanks for using Colab. I can confirm that specifying the 0.1dev does not work, but taking the default and specifying the 0.2 drivers does work. Tracking internally at b/269607171.
You mean, the full driver path would be:

```python
colab_tpu_addr = os.environ['COLAB_TPU_ADDR'].split(':')[0]
url = f'http://{colab_tpu_addr}:8475/requestversion/tpu_driver0.2'
```

?
This is indeed correct. The 0.2 drivers and newer (including the ones that just use a 2022 version number without the other versioning) load fine. If your notebook is compatible with the newer drivers, this can solve the issue for you; unfortunately, a lot of the notebooks that directly call for a 0.1 driver will break when this is attempted because of incompatibilities.
You can find a sample notebook here : https://colab.research.google.com/drive/1YDcZJ4EMOd3f_kuk0RnD5AJBEpUhMl2I#revisionId=0B7OnP7aLuFgXMXFiZU9sNDZnWmNpVmVzaWc1YlhYaEF6ZnAwPQ
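For notebooks that can tolerate the newer driver, the switch-over can be sketched as a small helper. This is a hedged sketch, assuming the same `COLAB_TPU_ADDR` environment variable and `jax.config` flags used elsewhere in this thread; `tpu_driver0.2` is the version string reported to work here.

```python
import os

import requests


def requestversion_url(tpu_address, version="tpu_driver0.2"):
    """Build the URL that asks the Colab TPU to load the given driver."""
    host = tpu_address.replace("grpc://", "").split(":", 1)[0]
    return f"http://{host}:8475/requestversion/{version}"


def switch_to_driver(version="tpu_driver0.2"):
    """Request the driver, then point JAX at the TPU (Colab only)."""
    tpu_address = os.environ["COLAB_TPU_ADDR"]  # e.g. "10.240.1.2:8470"
    requests.post(requestversion_url(tpu_address, version), timeout=30)
    # JAX is imported lazily so the URL helper stays importable off-TPU.
    import jax
    from jax.config import config
    config.FLAGS.jax_xla_backend = "tpu_driver"
    config.FLAGS.jax_backend_target = "grpc://" + tpu_address
    return jax.devices()
```

As noted above, this does not help notebooks (like MTJ's) that hard-require a 0.1 driver; it only covers code compatible with the newer drivers.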
@metrizable Thank you for taking care of this issue. Initializing the default and the 0.2 drivers is possible, but it causes a crash when creating the network for GPT-J, and probably its derivatives. So unfortunately I don't think these can serve as a temporary remedy.
when will this be implemented on henk.tech/ckds? looking to use with colab kobold deployment script as currently have no way of converting my mesh transformer jax weights to HF.
You are commenting on a Google issue, not a Kobold issue. If it were as simple as changing versions I would have done so, but it is not the ckds script that decides this. Mesh Transformer JAX itself requests the broken driver, and does so because it is not compatible with anything newer.
Since it's a very low-level dependency issue, I am unable to resolve it myself, as that requires deep knowledge of the TPU itself.
Our operations team uses MTJ on a daily basis and hasn't been able to since the TPUs went down. Really hoping this gets resolved
@wingk10 I don't want to make this post lengthy, but it's the same here for me. Kaggle is affected, and the queue for TPUs now has over 70 users, which means we are likely to wait 6 hours. We are losing even the alternatives. That said, I see the situation evolving quietly: on Monday, 0.1 was initializable but didn't create the network; now the default is not initializable either (but 0.2 is; it seems quite random).
Yes, we're running into the same issue re: Kaggle. 64 users in the queue right now, and seems to have gotten worse suddenly over the past few days. Obviously, I can't do anything but post and go "hey, it's important to me too", with no alternatives (extra useless here). I hope we hear more soon.
I decided to pay $10/h and tried to connect vertex AI to cloud TPU, but there was no available TPU in my bucket region. So, there are really no alternatives.
Updates please?
I purchased Colab Pro+ last month and have barely been able to use it due to this issue 😭 Been 2 weeks since last update in this thread, would be great to get some news, even if it's just "still can't seem to get the danged thing working".
It looks like either it's very difficult to fix or nobody cares.
@metrizable Are there any updates on this? A lot of people rely on this setup of TPUs working for work and it's been some time since we heard anything.
In my own community it's also thousands of people wanting to use the TPUs for inference again. I am personally in the same boat as all the other Pro users here: the only reason I had Pro is that it is an affordable high-RAM solution. For the GPU side I have different alternatives.
At least at this moment, the issue seems to have been resolved, and it's working for GPT-J. Good news.
Same here! Seems to be working for me.
Our community has taken notice too, and multiple users confirmed the issue is resolved; I'll close this issue for now.
Glad it's resolved. For reference, we have no ability to get the older 0.1 TPU drivers up and running again. The outdated drivers were removed from a recent upstream build, and we haven't found any way to get those back in.
You'll need to upgrade to more recent drivers. Sorry.
@cperry-goog Our dependencies did not change; to my knowledge it is not literally using the 0.1 drivers, but it does initialize again when 0.1 is requested and operates in a compatible mode. Please keep this functional now that it works again. MTJ isn't actively maintained upstream, but thousands depend on the fix that was pushed today.
I assume one of the TPU engineers managed to get it running again in the correct backwards-compatible mode, even if it is not running a real 0.1 driver. It would help if that part were marked as important for the ecosystem so it does not get broken again in the next updates.
seems to be breaking again...
I see the problem too...
Can confirm, exact same issue is back.
Yes, it's happening again. (Yesterday a JAX-related bug occurred in a couple of different forms. I thought it was because of last week's upgrade to Python 3.9, but it wasn't; that issue resolved itself in a few hours.)
We also had a handful of these every week, but those were always a thing and could be resolved by getting a new notebook, so when people reported it they could normally get back in. But since last night it's constant.
At this moment, none of the 0.1, 0.2, nightly, or default drivers can be initialized, so the magnitude is higher than last time.
I know this post is about Colab, but let me add that I confirmed Kaggle hits the same TPU initialization error this time. So there's no alternative now.
So apparently it's not just a 0.1 driver issue, because when they said we could not access 0.1 anymore, it still worked after their supposed changes several weeks ago.
@candymint23 As of 4 hours ago, the nightly and other drivers worked, but not 0.1. Apparently they are fixing the issue starting with the other drivers.
How long do these outages normally last?
Last time it was weeks. Also keep in mind when testing that incorrect drivers can hang; you need a fresh Colab session per driver attempt for reliable results. So far I have only had the issue on 0.1.
I've also been looking at potential alternative frameworks such as PyTorch XLA, with no success (even 2.7B models didn't fit, but it's possible I did something wrong there). On those frameworks the TPU does initialize fine, so it's specifically JAX 0.1.
So I take it this is affecting everyone; it's not something only a handful of us are missing or doing wrong?
It is affecting anyone using MTJ or code relying on the older TPU JAX drivers; from what I saw, JAX-related projects older than one year. So that is GPT-J, KoboldAI, and others who rely on MTJ, but also projects beyond that I have no knowledge of.
Unfortunately MTJ can't run properly on the newer driver, so we are stuck on that front.
I have no idea how this stuff works, but is there a way to "ping" the driver to see if it's up/running (or whatever the equivalent is) so we can track when it's back up, or is there usually an announcement?
Again, I have no idea what I'm talking about; I'm just pulling this out of thin air, assuming it works something like a server, with no basis to do so, lol.
Sorry if I come across as a house brick.
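For what it's worth, there is a crude way to do roughly that from a notebook: the JAX backend connects to the TPU's gRPC port (8470 in the error messages in this thread), so a plain TCP connect attempt tells you whether that port is reachable at all. This is only a sketch; a successful connect does not prove the driver will initialize, it just rules out a dead endpoint.

```python
import socket


def tpu_port_open(host, port=8470, timeout=5.0):
    """Return True if a TCP connection to the TPU's gRPC port succeeds.

    A successful connect does not guarantee jax.devices() will work,
    but a refused or timed-out connect means the TPU is unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example, using the (hypothetical for you) address from the traceback
# in this thread; poll this in a loop to watch for recovery:
# tpu_port_open("10.106.231.74")
```

There is no official status page for the drivers that I know of, so polling like this plus watching this issue is about the best tracking available.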
A new symptom appeared. Though I need to use jaxlib 0.1.68, it requires 0.1.74.
> On those frameworks it does initialize the TPU fine, so its specifically Jax 0.1
I'm guessing this whole situation is arising from a recent change in JAX. (Just guessing)
At this moment, tpu_driver0.1 can be initialized on Colab (but not Kaggle), but it can't create the network.
Do we know if anyone is looking into this problem?
Describe the current behavior
When running an older version of JAX, the TPU produces the following error:

```
Traceback (most recent call last):
  File "aiserver.py", line 10214, in <module>
    load_model(initial_load=True)
  File "aiserver.py", line 2806, in load_model
    tpu_mtj_backend.load_model(vars.custmodpth, hf_checkpoint=vars.model not in ("TPUMeshTransformerGPTJ", "TPUMeshTransformerGPTNeoX") and vars.use_colab_tpu, **vars.modelconfig)
  File "/content/KoboldAI-Client/tpu_mtj_backend.py", line 1194, in load_model
    devices = np.array(jax.devices()[:cores_per_replica]).reshape(mesh_shape)
  File "/usr/local/lib/python3.8/dist-packages/jax/_src/lib/xla_bridge.py", line 314, in devices
    return get_backend(backend).devices()
  File "/usr/local/lib/python3.8/dist-packages/jax/_src/lib/xla_bridge.py", line 258, in get_backend
    return _get_backend_uncached(platform)
  File "/usr/local/lib/python3.8/dist-packages/jax/_src/lib/xla_bridge.py", line 248, in _get_backend_uncached
    raise RuntimeError(f"Requested backend {platform}, but it failed "
RuntimeError: Requested backend tpu_driver, but it failed to initialize: DEADLINE_EXCEEDED: Failed to connect to remote server at address: grpc://10.106.231.74:8470. Error from gRPC: Deadline Exceeded. Details:
```
This happens for all users of the notebook on Colab, while Kaggle is still working as intended.
Describe the expected behavior
JAX is correctly able to connect to the TPU and can then proceed with loading the user-defined model.
What web browser you are using
This issue does not depend on a browser, but for completeness, I am using an up-to-date Microsoft Edge.
Additional context
Here is an example of an affected notebook:
The relevant backend code can be found here: https://github.com/KoboldAI/KoboldAI-Client/blob/main/tpu_mtj_backend.py
This also makes use of a heavily modified MTJ with the following relevant dependencies:

```
jax == 0.2.21
jaxlib >= 0.1.69, <= 0.3.7
git+https://github.com/VE-FORBRYDERNE/mesh-transformer-jax@ck
```
MTJ uses tpu_driver0.1_dev20210607