replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

File uploads appear to be broken on replicate #1759

Open platform-kit opened 2 months ago

platform-kit commented 2 months ago
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.8.14/lib/python3.8/site-packages/cog/server/runner.py", line 359, in predict
    return _predict(
  File "/root/.pyenv/versions/3.8.14/lib/python3.8/site-packages/cog/server/runner.py", line 427, in _predict
    event_handler.set_output(event.payload)
  File "/root/.pyenv/versions/3.8.14/lib/python3.8/site-packages/cog/server/runner.py", line 246, in set_output
    self.p.output = self._upload_files(output)
  File "/root/.pyenv/versions/3.8.14/lib/python3.8/site-packages/cog/server/runner.py", line 308, in _upload_files
    raise FileUploadError("Got error trying to upload output files") from error
cog.server.runner.FileUploadError: Got error trying to upload output files

Model: https://replicate.com/camenduru/lgm

This is happening on many models, including ones that have been usable on Replicate for some time now.

See also: https://discord.com/channels/775512803439280149/779461485277347850/1243322233099653130 https://discord.com/channels/775512803439280149/1144193090970722394/1215721819143929917

also: search "upload error" or just "upload" for many more instances of people reporting this.

No response from the Replicate team that I could see.

Has something about cog's file output behavior changed?

This is the only conclusion I can come to, as the code I have used to export files (mp3, wav, etc) on my old models is not working on new ones.

For context - I am currently trying to help CambAI release a version of their Mars5-TTS that uses Replicate's native file output to return audio. However, they haven't been able to make it work, nor have I.

Relevant issue: https://github.com/Camb-ai/MARS5-TTS/issues/40

mattt commented 2 months ago

Hi @platform-kit. Thanks for reporting this issue. File uploads appear to be working for other models like stability-ai/stable-diffusion-3, and I'm not aware of any historical or ongoing incidents around that. However, I was able to reproduce this failure myself.

Logs

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/cog/server/runner.py", line 303, in _upload_files
    return self._file_uploader(output)
  File "/usr/local/lib/python3.10/site-packages/cog/server/runner.py", line 212, in file_uploader
    return upload_files(output, upload_file=upload_file)
  File "/usr/local/lib/python3.10/site-packages/cog/json.py", line 53, in upload_files
    return [upload_files(value, upload_file) for value in obj]
  File "/usr/local/lib/python3.10/site-packages/cog/json.py", line 53, in <listcomp>
    return [upload_files(value, upload_file) for value in obj]
  File "/usr/local/lib/python3.10/site-packages/cog/json.py", line 55, in upload_files
    with obj.open("rb") as f:
  File "/usr/local/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'workspace/gradio_output.mp4'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/cog/server/runner.py", line 359, in predict
    return _predict(
  File "/usr/local/lib/python3.10/site-packages/cog/server/runner.py", line 433, in _predict
    event_handler.set_output(event.payload)
  File "/usr/local/lib/python3.10/site-packages/cog/server/runner.py", line 246, in set_output
    self.p.output = self._upload_files(output)
  File "/usr/local/lib/python3.10/site-packages/cog/server/runner.py", line 308, in _upload_files
    raise FileUploadError("Got error trying to upload output files") from error
cog.server.runner.FileUploadError: Got error trying to upload output files
```

The ultimate failure was cog.server.runner.FileUploadError, but the relevant part of the trace was here:

  File "/usr/local/lib/python3.10/site-packages/cog/json.py", line 53, in <listcomp>
    return [upload_files(value, upload_file) for value in obj]
  File "/usr/local/lib/python3.10/site-packages/cog/json.py", line 55, in upload_files
    with obj.open("rb") as f:
  File "/usr/local/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'workspace/gradio_output.mp4'

So the problem isn't that uploading is broken, it's that the file can't be located.

Looking at internal logs, the success rate for camenduru/lgm does appear to change around May 2nd, which corresponds to our rollout of Cog v0.9.6. But I can't tell for sure whether that's a coincidence.

Looking into the source code there are a few things that seem suspect:

  1. This call to os.chdir could be messing with where
  2. The cog.yaml file, which is... well, it's doing a lot
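To illustrate why a call to os.chdir is a plausible suspect, here is a minimal sketch of how a relative path stops resolving once the working directory changes. It is not taken from the model's code; the directory layout is made up for the example, and only the relative path string mirrors the one in the traceback.

```python
import os
import tempfile
from pathlib import Path

start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as root:
    try:
        # Hypothetical layout: an output directory and some other directory.
        os.makedirs(os.path.join(root, "workspace"))
        os.makedirs(os.path.join(root, "elsewhere"))

        os.chdir(root)
        relative = Path("workspace/gradio_output.mp4")
        relative.write_bytes(b"fake video bytes")

        # Capture an absolute path while the working directory is still correct.
        absolute = relative.resolve()
        exists_before = relative.exists()

        # Anything that changes the working directory after the file is written...
        os.chdir("elsewhere")

        # ...makes the relative path point at a location that does not exist.
        relative_exists_after = relative.exists()
        absolute_exists_after = absolute.exists()
    finally:
        # Leave the temporary directory before it is cleaned up.
        os.chdir(start_dir)

print(exists_before, relative_exists_after, absolute_exists_after)
```

If this is what is happening in a model, returning an absolute path (or resolving the path before any directory change) would sidestep the problem. Whether that applies to camenduru/lgm is not established in this thread.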

I can't account for why this stopped working, but I'll continue looking into it.

I've reached out to Camb-ai in the linked issue and will work with them to try to get to the bottom of this.

platform-kit commented 1 month ago

@mattt thanks for the quick response.

I've deployed a fork of the Mars5-TTS model to Replicate that returns a Path:

https://replicate.com/platform-kit/mars5-tts/versions/33a2ed337e20ecbb932a2c304ea42ba0903f0c11ba68421ca9676f79fae82317

code for that deploy's predict.py here: https://github.com/Render-AI/MARS5-TTS/blob/f27cd6e99ac08033ca04d5450a3d36433e85d9f7/cog/predict.py

According to the Cog docs, "For models that return a cog.Path object, the prediction output returned by Cog's built-in HTTP server will be a URL."

However, the result is actually just a string containing a relative path, not a URL:

{
  "completed_at": "2024-06-24T03:52:37.387801Z",
  "created_at": "2024-06-24T03:49:27.655000Z",
  "data_removed": false,
  "error": null,
  "id": "p4txt87bwxrgg0cg91e9v6d79w",
  "input": {
    "text": "This is a test",
    "testMode": "false",
    "ref_audio_file": "https://www.renderai.com/audio/examples/bob-example-1.mp3",
    "ref_audio_transcript": "Space: the final frontier. These are the voyages of the starship enterprise. It's five year misssion: to explore strange new worlds; to seek out new life and new civilizations; to boldly go where no man has gone before."
  },
  "logs": ">>> Running inference\nWARNING:root:Reference audio duration is 20.06 > max suggested ref audio. Expect quality degradations. We recommend you trim prompt to be shorter than max prompt length.\nNote: using deep clone. Assuming input `c_phones` is concatenated prompt and output phones. Also assuming no padded indices in `c_codes`.\nNew x: torch.Size([1, 3022, 8]) | new x_known: torch.Size([1, 3022, 8]) . Base prompt: torch.Size([1, 1505, 8]). New padding mask: torch.Size([1, 3022]) | m shape: torch.Size([1, 3022, 8])\n>>>>> Done with inference",
  "metrics": {
    "predict_time": 107.894457702,
    "total_time": 189.732801
  },
  "output": "output.mp3",
  "started_at": "2024-06-24T03:50:49.493343Z",
  "status": "succeeded",
  "urls": {
    "get": "https://api.replicate.com/v1/predictions/p4txt87bwxrgg0cg91e9v6d79w",
    "cancel": "https://api.replicate.com/v1/predictions/p4txt87bwxrgg0cg91e9v6d79w/cancel"
  },
  "version": "33a2ed337e20ecbb932a2c304ea42ba0903f0c11ba68421ca9676f79fae82317"
}

Image of output in Replicate UI:

Screenshot 2024-06-23 at 8 55 30 PM

When running cog predict locally, I can see that the file output.mp3 is successfully created. But even though the file exists, no URL is produced, just the relative path output.mp3.

I also created a test mode on this model that, if enabled, skips inference and passes a file which I know to be on the filesystem (since it is included in the deploy). Same result.

In a later deploy, I also implemented tempfile as the docs suggest, in case that had something to do with it. The result is the same: a plain path string rather than a URL, though this time the path points into the /tmp folder.

{
  "completed_at": "2024-06-24T06:32:48.388967Z",
  "created_at": "2024-06-24T06:29:21.517000Z",
  "data_removed": false,
  "error": null,
  "id": "h7njk9afxnrg80cg93qt648w2g",
  "input": {
    "text": "test",
    "testMode": "false",
    "ref_audio_file": "https://replicate.delivery/pbxt/L9PPFliYxJQY8PfbICObAygtaNOvupQ4Bv5p6siBWwMu1buR/output%20(13)%20trimmed.wav",
    "ref_audio_transcript": "Space: the final frontier. These are the voyages of the starship enterprise. It's five year misssion: to explore strange new worlds; to seek out new life and new civilizations; to boldly go where no man has gone before."
  },
  "logs": ">>> Running inference\nWARNING:root:Reference audio duration is 13.79 > max suggested ref audio. Expect quality degradations. We recommend you trim prompt to be shorter than max prompt length.\nNote: using deep clone. Assuming input `c_phones` is concatenated prompt and output phones. Also assuming no padded indices in `c_codes`.\nNew x: torch.Size([1, 2294, 8]) | new x_known: torch.Size([1, 2294, 8]) . Base prompt: torch.Size([1, 1035, 8]). New padding mask: torch.Size([1, 2294]) | m shape: torch.Size([1, 2294, 8])\n>>>>> Done with inference",
  "metrics": {
    "predict_time": 88.712041422,
    "total_time": 206.871967
  },
  "output": "/tmp/tmpi7sk5pv3/output.mp3",
  "started_at": "2024-06-24T06:31:19.676926Z",
  "status": "succeeded",
  "urls": {
    "get": "https://api.replicate.com/v1/predictions/h7njk9afxnrg80cg93qt648w2g",
    "cancel": "https://api.replicate.com/v1/predictions/h7njk9afxnrg80cg93qt648w2g/cancel"
  },
  "version": "c79ffec219ab9c9a7c1a430b506cf30ae0da2fb75ae4a6a8f486c99833bcddc3"
}

mattt commented 1 month ago

@platform-kit It looks like Path is re-bound to pathlib.Path on this line:

https://github.com/Render-AI/MARS5-TTS/blob/f27cd6e99ac08033ca04d5450a3d36433e85d9f7/cog/predict.py#L4

What happens if you change the return type of predict to cog.Path?
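The re-binding described above can be sketched without Cog installed. This is a toy model, not Cog's actual serialization code: `CogLikePath` is a hypothetical stand-in for cog.Path, `serialize_output` is an invented helper that treats only that type as an uploadable file, and the example.invalid URL is a placeholder.

```python
import pathlib


# Hypothetical stand-in for cog.Path so the sketch runs without Cog installed.
# Subclassing the concrete class of pathlib.Path() keeps this portable across platforms.
class CogLikePath(type(pathlib.Path())):
    pass


def serialize_output(value):
    """Toy serializer: only the stand-in type is treated as a file to upload."""
    if isinstance(value, CogLikePath):
        return f"https://example.invalid/uploads/{value.name}"
    return str(value)


# Two bindings of the same name, in the order that causes trouble: the second one wins.
Path = CogLikePath    # analogous to: from cog import Path
Path = pathlib.Path   # analogous to: from pathlib import Path (re-binds the name)

shadowed = serialize_output(Path("output.mp3"))
explicit = serialize_output(CogLikePath("output.mp3"))

print(shadowed)  # plain path string
print(explicit)  # placeholder URL
```

In a real predict.py, one way to avoid the clash is to import only one of the two names, or to refer to the Cog type explicitly as cog.Path in the return annotation, as suggested above.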

platform-kit commented 1 month ago

That was indeed the issue. Sorry for the hassle. Not sure if the mars5-tts issue is related to the error I quoted about https://replicate.com/camenduru/lgm - but for what it's worth, that same error came up at some point when I was migrating the mars5-tts repo to return Path. So maybe it is somehow related to whatever cog.Path does since v0.9.6.

Or maybe it's simply a user error caused by some kind of change in the docs? But then again maybe not, since it seems old models are breaking.

arnavmehta7 commented 1 month ago

Hey @mattt, when I try to use a file URL (from the default), Cog errors out with a URLFile error. It was all working fine a few weeks back; not sure what the issue is now. It also randomly says "remote end disconnected".

arnavmehta7 commented 1 month ago

file link: https://files.catbox.moe/be6df3.wav

error:


Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/types.py", line 201, in __wrapped__
    return object.__getattribute__(self, "__target__")
AttributeError: 'URLFile' object has no attribute '__target__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/urllib3/connection.py", line 464, in getresponse
    httplib_response = super().getresponse()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/urllib3/connectionpool.py", line 843, in urlopen
    retries = retries.increment(
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/urllib3/util/retry.py", line 474, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/urllib3/connection.py", line 464, in getresponse
    httplib_response = super().getresponse()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/server/runner.py", line 400, in _predict
    input_dict[k] = v.convert()
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/types.py", line 136, in convert
    shutil.copyfileobj(self.fileobj, dest)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/shutil.py", line 192, in copyfileobj
    fsrc_read = fsrc.read
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/types.py", line 186, in __getattr__
    return getattr(self.__wrapped__, name)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/types.py", line 204, in __wrapped__
    resp = requests.get(url, stream=True)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/requests/adapters.py", line 682, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
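The thread does not establish why the remote end closed the connection. If the disconnects are transient (an assumption), one possible workaround on the caller's side is to fetch the input file with retries before handing it to the model. The sketch below uses only the standard library; `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the fake fetcher simulates two dropped connections instead of contacting a real host.

```python
import http.client
import time


def fetch_with_retries(fetch, attempts=3, base_delay=0.0):
    """Call fetch() up to `attempts` times, retrying on the built-in ConnectionError."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts:
                raise
            # Exponential backoff; base_delay=0.0 keeps this example instant.
            time.sleep(base_delay * 2 ** (attempt - 1))


calls = {"n": 0}


def flaky_fetch():
    """Simulated download that drops the connection twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise http.client.RemoteDisconnected(
            "Remote end closed connection without response"
        )
    return b"RIFF"


data = fetch_with_retries(flaky_fetch)
print(calls["n"], data)
```

Note that http.client.RemoteDisconnected is a subclass of the built-in ConnectionError, whereas requests.exceptions.ConnectionError (the final exception in the traceback above) is not, so code built on requests would need to catch that class explicitly.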