sokrypton / ColabFold

Making Protein folding accessible to all!

Timeout api.colabfold.com server #606

Open rukibuki opened 3 months ago

rukibuki commented 3 months ago

Lately, when we try to submit multiple jobs (max 50 per run) to api.colabfold.com (via the AlphaPulldown package using MMseqs2), we are hit with:

```
W0416 18:39:07.143900 139828968597312 colabfold.py:86] Timeout while submitting to MSA server. Retrying...
```

This happens for all of the runs, and none of them are able to connect within hours (I canceled the run after 10 hours).

While such a job is running, `nmap -Pn -p 80` (or `-p 443`) `api.colabfold.com` shows that ports 80 and 443 are filtered:

```
PORT   STATE    SERVICE
80/tcp filtered http
```

Our IT department has informed us that they are not filtering port 443 or 80, which is also what we see when the above job is not running; then we get (shown here for 443, but the same for 80):

```
PORT    STATE SERVICE
443/tcp open  https
```

Today I tried submitting 50 jobs again and hit the same problem, but when I instead submitted one job at a time, the server did not throw the timeout error.

So, is there a maximum number of jobs we can submit simultaneously? If so, what is that number? Would it be possible to have our IP whitelisted so that we can submit larger batches than whatever the limit is?

Please let me know if you need any other information from me.

milot-mirdita commented 3 months ago

Could you share (or email me) the IP address you are sending from?

Generally it should not time out; it should instead instantly return an HTTP 403 or 429 error if you are banned or temporarily banned.
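A quick way to distinguish a genuine network timeout from a ban or rate limit is to probe the endpoint directly. The sketch below is illustrative only, not ColabFold code; it assumes the `requests` package and uses the public `/queue` endpoint that comes up later in this thread:

```python
# Diagnostic sketch: probe the ColabFold API and report whether we get a
# clean HTTP status (403 = banned, 429 = rate-limited, both returned
# instantly) or a genuine network-level failure.
import requests

def probe(url: str = "https://api.colabfold.com/queue", timeout: float = 10.0) -> None:
    try:
        resp = requests.get(url, timeout=timeout)
        print(f"HTTP {resp.status_code}: {resp.text[:200]}")
    except (requests.exceptions.Timeout, requests.exceptions.ConnectionError) as e:
        # No HTTP response at all: a DNS, firewall, or routing problem,
        # not a server-side ban.
        print(f"Network-level failure: {e}")

if __name__ == "__main__":
    probe()
```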

rukibuki commented 3 months ago

Yes, certainly. Our outgoing IP should be: 130.225.18.30

milot-mirdita commented 3 months ago

I don’t think I have had to ban a Danish IP before, so I don’t think that’s the problem (not in front of a computer to check right now, though).

What does `dig api.colabfold.com` say (when executed from the failing compute node)?

It’s most likely a DNS error; no idea why, though.

rukibuki commented 3 months ago

```
[rtk@vader9 ~]$ dig api.colabfold.com

; <<>> DiG 9.11.36-RedHat-9.11.36-11.el8_9 <<>> api.colabfold.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62568
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;api.colabfold.com.             IN      A

;; ANSWER SECTION:
api.colabfold.com.      60      IN      A       147.46.145.74

;; Query time: 473 msec
;; SERVER: 10.83.252.137#53(10.83.252.137)
;; WHEN: Wed Apr 17 10:54:02 CEST 2024
;; MSG SIZE  rcvd: 62
```

milot-mirdita commented 3 months ago

I don't see any reason why it should time out. The DNS response also looks fine.

Does `curl https://api.colabfold.com/queue` work?

rukibuki commented 3 months ago

```
[rtk@vader9 ~]$ curl https://api.colabfold.com/queue
{"queued":0}
```

So yes, it seems to work fine. I have now tried submitting 5 runs at a time without any problems. I might edge this upward each time to see where the limit is.

So it has nothing to do with our local IT department. Does it look like a potential DDoS attack or something like that when I submit 50 jobs at once? Or is that standard practice, or maybe even a low number of runs compared to other users?

milot-mirdita commented 3 months ago

If you submit 50 jobs at once, you should start getting HTTP 429 errors, which ColabFold understands and will automatically retry later.

It should never time out. That behavior is very puzzling.

I have not asked our network management team, but I would not expect this to be an issue, since there are heavier API users than this.
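To make the intended client behavior concrete, here is a minimal sketch of retry-on-429 with a short sleep, in the spirit of the "Sleeping for Ns. Reason: RATELIMIT" log lines quoted below. It is a simplified illustration, not ColabFold's actual implementation; the URL and payload are placeholders:

```python
# Illustrative retry loop: on HTTP 429 (rate-limited), sleep briefly and
# resubmit; any other error status is raised. Not ColabFold's actual code.
import random
import time

import requests

def submit_with_retry(url: str, payload: dict, max_attempts: int = 10) -> dict:
    for _ in range(max_attempts):
        resp = requests.post(url, data=payload, timeout=30)
        if resp.status_code == 429:
            delay = random.randint(5, 10)  # randomized backoff, in seconds
            print(f"Sleeping for {delay}s. Reason: RATELIMIT")
            time.sleep(delay)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("Gave up after repeated rate-limit responses")
```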

rukibuki commented 3 months ago

We normally saw this:

```
I0403 14:05:12.905370 140497882613568 objects.py:208] input is features/Q96DT5.a3m
  0%| | 0/150 [elapsed: 00:00 remaining: ?]
SUBMIT:   0%| | 0/150 [elapsed: 00:00 remaining: ?]
E0403 14:05:14.012090 140497882613568 colabfold.py:164] Sleeping for 8s. Reason: RATELIMIT
E0403 14:05:22.915350 140497882613568 colabfold.py:164] Sleeping for 5s. Reason: RATELIMIT
```

But if we are not among the heaviest API users with 50 calls, I will try to increase the 5 runs to maybe 10 and see whether that works. 10 should be more than enough for now.

milot-mirdita commented 3 months ago

Ah, that makes more sense. That's not a timeout, but a rate limit and intended behavior.

The way the system currently works is that you get 20 "tokens" for job submissions, and tokens are replenished at a rate of about 0.0111 per second (i.e., 1 per 90 s), at which point you can submit another job. The bucket doesn't replenish above 20 tokens.

Thus you can use the API for roughly 40-60 MSAs per hour: the refill rate sustains 40 per hour (3600 s / 90 s), plus the initial burst of up to 20 tokens.
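As a sketch of that token-bucket scheme (illustrative only; the names and structure here are an assumption, not the server's actual implementation):

```python
# Token bucket as described above: capacity 20, refilled at 1 token per
# 90 s; each job submission consumes one token, and a submission against
# an empty bucket would be answered with HTTP 429. Illustrative sketch.
import time

class TokenBucket:
    def __init__(self, capacity: float = 20.0, refill_per_s: float = 1.0 / 90.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = capacity          # starts full: the initial burst of 20
        self.last = time.monotonic()

    def try_consume(self) -> bool:
        now = time.monotonic()
        # Replenish based on elapsed time, but never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True                 # submission accepted
        return False                    # the client would see a 429 here
```

Under this scheme, submitting 50 jobs at once drains the 20 starting tokens immediately and leaves the remaining 30 to trickle through at one per 90 s, which matches the RATELIMIT sleeps shown earlier.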

We have the `colabfold_search` script for running searches locally, so you can compute more MSAs on your own resources. I am not sure how AlphaPulldown handles local searches, but I think they also have something to run MMseqs2 locally.

rukibuki commented 3 months ago

So what I wrote in my last comment was what we normally saw when submitting 50 runs at a time. But what we got recently was what I wrote in the original post: a timeout, with the run left idle for a long time. Sorry for the confusion!

What you just wrote about the 20 tokens and their replenishment makes a lot of sense for what we normally see.

For now, the timeout problem is not an issue as long as we don't go too high in run numbers.