decoupleR::get_collectri 403 error

mfranke-2 commented 4 months ago

Hi, I have been using CollecTRI successfully until recently, when I started receiving the following error:

decoupleR::get_collectri(organism="human", split_complexes=FALSE)

[2024-06-04 21:20:36] [WARN] [OmnipathR] HTTP 403
[2024-06-04 21:20:36] [WARN] [OmnipathR] Failed to download "https://www.ensembl.org/info/about/species.html" (attempt 1/3); error: HTTP 403
[2024-06-04 21:20:42] [WARN] [OmnipathR] HTTP 403
[2024-06-04 21:20:42] [WARN] [OmnipathR] Failed to download "https://www.ensembl.org/info/about/species.html" (attempt 2/3); error: HTTP 403
[2024-06-04 21:20:47] [WARN] [OmnipathR] HTTP 403
[2024-06-04 21:20:47] [ERROR] [OmnipathR] Failed to download "https://www.ensembl.org/info/about/species.html" (attempt 3/3); error: HTTP 403
Error in 'map_int()':
In index: 1.
Caused by error in 'map_int()':
In index: 1.
Caused by error:
! HTTP 403

Any help would be greatly appreciated!

Best, Megan

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] dplyr_1.1.4 dorothea_1.14.1 OmnipathR_3.13.4 decoupleR_2.9.7

smuellerd commented 4 months ago

@deeenes, could you have a look at this?

mfranke-2 commented 3 months ago

Hi! I just wanted to follow up on this. It seems that the Ensembl server checks certain headers to determine whether a request comes from a browser or from code, and it blocks requests from code. Maybe adding a headers argument when reading the .html file could fix this?

deeenes commented 3 months ago

@mfranke-2 Thanks for the tip, I'll look into it! That said, I just managed to download it with the default headers of command-line curl. Something weird is going on with ensembl.org: since the update of their SSL certificate last month, the CI under OSX has been failing. It's quite possible the two issues are the same.

Did you experience the issue at different times, on different networks and computers, or did it happen only once? Have you tried it again since then?

mfranke-2 commented 3 months ago

Interesting, yes I think you're right that the two issues are related or the same! In R, something as simple as the following yields the 403 error for me:

download.file("https://useast.ensembl.org/info/about/species.html", destfile = "test.txt", headers = c("User-Agent" = "My Custom User Agent"))

but changing the header argument fixes the issue:

download.file("https://useast.ensembl.org/info/about/species.html", destfile = "test.txt", headers = c("User-Agent" = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15"))

I've tried it on two different computers and received the same error. I also tried to run again this morning and again had the error.
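
To compare several User-Agent strings in one go, a small helper along these lines may be handy. It is only a sketch: `probe_user_agents` and its `fetch` argument are hypothetical names, not part of OmnipathR, and the default `fetch` assumes the httr package is installed.

```r
# Hypothetical helper: request the same URL with different User-Agent strings
# and collect the HTTP status code returned for each of them.
probe_user_agents <- function(
    url,
    agents,
    # Default fetcher (assumes httr is installed); returns the status code.
    fetch = function(url, ua) {
        httr::status_code(httr::GET(url, httr::user_agent(ua)))
    }
) {
    codes <- vapply(
        agents,
        function(ua) as.integer(fetch(url, ua)),
        integer(1)
    )
    # Name the result by user agent so the outcome is easy to read.
    stats::setNames(codes, agents)
}

# Example call (needs network access; the result depends on the server):
# probe_user_agents(
#     "https://useast.ensembl.org/info/about/species.html",
#     c("My Custom User Agent", "curl")
# )
```

Passing a custom `fetch` function makes it possible to try the helper without touching the network, or to swap in another HTTP client.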

deeenes commented 3 months ago

Hi @mfranke-2, Many thanks for finding this out! Now I played around a bit, and I didn't manage to reproduce the issue from within the package. Still, it's clear that ensembl.org gives the HTTP 403 error depending on the User-Agent. In OmnipathR, httr::GET performs the download, which relies on curl under the hood. Its default user agent is libcurl/8.5.0 r-curl/5.2.1 httr/1.4.7. I've tried several things, but long story short, apparently Ensembl accepts the requests if it sees certain keywords in the user agent, such as "curl":

r <- download.file(
    "https://useast.ensembl.org/info/about/species.html",
    destfile = "test.txt",
    headers = c("User-Agent" = "curl"),
    method = 'libcurl'
)
trying URL 'https://useast.ensembl.org/info/about/species.html'
downloaded 238 KB

r <- download.file(
    "https://useast.ensembl.org/info/about/species.html",
    destfile = "test.txt",
    headers = c("User-Agent" = "cur"),
    method = 'libcurl'
)
trying URL 'https://useast.ensembl.org/info/about/species.html'
Error in download.file("https://useast.ensembl.org/info/about/species.html",  : 
  cannot open URL 'https://useast.ensembl.org/info/about/species.html'
In addition: Warning messages:
1: In download.file("https://useast.ensembl.org/info/about/species.html",  :
  downloaded length 0 != reported length 0
2: In download.file("https://useast.ensembl.org/info/about/species.html",  :
  cannot open URL 'https://useast.ensembl.org/info/about/species.html': HTTP status was '403 Forbidden'

The behavior also depends on the mirror. The useast mirror reproduces the error, while the www one doesn't:

r <- download.file(
    "https://www.ensembl.org/info/about/species.html",
    destfile = "test.txt",
    headers = c("User-Agent" = "cur"),
    method = 'libcurl'
)
trying URL 'https://www.ensembl.org/info/about/species.html'
downloaded 238 KB

OmnipathR uses the latter (www). I don't know whether a redirect to a regional mirror can happen at all, though the appearance of this error suggests it does.
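
One way to check whether a redirect happens is to compare the requested host with the host of the final URL. A rough sketch, assuming httr is installed for the commented-out request; `url_host` and `was_redirected` are hypothetical helper names, and `resp$url` is the final URL that httr reports after following redirects:

```r
# Extract the host part of an http(s) URL using base R only.
url_host <- function(url) {
    sub("^[a-zA-Z]+://([^/:?#]+).*$", "\\1", url)
}

# TRUE if the final URL is served from a different host than the requested one.
was_redirected <- function(requested, final) {
    !identical(tolower(url_host(requested)), tolower(url_host(final)))
}

# Example (needs network access, so it is left commented out):
# resp <- httr::GET("https://www.ensembl.org/info/about/species.html")
# was_redirected("https://www.ensembl.org/info/about/species.html", resp$url)
```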

As a solution, I set OmnipathR to use a browser-like user agent in all queries to Ensembl, configurable by options("omnipath.user_agent").
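
For illustration only, an option-driven user agent could look roughly like the sketch below. This is not the actual OmnipathR code: `default_browser_ua`, `ensembl_user_agent` and `ensembl_get` are hypothetical names, the default string is an arbitrary browser-like value, and the wrapper assumes httr is installed. Only the option name is taken from the comment above.

```r
# Fallback used when the option is unset (an illustrative value only).
default_browser_ua <- paste(
    "Mozilla/5.0 (X11; Linux x86_64; rv:127.0)",
    "Gecko/20100101 Firefox/127.0"
)

# Hypothetical helper: read the user agent from the option named above,
# falling back to the browser-like default.
ensembl_user_agent <- function() {
    getOption("omnipath.user_agent", default = default_browser_ua)
}

# Hypothetical download wrapper (assumes httr is installed).
ensembl_get <- function(url) {
    httr::GET(url, httr::user_agent(ensembl_user_agent()))
}

# Overriding the user agent for the current session:
# options(omnipath.user_agent = "curl")
```

Reading the value at call time, rather than once at load time, means a change made with options() takes effect on the next request.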

mfranke-2 commented 3 months ago

@deeenes It works again! Thank you so much!

saezlab / CollecTRI

decoupleR::get_collectri 403 error #19