tjmlabs / ColiVara


Cloud only: Proxy for URLs #76

Open Jonathan-Adly opened 2 weeks ago

Jonathan-Adly commented 2 weeks ago

I will keep this issue open, as I haven't fully decided between won't-fix, a proxy URL, and committing to take whatever URL the user sends. Most medical publishers have Cloudflare enabled, which still blocks requests even through a proxy (even when the study is open and free to download).

Proxies aren't that useful because of small file limitations, and committing to take whatever URL the user sends means that, instead of focusing on AI pipelines, we would be playing grayhat against Cloudflare.

For now - I added better error messages so people know why a download didn't work.

Jonathan-Adly commented 1 week ago

I spoke with the folks at Scraper API. They will support any ongoing issues for us, so we can commit to accepting whatever URLs users send.

@Abdullah13521 Here is the plan.

  1. We will combine get_url_info and fetch_document into a single GET call, since a HEAD request is not well supported and optimizing for edge cases is over-engineered anyway. This will happen in the main branch.
  2. We will add an optional proxy URL in the settings - also in main. If not provided, it will be None.
  3. In upsert, we will have an optional use_proxy boolean - defaults to False.
  4. If use_proxy is True, we will use the proxy URL from the settings.
  5. Aiohttp's support for proxies is really good:

# Example: proxy = "http://scraperapi.ultra_premium=true:my-api-key@proxy-server.scraperapi.com:8001"
# proxy can also be None; aiohttp won't throw any errors in that case.
async with aiohttp.ClientSession() as session:
    async with session.get(url, proxy=proxy, ssl=False) as response:
        ...
  6. If SENTRY_DSN is enabled and the download failed, kick it over to Sentry so we can look into it and get support for it (see the sketch after this list).
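
Putting the plan together, here is a minimal sketch of what the combined fetch could look like. This is an illustration under assumptions, not the final implementation: the function name fetch_document, the settings attributes PROXY_URL and SENTRY_DSN, and the use of Django-style settings are hypothetical placeholders.

import aiohttp
import sentry_sdk

from django.conf import settings  # hypothetical location of PROXY_URL / SENTRY_DSN

async def fetch_document(url: str, use_proxy: bool = False) -> bytes:
    # Step 1: one GET call replaces the old get_url_info + fetch_document pair.
    # Steps 2-4: the proxy from settings is only applied when the caller opts in;
    # passing proxy=None is harmless for aiohttp.
    proxy = settings.PROXY_URL if use_proxy else None
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, proxy=proxy, ssl=False) as response:
                response.raise_for_status()
                return await response.read()
    except aiohttp.ClientError as exc:
        # Step 6: surface failed downloads to Sentry so we can investigate
        # and get support from Scraper API.
        if settings.SENTRY_DSN:
            sentry_sdk.capture_exception(exc)
        raise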

On cloud - if use_proxy is used, we will add 10 credits to usage (so, number of pages + 10).
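
For clarity, the credit math above amounts to something like the following sketch (credits_for_upsert and PROXY_SURCHARGE_CREDITS are made-up names for illustration):

PROXY_SURCHARGE_CREDITS = 10  # flat fee added whenever the proxy is used

def credits_for_upsert(num_pages: int, use_proxy: bool) -> int:
    # Usage is billed per page, plus a flat surcharge for proxied requests.
    return num_pages + (PROXY_SURCHARGE_CREDITS if use_proxy else 0)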

Let me know if you have any questions.