tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

option to be able to use datasets behind a proxy #275

Closed tarrade closed 5 years ago

tarrade commented 5 years ago

Is your feature request related to a problem? Please describe. Right now, behind a proxy, it is not working:


ds_train = tfds.load(name="cats_vs_dogs", split=tfds.Split.TRAIN)

C:\Program Files\Anaconda3\envs\env_gcp_dl_2_0_ds\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    514                 raise SSLError(e, request=request)
    515
--> 516             raise ConnectionError(e, request=request)
    517
    518         except ClosedPoolError as e:
ConnectionError: HTTPConnectionPool(host='storage.googleapis.com', port=80): Max retries exceeded with url: /tfds-data/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000192F3D06668>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))

I don't think this is supported for now (I didn't see it in the documentation): https://www.tensorflow.org/datasets/api_docs/python/tfds/load

This will affect quite a lot of people working in companies and universities.

Describe the solution you'd like I am not an expert, but using requests seems to be the standard way. Below is an example from a Google GCP tool:

from google.cloud import storage
import os

# The 'xxx' values are placeholders for site-specific settings.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'xxx'
os.environ['HTTPS_PROXY'] = 'xxx'
os.environ['REQUESTS_CA_BUNDLE'] = '/xxx/xxx'
client = storage.Client()

Ignore GOOGLE_APPLICATION_CREDENTIALS, which is specific to GCP. The user needs to set up one or two environment variables and everything is done in the background (I guess this is using requests).

http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification
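
To illustrate the "done in the background" idea, here is a minimal sketch of how a client could turn those environment variables into the `proxies` and `verify` arguments that requests accepts. The helper name `requests_kwargs_from_env` and the example values are hypothetical; this is not how TFDS or the GCP client is actually implemented.

```python
import os


def requests_kwargs_from_env(environ=os.environ):
    """Hypothetical helper: collect proxy and CA settings from the environment."""
    kwargs = {}
    proxies = {}
    for scheme in ("http", "https"):
        # Accept both spellings, e.g. HTTPS_PROXY and https_proxy.
        value = environ.get(scheme.upper() + "_PROXY") or environ.get(scheme + "_proxy")
        if value:
            proxies[scheme] = value
    if proxies:
        kwargs["proxies"] = proxies
    ca_bundle = environ.get("REQUESTS_CA_BUNDLE")
    if ca_bundle:
        # requests takes a path to a CA bundle in its `verify` argument.
        kwargs["verify"] = ca_bundle
    return kwargs


# Example values are placeholders, not real endpoints.
example_env = {
    "HTTPS_PROXY": "http://proxy.example:3128",
    "REQUESTS_CA_BUNDLE": "/path/to/ca.pem",
}
settings = requests_kwargs_from_env(example_env)
print(settings)
```

The resulting dictionary could be passed along as `requests.get(url, **settings)`. In practice requests already reads these variables by itself, so the sketch only makes the implicit step visible.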

captain-pool commented 5 years ago

@rsepassi @Conchylicultor @cyfra Is this issue fixed? If not, can you assign it to me?

Conchylicultor commented 5 years ago

It should be fixed with @captain-pool's contribution #488.

tarrade commented 5 years ago

Thanks @captain-pool

"To Configure the Proxy Settings, The User needs to set the Proxies for HTTP, HTTPS and FTP in the Environment Variables TFDS_HTTP_PROXY, TFDS_HTTPS_PROXY, TFDS_FTP_PROXY respectively."

Do you also have an option to pass a CA certificate for SSL?

Right now it is crashing with:

requests.exceptions.SSLError: HTTPSConnectionPool(host='zenodo.org', port=443): Max retries exceeded with url: /record/53169/files/Kather_texture_2016_image_tiles_5000.zip (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))

This is typical of SSL interception: you need to set SSL verify to false (if possible) or simply pass the CA certificate. Did you implement a REQUESTS_CA_BUNDLE environment variable as well? Which library is used in your implementation? requests?
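
The two options mentioned here (turning verification off, or trusting a custom CA certificate) can be sketched with the standard `ssl` module. The helper name `make_ssl_context` is hypothetical and this is only an illustration of the trade-off, not the code used by the TFDS downloader.

```python
import ssl


def make_ssl_context(ca_bundle=None, verify=True):
    """Hypothetical helper: build an SSL context for a download client."""
    if not verify:
        # Equivalent of "SSL verify false": no certificate or hostname checks.
        context = ssl.create_default_context()
        context.check_hostname = False  # must be disabled before CERT_NONE
        context.verify_mode = ssl.CERT_NONE
        return context
    # With cafile set, the given bundle is trusted instead of the system
    # defaults; with cafile=None the system CA certificates are loaded.
    return ssl.create_default_context(cafile=ca_bundle)


strict = make_ssl_context()
insecure = make_ssl_context(verify=False)
print(strict.verify_mode, insecure.verify_mode)
```

Passing the company CA bundle keeps verification on, which is why it is usually preferable to disabling checks when a proxy intercepts SSL.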

captain-pool commented 5 years ago

Hey @tarrade, the downloader uses both requests and urllib. Sorry, I totally missed the feature request for a CA file; I only made it flexible for proxies. I will add support for CA certificates ASAP.

captain-pool commented 5 years ago

@Conchylicultor should I skip certificate verification by passing CERT_NONE from ssl, or should I add an option for supplying a certificate file?

tarrade commented 5 years ago

Hi @captain-pool, no problem. I know it is only companies that use a proxy and a CA certificate, and we suffer from that every day. I will be happy to test it when you have it ready. Just tell me which nightly build it lands in. Thanks.

captain-pool commented 5 years ago

Can you reopen the issue?

tarrade commented 5 years ago

On my side I cannot reopen this ticket. By the way, I forgot to thank you for implementing this request. You will help a lot of people using company laptops.

captain-pool commented 5 years ago

@tarrade I think #663 should fix this. Give it a check.

captain-pool commented 5 years ago

cc: @Conchylicultor

tarrade commented 5 years ago

Hi @captain-pool, it seems the build failed, right? https://source.cloud.google.com/results/invocations/e044b82b-65e9-4b34-9d5f-abd96aaba0a8/targets/tensorflow_datasets%2Fgh_testing%2Fpresubmit/log

I tested with 1.0.2.dev201906110105 but it is still failing with "bad handshake".

If the fix is already in 1.0.2.dev201906110105, then I will investigate whether I have all CA certificates in my file.

captain-pool commented 5 years ago

@tarrade it is failing because I'm using an SSL context, which is supported from Python 2.7.9; however, Kokoro is using a version <= Python 2.7.8, which doesn't allow that. Let me find an alternative; I will fix it soon. @Conchylicultor @rsepassi @vbardiovskyg @cyfra is it possible to upgrade Kokoro's Python 2 configuration to Python 2.7.9?

cyfra commented 5 years ago

@rsepassi is the expert here, but from what I see it might not be that easy :-( We'd have to move from the "common" Kokoro cluster/image to a custom one (and pay the cost of managing it).

I see other places in our code where we had to add workarounds in the past to accommodate the fact that Linux machines on Kokoro use 2.7.8.

Would it make sense to have this feature "disabled" when running on an old Python version?

captain-pool commented 5 years ago

Done :) Disabling it for Python versions <= 2.7.8 seemed like the only valid way out. The builds are passing. @tarrade after @cyfra verifies and merges, it should be ready :)
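
A version guard of this kind could look roughly like the sketch below. The names `MIN_SSL_CONTEXT_VERSION` and `ca_bundle_supported` are hypothetical and the actual check in the pull request may differ; the only fact taken from the thread is that the feature needs Python 2.7.9 or newer.

```python
import sys

# Per the discussion above, the SSL context approach needs Python >= 2.7.9.
MIN_SSL_CONTEXT_VERSION = (2, 7, 9)


def ca_bundle_supported(version_info=sys.version_info):
    """Hypothetical check: True if the interpreter can use the CA bundle feature."""
    return tuple(version_info[:3]) >= MIN_SSL_CONTEXT_VERSION


if ca_bundle_supported():
    print("CA bundle support enabled")
else:
    print("CA bundle support disabled on this Python version")
```

Comparing version tuples avoids string parsing, and taking `version_info` as a parameter makes the guard easy to exercise for old interpreters without running them.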

tarrade commented 5 years ago

@captain-pool good idea to disable it for Python versions <= 2.7.8. I am quite new to this: how can I see which build this fix landed in? Is it already in tfds-nightly==1.0.2.dev201906120105, or should I wait for tomorrow's build?

captain-pool commented 5 years ago

You need to wait until it is merged into the master branch; you can then get it in the nightly build the next day.

Or, if it is too urgent, you can clone it from my fork, cd into the local repository, and git checkout issue_275. Finally, run one of:

  1. "pip install ." OR
  2. "python setup.py install"

Either one will do the job :)

tarrade commented 5 years ago

I tested the latest build, 1.0.2.dev201906180105, and I confirm that it works with a proxy and a CA certificate.

Here are my test and setup:

export TFDS_HTTPS_PROXY="http://user:password@ip:port/"
export TFDS_CA_BUNDLE=path/ca_certs

It works for the following datasets:

dataset = tfds.load(name="colorectal_histology_large", split=tfds.Split.TEST)
dataset = tfds.load(name="colorectal_histology", split=tfds.Split.TRAIN)

I get some crashes when the dataset is on AWS:

tfds.load(name="fashion_mnist", split=tfds.Split.TRAIN)

requests.exceptions.ConnectionError: HTTPConnectionPool(host='fashion-mnist.s3-website.eu-central-1.amazonaws.com', port=80): Max retries exceeded with url: /train-images-idx3-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f22bc27df98>: Failed to establish a new connection: [Errno 110] Connection timed out',))

I don't know what the issue with AWS is. I can download the file manually. I need to retry later; I am at a conf with a not-so-great network.

Overall it is working. The question is on which side the AWS issue lies.

tarrade commented 5 years ago

Of course, I need to set both:

export TFDS_HTTPS_PROXY="http://user:password@ip:port/"
export TFDS_HTTP_PROXY="http://user:password@ip:port/"

and then everything is working fine.

All is working perfectly. Thanks @captain-pool . Closing
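
Why both variables were needed can be sketched as follows: the fashion_mnist download above uses a plain http URL (port 80), so an HTTPS-only proxy setting does not cover it. The helper `proxy_for_url` is hypothetical and assumes the proxy is chosen by URL scheme; the thread does not show how TFDS actually selects it.

```python
from urllib.parse import urlparse


def proxy_for_url(url, environ):
    """Hypothetical helper: pick the TFDS_*_PROXY variable matching the URL scheme."""
    scheme = urlparse(url).scheme.upper()  # "HTTP", "HTTPS" or "FTP"
    return environ.get("TFDS_%s_PROXY" % scheme) or None


# Only the HTTPS proxy is set, as in the first test above.
env = {"TFDS_HTTPS_PROXY": "http://user:password@ip:port/"}

https_url = "https://zenodo.org/record/53169/files/Kather_texture_2016_image_tiles_5000.zip"
http_url = "http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz"

print(proxy_for_url(https_url, env))  # proxy found for the https download
print(proxy_for_url(http_url, env))   # no proxy for the http download
```

Under that assumption, the http download goes out directly and times out behind the company firewall until TFDS_HTTP_PROXY is set as well.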

shm007g commented 2 years ago

Use this:

import tensorflow_datasets as tfds

# Disable SSL certificate verification for downloads; otherwise you get a request error.
dl_config = tfds.download.DownloadConfig(verify_ssl=False)
examples, metadata = tfds.load('cnn_dailymail', with_info=True,
                               as_supervised=True,
                               download_and_prepare_kwargs={'download_config': dl_config})