minio / minio-py

MinIO Client SDK for Python
https://docs.min.io/docs/python-client-quickstart-guide.html
Apache License 2.0

loading the content from get_object() into a dataframe without saving the file #1312

Closed JMT800 closed 1 year ago

JMT800 commented 1 year ago

Hello Minio Community,

I am opening this issue because I was trying to load a CSV file stored in a MinIO bucket into a pandas dataframe using the get_object() function of the minio package.

My code looks like the following:

    response = client.get_object(bucket_name="bucket-name", object_name="filename.csv")
    df = pd.read_csv(io.BytesIO(response.data))

However, I am getting an error indicating that the file is empty:

    pandas.errors.EmptyDataError: No columns to parse from file

PS: I am sure that the file is not empty, because I also tried saving it first via fget_object() and loading the saved file into a dataframe, and that worked. However, my goal is to load the data into a dataframe without saving it first.

Any help is appreciated. Cheers

balamurugana commented 1 year ago

Refer to https://urllib3.readthedocs.io/en/stable/reference/urllib3.response.html#urllib3.response.BaseHTTPResponse for how to read data from the response. If pd.read_csv() doesn't work with a stream, you have no option other than reading the entire data into memory or using a temporary file.
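The "read the entire data into memory" option can be sketched roughly as follows. This is only an illustration, not the library's prescribed method: `read_object_to_buffer` is a hypothetical helper name, and `client` is assumed to be a `minio.Minio` instance (any object with a compatible `get_object()` would do).

```python
import io


def read_object_to_buffer(client, bucket_name, object_name):
    """Return the object's content as an in-memory buffer.

    Hypothetical helper: `client` is assumed to be a minio.Minio instance
    whose get_object() returns a urllib3 response.
    """
    response = client.get_object(bucket_name=bucket_name, object_name=object_name)
    try:
        # read() without a size argument returns the whole remaining body as bytes.
        return io.BytesIO(response.read())
    finally:
        # Always release the underlying connection, even if reading fails.
        response.close()
        response.release_conn()
```

The returned buffer could then be handed to pandas, for example `pd.read_csv(read_object_to_buffer(client, "bucket-name", "filename.csv"))`, so nothing is written to disk.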

JMT800 commented 1 year ago

Thanks for the quick reply @balamurugana. I don't think it is a problem with pd.read_csv(). Whenever I try to read the object returned by client.get_object(), using response.data or response.json() as described at the above URL, the response seems to be empty. Hence I think it might be a problem with the get_object() function. Have you ever encountered such a dilemma with get_object()?

thanks in advance,

JMT800 commented 1 year ago

Hello, yes, the solution was to use cache_content=True with the response.read() function, as follows:

    from io import BytesIO

    import pandas as pd

    # `client` is an existing Minio client and `name` is the object name.
    response = client.get_object(bucket_name="bucketname", object_name=name)
    try:
        # Read the whole body; cache_content=True also stores it in response.data.
        data = response.read(cache_content=True)
        df = pd.read_csv(BytesIO(data), sep=";")
    finally:
        response.close()
        response.release_conn()

thanks and cheers