markhoerth / dremio_client

Apache License 2.0
31 stars 25 forks source link

how to get schema of the dataset using dremio flight connector #136

Closed vikashsingh009 closed 3 years ago

vikashsingh009 commented 4 years ago

Description

I am trying to ser dremio flight client to query flight data,but data what i recieved from the dremio flight is missing schema ,i need schema to bind on the map ,i able to get same scheam with Data using dremio simple python client

What I Did

i have written below code for python for flight

from flask_cors import CORS
import pandas as pd
#from dremio_client import init
from dremio_client.dremio_client import query
#import pandas as pd
import json
import urllib3
# Load an empty map
from keplergl import KeplerGl
from flask import Flask, request, jsonify
app = Flask(__name__)
app.config["DEBUG"] = True

CORS(app)

@app.route('/api/', methods=['POST', 'GET'])
def api_post():
    if request.method == 'POST':
        print('post app')
        req = request.json
        username = 'visur'
        password = '**************'
        hostname = '************'
        port = '47470'
        sql = '''select * from visurtest_Space.Datasets.FacilityCodeProductionDayRecord'''
        tlsfilename = 'ba04da291ec01ee6.crt'
        output = query(sql =sql, hostname=hostname, port=port, username=username, password=password,tls_root_certs_filename=tlsfilename)
        pandaataframe=pd.DataFrame(output)
        x = json.dumps(json.loads(pandaataframe.to_json(orient='records')), indent=2)
        print(x)
        df1 = output.to_json()
        y = json.loads(df1)
        return x
    else:
        return 'not OK'
app.run(host='0.0.0.0', port=5000

on the above code i am just getting data which i converted to json array,but i need to get data as well as Schema so that i can use in maps

here is the desired output i want to achive

image

rymurr commented 4 years ago

sorry for the delay @vikashsingh009

The schema information exists in the flight info found here: https://github.com/rymurr/dremio_client/blob/master/dremio_client/flight/__init__.py#L83

Probably the easiest way to get the schema would to call the flight pyarrow functions directly rather than use the dremio client helper function. Unless you want to add a function to fetch the schema, happy to accept PRs :-)

vikashsingh009 commented 4 years ago

@rymurr ,i have followed above steps to get flight info schema ,i able to get the schema and trying to convert it to json
and send back to rest api response for that i have written following rest api

from flask_cors import CORS
from dremio_client.dremio_client import query
from pyarrow import flight
import json
import pandas as pd
import pyarrow as pa
# Load an empty map
from keplergl import KeplerGl
from flask import Flask, request, jsonify
app = Flask(__name__)
app.config["DEBUG"] = True

CORS(app)
class HttpDremioClientAuthHandler(flight.ClientAuthHandler):

  @app.route('/api/', methods=['POST', 'GET'])  
  def api_post():
    if request.method == 'POST':
            print('post app')
            username = ''
            password = ''
            hostname = ''
            port = '47470'
            sql = '''select * from Test_Space.Datasets.FacilityCode'''
            tlsfilename = 'ba04da291ec01ee6.crt'
            #output = query(sql =sql, hostname=hostname, port=port, username=username, password=password,tls_root_certs_filename=tlsfilename)
            output = query(token= None,sql =sql, hostname=hostname, flight_port=port, username=username, password=password,odbc_port=None,ssl_verify=False)
            pandaataframe=pd.DataFrame(output)
            x = json.dumps(json.loads(pandaataframe.to_json(orient='records')), indent=2)
            client = flight.FlightClient('grpc+tcp://localhost:47470')
            client.authenticate(HttpDremioClientAuthHandler(username, password)) 
            info = client.get_flight_info(flight.FlightDescriptor.for_command(sql))
            reader = client.do_get(info.endpoints[0].ticket)
            batches = []
    while True:
        try:
             batch, metadata = reader.read_chunk()
             batches.append(batch)
        except StopIteration:
          break
        data = pa.Table.from_batches(batches)
            #panda data frame
        pdf = data.to_pandas()
        schema = pa.Schema.from_pandas(pdf)
        print(schema)
        schemadata= json.dumps(info.schema)
        print(info.schema)
        jsondata ={"rows":x,"schema":schemadata}
        return  jsonify(jsondata)
    else:
         return 'not OK'

    def __init__(self, username, password):
        super(flight.ClientAuthHandler, self).__init__()
        self.basic_auth = flight.BasicAuth(username, password)
        #self.tls_certificates = 'ba04da291ec01ee6.crt'
        self.token = None

    def authenticate(self, outgoing, incoming):
        auth = self.basic_auth.serialize()
        outgoing.write(auth)
        self.token = incoming.read()
        return self.token

    def get_token(self):
        return self.token

app.run(host='0.0.0.0', port=5000)    

But ,when i am sending request on API i am geeting below error

TypeError: Object of type Schema is not JSON serializable

image

rymurr commented 4 years ago

Hey @vikashsingh009

The schema object cant be converted to json and isn't really suitable as a json representation: https://lists.apache.org/thread.html/029b486810af851e7112488ae95e9add59919be826b5ad794b993614%40%3Cdev.arrow.apache.org%3E

So I would try and figure out a different format to pass the schema or hand convert it using https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html

vikashsingh009 commented 4 years ago

thanks @rymurr i am eagerly waiting for your response