traindb-project / traindb-ml

Remote ML Model Serving Component for TrainDB
Apache License 2.0

Design the interface for DB connection during learning stage #2

Closed by kihyuk-nam 2 years ago

kihyuk-nam commented 2 years ago

[AS-IS] (DBMS) --> CSV --> Learn

[TO-BE] DBMS or CSV --> Learn

Related to: https://github.com/traindb-project/traindb/issues/11#issue-1241519433

See: Architecture, Execution Flows
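As a rough illustration of the TO-BE flow, the sketch below reads training rows from either a CSV file or a DBMS through one entry point. The name `read_rows` is hypothetical and not part of traindb-ml. It assumes the DBMS side is reached through a DB-API 2.0 connection, which psycopg2 and pymysql both provide.

```python
import csv
from typing import Any, Iterator, Optional, Tuple


def read_rows(source: Any, query: Optional[str] = None) -> Iterator[Tuple[Any, ...]]:
    """Yield rows from a CSV path (str) or from a DB-API 2.0 connection.

    Hypothetical helper: CSV values come back as strings, while DBMS values
    keep the types reported by the driver.
    """
    if isinstance(source, str):
        # AS-IS path: read an exported CSV file and skip its header line.
        with open(source, newline="") as f:
            reader = csv.reader(f)
            next(reader, None)
            for row in reader:
                yield tuple(row)
    else:
        # TO-BE path: query the DBMS directly, with no intermediate CSV.
        if query is None:
            raise ValueError("a query is required when reading from a DBMS")
        cur = source.cursor()
        try:
            cur.execute(query)
            # fetchone() returns None once the result set is exhausted.
            for row in iter(cur.fetchone, None):
                yield tuple(row)
        finally:
            cur.close()
```

A learner could then call `read_rows("orders.csv")` or `read_rows(conn, "select * from order_products")` without caring where the data lives. The file name here is only an example.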

kihyuk-nam commented 2 years ago

First, the Dataset (not the DataLoader) needs to be extended to accept DB connection info:

import psycopg2 as pg2
import pymysql as my
from torch.utils.data import Dataset

class DBDataset(Dataset):

    # credential and batch_size are required; uri and db_name have defaults.
    def __init__(self, dbms, credential, batch_size, uri="mysql://nam@traindb.org:3306", db_name="instacart"):
        if dbms == "mysql":
            ...  # connect via pymysql (`my`)
        elif dbms == "postgresql":
            self.conn = pg2.connect(
                database="instacart", user="postgres", password="1234", host="traindb.org", port="4321"
            )
        ...

    def __getitem__(self, query):
        # e.g. select * from order_products ...
        cur = self.conn.cursor()
        cur.execute(query)
        return cur.fetchall()

Second, tune it to remove any performance bottlenecks. This is the tricky part. (Ref.: simple test code from an unknown author: https://github.com/fschur/SQL-Dataset-for-Pytorch)
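One common source of such a bottleneck is issuing one query per sample. A minimal sketch of an alternative, assuming a DB-API 2.0 connection, is to run the query once and pull rows in batches with `fetchmany`. The name `iter_batches` is hypothetical, and `sqlite3` stands in for the real DBMS only so the example is self-contained.

```python
import sqlite3
from typing import Any, Iterator, List, Tuple


def iter_batches(conn: Any, query: str, batch_size: int = 1024) -> Iterator[List[Tuple[Any, ...]]]:
    """Run `query` once and yield its rows in lists of up to `batch_size`."""
    cur = conn.cursor()
    try:
        cur.execute(query)
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break  # result set exhausted
            yield list(rows)
    finally:
        cur.close()


if __name__ == "__main__":
    # Stand-in database; table and column names are illustrative only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE order_products (order_id INTEGER, product_id INTEGER)")
    conn.executemany(
        "INSERT INTO order_products VALUES (?, ?)",
        [(i, i * 10) for i in range(10)],
    )
    conn.commit()
    for batch in iter_batches(conn, "SELECT * FROM order_products", batch_size=4):
        print(len(batch), batch[0])
```

With psycopg2, a default cursor transfers the whole result set to the client on `execute`, so batching alone does not bound memory there; a named (server-side) cursor would likely be needed. Wrapping such a generator in a PyTorch `IterableDataset` is one possible next step, not something this thread specifies.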

taewhi commented 2 years ago

You can look at JayDeBeApi for loading data from a DBMS: https://pypi.org/project/JayDeBeApi/

kihyuk-nam commented 2 years ago

cf.