GoldrickD commented 3 years ago

Connect PySpark directly to the DB table, so that once that cell is run in the notebook it is pulling constantly updated data instead of exporting the CSV. This allows the ML to be completed concurrently with data collection.

GoldrickD commented 3 years ago

Awaiting DB information to finalize pipeline

jimjim975 commented 3 years ago

conn = pyodbc.connect(DRIVER = '{ODBC Driver 17 for SQL Server}', SERVER = 'azure.openwit.tech, 1433', DATABASE = 'openwitprod', UID = 'sa', PWD = 'Seniorproject123!'); cursor = conn.cursor();

Pull from DB directly

defines schema for dataframe entries (each structfield is a different column in database)

schema = StructType([ StructField('timestamp', StringType(), True),\ StructField('age', StringType(), True),\ StructField('gender', StringType(), True),\ StructField('gradyear', StringType(), True),\ StructField('major', StringType(), True),\ StructField('minor', StringType(), True),\ StructField('withdrawals', StringType(), True),\ StructField('collegechoice', StringType(), True),\ StructField('campusliving', StringType(), True),\ StructField('exercise', StringType(), True),\ StructField('employed', StringType(), True),\ StructField('coops', StringType(), True),\ StructField('extracurricular', StringType(), True),\ StructField('classnum', StringType(), True),\ StructField('timeoutfriends', StringType(), True),\ StructField('vaccinated', StringType(), True),\ StructField('classesfailed', StringType(), True),\ StructField('freetime', StringType(), True),\ StructField('druguse', StringType(), True),\ StructField('cwentworthr', StringType(), True),\ StructField('hstudying', StringType(), True),\ StructField('GPA', StringType(), True),\ StructField('overallexp', StringType(), True),\ StructField('mentalhealth', StringType(), True),\ StructField('wentresources', StringType(), True),\ StructField('firstgen', StringType(), True),\ StructField('terms', StringType(), True),\ StructField('sleep', StringType(), True)\ ])

df = spark.createDataFrame([], StructType([]))

cursor.execute("select * from openwitprod.dbo.ProdTableNew"); df = cursor.fetchall();

print(df)

df2 = spark.createDataFrame((tuple(r) for r in df), schema) df2.show()

jimjim975 / OpenWIT

DB connection to PySpark #14

Pull from DB directly

defines schema for dataframe entries (each structfield is a different column in database)

df = spark.createDataFrame([], StructType([]))

print(df)