Closed GoldrickD closed 3 years ago
Awaiting DB information to finalize pipeline
conn = pyodbc.connect(DRIVER = '{ODBC Driver 17 for SQL Server}', SERVER = 'azure.openwit.tech, 1433', DATABASE = 'openwitprod', UID = 'sa', PWD = 'Seniorproject123!'); cursor = conn.cursor();
schema = StructType([ StructField('timestamp', StringType(), True),\ StructField('age', StringType(), True),\ StructField('gender', StringType(), True),\ StructField('gradyear', StringType(), True),\ StructField('major', StringType(), True),\ StructField('minor', StringType(), True),\ StructField('withdrawals', StringType(), True),\ StructField('collegechoice', StringType(), True),\ StructField('campusliving', StringType(), True),\ StructField('exercise', StringType(), True),\ StructField('employed', StringType(), True),\ StructField('coops', StringType(), True),\ StructField('extracurricular', StringType(), True),\ StructField('classnum', StringType(), True),\ StructField('timeoutfriends', StringType(), True),\ StructField('vaccinated', StringType(), True),\ StructField('classesfailed', StringType(), True),\ StructField('freetime', StringType(), True),\ StructField('druguse', StringType(), True),\ StructField('cwentworthr', StringType(), True),\ StructField('hstudying', StringType(), True),\ StructField('GPA', StringType(), True),\ StructField('overallexp', StringType(), True),\ StructField('mentalhealth', StringType(), True),\ StructField('wentresources', StringType(), True),\ StructField('firstgen', StringType(), True),\ StructField('terms', StringType(), True),\ StructField('sleep', StringType(), True)\ ])
cursor.execute("select * from openwitprod.dbo.ProdTableNew"); df = cursor.fetchall();
df2 = spark.createDataFrame((tuple(r) for r in df), schema) df2.show()
Connect PySpark directly to the DB table, so that once that cell is run in the notebook it is pulling constantly updated data instead of exporting the CSV. This allows the ML to be completed concurrently with data collection.