marioaburto10 opened 5 years ago
One approach is to use task queues, because they provide a convenient way for an application to request the execution of a task by a worker process. Worker processes run independently of the application and will (hopefully) speed up sending a file to a database by using a pool of workers.
Two popular task queues for Python are Celery and Redis Queue. I will attempt this approach using this example as a reference.
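The worker-pool idea can be sketched with just the standard library (Celery and Redis Queue add a broker and persistence on top of this same pattern). This is a hypothetical sketch, not the app's actual code; `insert_chunk` stands in for the real database write:

```python
# Hypothetical sketch of the worker-pool pattern using only the standard
# library; Celery/RQ layer a message broker on top of this same idea.
from multiprocessing import Pool

def insert_chunk(rows):
    # Placeholder "task": in the real app this would write rows to the DB.
    return len(rows)

def load_file(all_rows, chunk_size=2, workers=4):
    # Split the file's rows into chunks and hand each chunk to a worker.
    chunks = [all_rows[i:i + chunk_size]
              for i in range(0, len(all_rows), chunk_size)]
    with Pool(workers) as pool:
        inserted = pool.map(insert_chunk, chunks)
    return sum(inserted)

if __name__ == "__main__":
    rows = [("id", i) for i in range(10)]
    print(load_file(rows))  # 10 rows "inserted" across the pool
```

The key property is the same one the task-queue approach gives the Flask server: the request handler only enqueues work, while the pool churns through chunks independently.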
Also, I think I will start with SQLite and then move to a more robust database like MySQL or PostgreSQL.
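SQLite is easy to start with because it needs no server. A minimal sketch (the table name `records` and the in-memory database are my own choices for illustration) of loading a CSV with `executemany`:

```python
# Minimal sketch: stream a CSV into SQLite, creating the table from the
# header row. Table name and ":memory:" path are illustrative choices.
import csv
import io
import sqlite3

def csv_to_sqlite(csv_text, db_path=":memory:"):
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    conn = sqlite3.connect(db_path)
    cols = ", ".join(f'"{h}" TEXT' for h in header)
    conn.execute(f"CREATE TABLE records ({cols})")
    placeholders = ", ".join("?" for _ in header)
    # executemany consumes the reader lazily, row by row.
    conn.executemany(f"INSERT INTO records VALUES ({placeholders})", reader)
    conn.commit()
    return conn

if __name__ == "__main__":
    conn = csv_to_sqlite("id,name\n1,a\n2,b\n")
    print(conn.execute("SELECT COUNT(*) FROM records").fetchone()[0])  # 2
```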
Update: after researching intensively, I have found a different, more efficient approach using pyspark. Apache Spark is written in Scala, and pyspark is the Python API for Spark, allowing full Spark integration from Python. Pyspark has two modules (pyspark.streaming and pyspark.sql) that will be perfect for doing exactly what I need here: streaming, and storing/retrieving to/from a database. Pyspark is also great for Big Data, which is what I will be working with (currently dealing with a 6.9 GB csv file with ~5 million rows).
After doing A LOT of research, I found this example that explains how to read a large csv file using Spark, create a table from the csv schema, and then stream data from the file into the newly created table/database. That example is written in Scala and Java and uses Azure SQL, but after reading through the pyspark docs I was able to translate most of it to Python. I will test it out in the jupyter notebook spark_write_mysql.ipynb first, before implementing it on the flask server.
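The translated flow looks roughly like this (a sketch, not the notebook verbatim: the file path, database name, and credentials are placeholders, and running it requires a Spark installation, the MySQL JDBC driver on the classpath, and a reachable MySQL server):

```python
# Rough pyspark translation of the Scala/Java example: read a big csv,
# infer its schema, and write it to MySQL over JDBC. All connection
# details below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv_to_mysql")
         .enableHiveSupport()   # needed later to create a temp/Hive table
         .getOrCreate())

# Let Spark infer the schema directly from the uploaded file.
df = spark.read.csv("uploads/big_file.csv", header=True, inferSchema=True)
df.printSchema()

# Write the dataframe to MySQL, creating the table if needed.
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://localhost:3306/mydb")
   .option("dbtable", "records")
   .option("user", "root")
   .option("password", "secret")
   .mode("overwrite")
   .save())
```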
Some issues along the way were:

- An error occurred while calling jdbc: `java.sql.SQLSyntaxErrorException: Identifier name is too long`. So I had to rename it.
- I had to add `enableHiveSupport()` to my initial Spark session in order to be able to create a temp table, which is needed to create a mysql table.

So far, I am able to read the csv in spark, get the schema from the csv file and create a temporary Hive table. I'm getting an error when trying to create the table on MySQL: `Row size too large (> 8126). Changing some columns to TEXT or BLOB may help. In current row format, BLOB prefix of 0 bytes is stored inline.`
It seems to be a common error in MySQL, and I have attempted changing the global variables in my my.cnf by adding these three lines: `innodb_strict_mode = 0`, `innodb_log_file_size = 512M`, `innodb_file_per_table = 1`, but I am still getting the error. I have tried restarting my jupyter notebook kernel and mysql server but the issue persists. I have closed down every process and exited every program. I will shut my computer down and start fresh again in the morning. Worst case scenario, I'll install an earlier version of MySQL, because I'm using version 8.0, which has a lot of changes compared to the earlier 5.X and 6.X versions (and also less online support).
Once I can fully create my MySQL table then I can begin streaming data into it.
I uninstalled and removed all instances of mysql version 8.0 and installed version 5.7 via homebrew, but it never let me run the server because it could not locate the pid file, even after I created it and made sure mysql had read and write access to the directories. Long story short, I spent hours just trying to get the mysql server to run and nothing really changed. I removed and uninstalled version 5.7 and reinstalled version 8.0, but I am still getting the error `java.sql.BatchUpdateException: Row size too large (> 8126). Changing some columns to TEXT or BLOB may help. In current row format, BLOB prefix of 0 bytes is stored inline.` This is new in version 8.0 and cannot be changed. I will try to either downgrade to an earlier version or see how I can get around this issue.
I downgraded to 5.7 but I am still getting the same error. As of now I'm not sure how to get around it, given that I am dynamically getting the schema from the uploaded file rather than defining it myself.
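One idea I may try (a sketch of my own, not from the example; the helper, table name, and type mapping are hypothetical): since the schema is inferred dynamically anyway, I could generate the CREATE TABLE statement myself, mapping every string column to TEXT so the values are stored off-page and don't count against the InnoDB row-size limit, run that DDL against MySQL first, and then have Spark append into the existing table.

```python
# Hypothetical sketch: build MySQL DDL from an inferred schema, forcing
# string columns to TEXT to stay under the InnoDB row-size limit.
def build_create_table(table, fields):
    """fields: list of (name, spark_type) pairs, e.g. taken from df.schema."""
    type_map = {"string": "TEXT", "int": "INT", "bigint": "BIGINT",
                "double": "DOUBLE", "boolean": "BOOLEAN"}
    cols = ", ".join(f"`{name}` {type_map.get(t, 'TEXT')}"
                     for name, t in fields)
    return f"CREATE TABLE `{table}` ({cols})"

# With a Spark dataframe df, the pairs would come from the schema, e.g.:
#   fields = [(f.name, f.dataType.simpleString()) for f in df.schema.fields]
# and after running the DDL against MySQL, Spark could append into it:
#   df.write.format("jdbc").option("dbtable", "records").mode("append").save()

print(build_create_table("records", [("name", "string"), ("age", "int")]))
# CREATE TABLE `records` (`name` TEXT, `age` INT)
```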
After finding an efficient way to upload big files to a flask server, I now have to send this data to a database so that the user can query the database and retrieve desired records. Querying a database is faster and more efficient than looking through a file and locating desired records.
Observations

The notebooks for this work are in the jupyter_notebooks folder here. The user should not have to wait an hour for the file to be streamed to a database.