marioaburto10 opened 5 years ago
One approach is to use task queues, because they provide a convenient way for an application to request the execution of a task by a worker process. Worker processes run independently of the application and will (hopefully) speed up sending a file to a database by using a pool of workers.
Two popular task queues for Python are Celery and Redis Queue. I will attempt this approach using this example as a reference.
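The worker-pool idea can be sketched with just the standard library (Celery and Redis Queue add a broker and persistence on top of this same pattern). This is a hypothetical sketch, not the app's actual code; `insert_chunk` stands in for the real database write:

```python
# Hypothetical sketch of the worker-pool pattern using only the standard
# library; Celery/RQ layer a message broker on top of this same idea.
from multiprocessing import Pool

def insert_chunk(rows):
    # Placeholder "task": in the real app this would write rows to the DB.
    return len(rows)

def load_file(all_rows, chunk_size=2, workers=4):
    # Split the file's rows into chunks and hand each chunk to a worker.
    chunks = [all_rows[i:i + chunk_size]
              for i in range(0, len(all_rows), chunk_size)]
    with Pool(workers) as pool:
        inserted = pool.map(insert_chunk, chunks)
    return sum(inserted)

if __name__ == "__main__":
    rows = [("id", i) for i in range(10)]
    print(load_file(rows))  # 10 rows "inserted" across the pool
```

The key property is the same one the task-queue approach gives the Flask server: the request handler only enqueues work, while the pool churns through chunks independently.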
Also, I think I will start with SQLite and then move to a more robust database like MySQL or PostgreSQL.
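SQLite is easy to start with because it needs no server. A minimal sketch (the table name `records` and the in-memory database are my own choices for illustration) of loading a CSV with `executemany`:

```python
# Minimal sketch: stream a CSV into SQLite, creating the table from the
# header row. Table name and ":memory:" path are illustrative choices.
import csv
import io
import sqlite3

def csv_to_sqlite(csv_text, db_path=":memory:"):
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    conn = sqlite3.connect(db_path)
    cols = ", ".join(f'"{h}" TEXT' for h in header)
    conn.execute(f"CREATE TABLE records ({cols})")
    placeholders = ", ".join("?" for _ in header)
    # executemany consumes the reader lazily, row by row.
    conn.executemany(f"INSERT INTO records VALUES ({placeholders})", reader)
    conn.commit()
    return conn

if __name__ == "__main__":
    conn = csv_to_sqlite("id,name\n1,a\n2,b\n")
    print(conn.execute("SELECT COUNT(*) FROM records").fetchone()[0])  # 2
```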
Update: after researching intensively, I have found a different, more efficient approach using pyspark. Apache Spark is written in Scala, and pyspark is the Python API for Spark, allowing full Spark integration from Python. Pyspark has two modules (pyspark.streaming and pyspark.sql) that will be perfect for doing exactly what I need here: streaming, and storing/retrieving to/from a database. Pyspark is also great for Big Data, which is what I will be working with (currently dealing with a 6.9 GB csv file with ~5 million rows).
After doing A LOT of research, I found this example that explains how to read a large csv file using Spark, create a table from the csv schema, and then stream data from the file into the newly created table/database. That example is written in Scala and Java and uses Azure SQL, but after reading through the pyspark docs I was able to translate most of it to Python. I will test it out in the jupyter notebook spark_write_mysql.ipynb first, before implementing it on the flask server.
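The translated flow looks roughly like this (a sketch, not the notebook verbatim: the file path, database name, and credentials are placeholders, and running it requires a Spark installation, the MySQL JDBC driver on the classpath, and a reachable MySQL server):

```python
# Rough pyspark translation of the Scala/Java example: read a big csv,
# infer its schema, and write it to MySQL over JDBC. All connection
# details below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv_to_mysql")
         .enableHiveSupport()   # needed later to create a temp/Hive table
         .getOrCreate())

# Let Spark infer the schema directly from the uploaded file.
df = spark.read.csv("uploads/big_file.csv", header=True, inferSchema=True)
df.printSchema()

# Write the dataframe to MySQL, creating the table if needed.
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://localhost:3306/mydb")
   .option("dbtable", "records")
   .option("user", "root")
   .option("password", "secret")
   .mode("overwrite")
   .save())
```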
Some issues along the way were:

- An error occurred while calling jdbc: `java.sql.SQLSyntaxErrorException: Identifier name is too long`. So I had to rename it.
- I had to add `enableHiveSupport()` to my initial Spark session in order to be able to create a temp table, which is needed to create a mysql table.

So far, I am able to read the csv in spark, get the schema from the csv file and create a temporary Hive table. I'm getting an error when trying to create the table on MySQL: `Row size too large (> 8126). Changing some columns to TEXT or BLOB may help. In current row format, BLOB prefix of 0 bytes is stored inline.`
It seems to be a common error in MySQL, and I have attempted changing the global variables in my my.cnf by adding these three lines: `innodb_strict_mode = 0`, `innodb_log_file_size = 512M`, `innodb_file_per_table = 1`, but I am still getting the error. I have tried restarting my jupyter notebook kernel and mysql server but the issue persists. I have closed down every process and exited every program. I will shut my computer down and start fresh again in the morning. Worst case scenario, I'll install an earlier version of MySQL, because I'm using version 8.0, which has a lot of changes compared to the earlier 5.X and 6.X versions (and also less online support).
Once I can fully create my MySQL table then I can begin streaming data into it.
I uninstalled and removed all instances of mysql version 8.0 and installed version 5.7 via homebrew, but it never let me run the server because it could not locate the pid file, even after I created it and made sure mysql had read and write access to the directories. Long story short, I spent hours just trying to get the mysql server to run and nothing really changed. I removed and uninstalled version 5.7 and reinstalled version 8.0, but I am still getting the error `java.sql.BatchUpdateException: Row size too large (> 8126). Changing some columns to TEXT or BLOB may help. In current row format, BLOB prefix of 0 bytes is stored inline.` This is new in version 8.0 and cannot be changed. I will try to either downgrade to an earlier version or see how I can get around this issue.
I downgraded to 5.7 but I am still getting the same error. As of now I'm not sure how to get around it, given that I am dynamically getting the schema from the uploaded file rather than defining it myself.
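One idea I may try (a sketch of my own, not from the example; the helper, table name, and type mapping are hypothetical): since the schema is inferred dynamically anyway, I could generate the CREATE TABLE statement myself, mapping every string column to TEXT so the values are stored off-page and don't count against the InnoDB row-size limit, run that DDL against MySQL first, and then have Spark append into the existing table.

```python
# Hypothetical sketch: build MySQL DDL from an inferred schema, forcing
# string columns to TEXT to stay under the InnoDB row-size limit.
def build_create_table(table, fields):
    """fields: list of (name, spark_type) pairs, e.g. taken from df.schema."""
    type_map = {"string": "TEXT", "int": "INT", "bigint": "BIGINT",
                "double": "DOUBLE", "boolean": "BOOLEAN"}
    cols = ", ".join(f"`{name}` {type_map.get(t, 'TEXT')}"
                     for name, t in fields)
    return f"CREATE TABLE `{table}` ({cols})"

# With a Spark dataframe df, the pairs would come from the schema, e.g.:
#   fields = [(f.name, f.dataType.simpleString()) for f in df.schema.fields]
# and after running the DDL against MySQL, Spark could append into it:
#   df.write.format("jdbc").option("dbtable", "records").mode("append").save()

print(build_create_table("records", [("name", "string"), ("age", "int")]))
# CREATE TABLE `records` (`name` TEXT, `age` INT)
```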
After finding an efficient way to upload big files to a flask server, I now have to send this data to a database so that the user can query the database and retrieve desired records. Querying a database is faster and more efficient than looking through a file and locating desired records.
Observations

The notebooks for this work are in the jupyter_notebooks folder here. The user should not have to wait an hour for the file to be streamed to a database.