tl-its-umich-edu / mpr-research-data

Django DB to GCP Pipeline Application Commit (iss. #1, #2, #7) #3

Closed takposha closed 2 years ago

takposha commented 2 years ago

Currently working/completed: dbToBucketScript.py can access the DB, retrieve course IDs, retrieve the corresponding course data, and send it to a GCP bucket. Error messages and comments have been added throughout the script, so it should be easier to tell whether and why something fails. The SQL queries can be adjusted through variables the user provides. There is now a basic README file, which still needs to be updated to cover the config parameters.

Not working/needs to be done: I'm not sure how a config file should be set up for a Docker application. I have listed all the variables that should be configurable in the Python script files, so it should just be a matter of moving them into a config file. I will use a .env file for this. I don't know how frequently the data is pulled or whether the files get updated, but checking whether the TSV file already exists in the GCP bucket and is up to date could help avoid unnecessary pushes to the bucket. This shouldn't matter if it's only a small amount of data daily.
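One way the "already exists and is up to date" check could work is to compare checksums before uploading. The sketch below is only an illustration, not part of this PR: `needs_upload` and `local_md5_b64` are hypothetical helper names, and `bucket` is assumed to behave like a `google.cloud.storage.Bucket` (its `get_blob()` returns `None` for a missing object, and a blob exposes its base64-encoded MD5 as `md5_hash`).

```python
import base64
import hashlib


def local_md5_b64(data: bytes) -> str:
    """Return the MD5 digest of data, base64-encoded the way Cloud Storage reports it."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def needs_upload(bucket, blob_name: str, data: bytes) -> bool:
    """Return True when the object is missing or its content differs from data.

    Hypothetical helper: `bucket` only needs a get_blob(name) method that
    returns None for a missing object, or an object with an `md5_hash`
    attribute otherwise.
    """
    blob = bucket.get_blob(blob_name)
    if blob is None:
        # Nothing in the bucket yet, so the file has to be pushed.
        return True
    # Push only when the stored checksum differs from the local one.
    return blob.md5_hash != local_md5_b64(data)
```

The upload step would then be skipped whenever `needs_upload(...)` returns `False`. For a small daily volume, the extra metadata request per file may not be worth it.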

Resolves #1 Resolves #2 Resolves #7

takposha commented 2 years ago

Added a config file for adjusting variables. Updated the README with steps for managing the config file. Adjusted the Docker files to read the .env file and disabled sleep infinity.
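As a rough illustration of reading such settings once Docker has passed the .env values into the container environment, a script could do something like the following. The variable names `GCLOUD_BUCKET` and `NUMBER_OF_MONTHS` are the ones quoted later in this thread; the `load_settings` helper, the validation, and the default of 4 months are assumptions made up for this sketch.

```python
import os


def load_settings(env=os.environ):
    """Read pipeline settings from environment variables.

    Environment values are always strings, so numeric settings are
    converted and validated here.
    """
    bucket = env.get("GCLOUD_BUCKET")
    if not bucket:
        raise ValueError("GCLOUD_BUCKET must be set")
    # Hypothetical default of 4 months; not taken from the project.
    months = int(env.get("NUMBER_OF_MONTHS", "4"))
    if months < 1:
        raise ValueError("NUMBER_OF_MONTHS must be at least 1")
    return {"bucket": bucket, "months": months}
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real environment.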

pep8speaks commented 2 years ago

Hello @takposha! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 74:120: E501 line too long (141 > 119 characters)

Comment last updated at 2022-05-10 15:59:09 UTC

takposha commented 2 years ago

That should be the last commit. VS Code's PEP 8 formatting doesn't catch everything pep8speaks does, but that should be sorted now.

lsloan commented 2 years ago

@pep8speaks suggest diff

takposha commented 2 years ago

Added the better GCP key implementation.

Added logging to console using the logging library.
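For readers who want to reproduce console logging along these lines, here is a minimal sketch using the standard `logging` module. The format strings are inferred from the log output quoted later in this thread (timestamp, padded level, `[file:line]`, message) and may not match the script's actual configuration; `make_logger` is a hypothetical helper name.

```python
import logging
import sys

# Assumed format, reconstructed from the log lines quoted in this thread.
LOG_FORMAT = "%(asctime)s %(levelname)-8s [%(filename)s:%(lineno)d] - %(message)s"
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def make_logger(name="mpr-research-data", stream=sys.stdout):
    """Return a logger that writes INFO and above to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        # Attach the handler only once so repeated calls don't duplicate output.
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter(LOG_FORMAT, DATE_FORMAT))
        logger.addHandler(handler)
    return logger
```

Calling `make_logger().info("Saving to GCP: %s", file_name)` would then print one timestamped line per event, which Docker Compose prefixes with the container name.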

lsloan commented 2 years ago

I thought what you had before with the GCP key JSON embedded in the .env file was a good thing. I think of that key in JSON as a single entity. We don't know what all of its contents will be. Google may change it at some point. So, if we break it down into its components now, maybe in the future it won't work. It seems better to keep it as a single JSON string.

My change reintroduced the json module, so I organized the imports, too. It's fairly common to organize imports by core Python modules, followed by third-party modules, and lastly local project modules.
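The single-JSON-string approach described above could look roughly like the sketch below. It is not the code from this branch: `GCP_KEY_JSON` is a hypothetical variable name, and both helper functions are made up for illustration. The key is parsed as one opaque JSON object and handed to the client library unchanged, so any fields Google adds later pass straight through.

```python
import json
import os


def load_service_account_info(env=os.environ, var_name="GCP_KEY_JSON"):
    """Parse the whole service-account key from a single environment variable.

    `GCP_KEY_JSON` is a hypothetical name; use whatever the .env file defines.
    """
    raw = env.get(var_name)
    if not raw:
        raise ValueError(f"{var_name} is not set")
    info = json.loads(raw)
    if not isinstance(info, dict):
        raise ValueError(f"{var_name} must contain a JSON object")
    return info


def make_storage_client(info):
    """Build a Cloud Storage client from the parsed key.

    Requires the google-cloud-storage package; imported lazily so the
    parsing helper above can be used and tested without it.
    """
    from google.cloud import storage

    return storage.Client.from_service_account_info(info)
```

Because no individual key fields are named anywhere in this code, nothing needs to change if the key's contents do.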

lsloan commented 2 years ago

So, the good news is that when I ran the application that I checked out of your branch, it worked well.

Aside from the GCP key change I made, I made a couple of changes to my .env file, too. I set NUMBER_OF_MONTHS = 1 to run a shorter test, and I set GCLOUD_BUCKET = 'mpr-research-data-uploads-lsloan_test' so the test would write to a different bucket and not disturb what was already there.

I ran the app before I created the new bucket, just to see what would happen. I got results like this:

mpr-research-data  | 2022-05-10T14:44:52+0000 INFO     [mpr-research-data.py:170] - Slicing: 495677 - Math 216 WN 2022.tsv
mpr-research-data  | 2022-05-10T14:44:52+0000 INFO     [mpr-research-data.py:181] - Saving to GCP: 495677 - Math 216 WN 2022.tsv
mpr-research-data  | 2022-05-10T14:44:56+0000 ERROR    [mpr-research-data.py:186] - Error Message: 404 POST https://storage.googleapis.com/upload/storage/v1/b/mpr-research-data-uploads-lsloan_test/o?uploadType=multipart: {
mpr-research-data  |   "error": {
mpr-research-data  |     "code": 404,
mpr-research-data  |     "message": "The specified bucket does not exist.",
mpr-research-data  |     "errors": [
mpr-research-data  |       {
mpr-research-data  |         "message": "The specified bucket does not exist.",
mpr-research-data  |         "domain": "global",
mpr-research-data  |         "reason": "notFound"
mpr-research-data  |       }
mpr-research-data  |     ]
mpr-research-data  |   }
mpr-research-data  | }
mpr-research-data  | : ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>)
mpr-research-data  | 2022-05-10T14:44:56+0000 ERROR    [mpr-research-data.py:187] - Failed to upload Course Data for 495677 - Math 216 WN 2022.tsv to GCP.

Which is good. It reported that error once for each of the courses that met my 1-month criterion.

So, I created the required bucket, ran it again, and all was good.
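If repeating the same 404 for every course ever becomes noisy, one option would be a single check before the upload loop starts. This is only a sketch of that idea, not something this PR does: `ensure_bucket_exists` is a hypothetical helper, and `client` is assumed to behave like a `google.cloud.storage.Client`, whose `lookup_bucket()` returns `None` when the bucket does not exist.

```python
def ensure_bucket_exists(client, bucket_name):
    """Return the bucket, or raise once if it is missing.

    Hypothetical helper: `client` only needs a lookup_bucket(name) method
    that returns None for a bucket that does not exist.
    """
    bucket = client.lookup_bucket(bucket_name)
    if bucket is None:
        raise RuntimeError(
            f"Bucket {bucket_name!r} does not exist; create it before running the pipeline."
        )
    return bucket
```

The per-upload error handling would still be worth keeping, since uploads can fail for reasons other than a missing bucket.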