jcarreira / cirrus

Serverless ML Framework
Apache License 2.0
22 stars 9 forks source link

Preprocessing #92

Closed neelsomani closed 5 years ago

neelsomani commented 6 years ago

Not ready to merge.

handler.py is an AWS Lambda function to get the local bounds for chunks of data, and scale chunks of data to global bounds. aggregate.py reads in local bounds for all chunks of data (computed by handler.py) and aggregates them into the global bounds.

Lambda function is getting the error "Process exited before completing request," but runs fine locally.

neelsomani commented 6 years ago

Ready for review. Everything works now.

Here's how it scales w.r.t. the number of chunks of data: 1 object: 29.955233097076416 seconds 10 objects: 64.29478168487549 seconds 100 objects: 152.88286185264587 seconds

Next steps could be to merge this PR and generalize the interface in another, or to make more preprocessing scripts in this PR.

neelsomani commented 5 years ago

Hi @jcarreira, I've added a preprocessing module called Preprocessing.py, which is ready for review. An alternative that I was thinking about was maybe modeling the interface off of the sklearn preprocessing interface. Specifically, we would have one module called preprocessing.py, which would have the classes MinMaxScaler, NormalScaler, etc. within it. Then the user would initialize the scaler:

s = preprocessing.MinMaxScaler(s3_src_bucket, s3_dest_bucket, 0.0, 1.0)

and run it with something like s.normalize().

neelsomani commented 5 years ago

Hey @jcarreira, updates:

Ready for review, or we can add other preprocessing if needed.

jcarreira commented 5 years ago

You use python's hash() to hash the data. It seems [1] that for newer versions this function is non-deterministic which is not what we want. We should find how to make this deterministic.

https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions

neelsomani commented 5 years ago

Hey @jcarreira, went through and resolved the edits. Left a couple of comments where things are ambiguous.

For the hash function, I've changed it to use hashlib so it works with Python3 as well.

neelsomani commented 5 years ago

Hey @jcarreira, I just finished cleaning up the code, further refactoring, and running pylint - PTAL

neelsomani commented 5 years ago

Hey @jcarreira, this latest push is actually all of the files after pylint. Any errors that are still in pylint either can't be fixed (e.g., it's saying too many parameters are being passed to a function that cannot be refactored, like the invocation of a LambdaThread), or it would break forwards compatibility (removing parentheses from print statements). Using pylint for Python2

Edit: Added in commit for the pylintrc file added this morning

jcarreira commented 5 years ago

Please fix conflicts.

neelsomani commented 5 years ago

Hey @jcarreira, fixed the merge conflicts

neelsomani commented 5 years ago

Ready for review here @jcarreira