General changes and fixes

omerb01 commented 4 years ago

This patch rearranges the project and fixes various parts after the last PR.

Main changes:

Update PyWren version to 1.7.0 to solve cloudobject collisions.
Fix serverless shuffle segmentation to clean up temporary intermediate cloudobjects during the process.
Update the run_pipeline.py script to be more widely applicable, and remove the generate_centroids.py script.
Move the mol DB upload pre-stage into Pipeline as part of the "mol db process".
Remove PyWren calls from the VM code to reduce costs a bit.
Enhance the stats file to automatically support both PyWren run stats and VM run stats.
Improve and fix the cacher to manage ds and db separately. The Pipeline class has 2 bool parameters, use_ds_cache and use_db_cache, which reproduce the previous behaviour automatically. For example, we can use a cached db and flush the ds-related stages by passing only use_ds_cache=False.
Get rid of the annoying warning message from the read_msgpack() and to_msgpack() pandas methods. There is now one serialise func and one deserialise func that handle all the serialisation the project needs.
Update logger prints, notebooks and docs.
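
To illustrate the cacher change, here is a minimal sketch of how two independent cache flags might behave. It is not the code from this PR: the `Cacher` class, its method names, the in-memory storage and the `True` defaults are all assumptions made for illustration. Only the `Pipeline` parameter names `use_ds_cache` and `use_db_cache` come from the description above.

```python
class Cacher:
    """Hypothetical in-memory cache with one namespace for ds and one for db."""

    def __init__(self):
        self._store = {"ds": {}, "db": {}}

    def save(self, namespace, key, value):
        self._store[namespace][key] = value

    def load(self, namespace, key):
        return self._store[namespace][key]

    def exists(self, namespace, key):
        return key in self._store[namespace]

    def clean(self, namespace):
        # Flush only the requested namespace; the other one is untouched.
        self._store[namespace].clear()


class Pipeline:
    """Sketch only: the real Pipeline takes more arguments than shown here."""

    def __init__(self, cacher, use_ds_cache=True, use_db_cache=True):
        # Defaulting both flags to True is an assumption of this sketch.
        self.cacher = cacher
        if not use_ds_cache:
            cacher.clean("ds")
        if not use_db_cache:
            cacher.clean("db")
```

With this layout, `Pipeline(cacher, use_ds_cache=False)` drops the ds entries and keeps the cached db, which matches the example given in the description.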

I tested everything with the serverless and VM approaches over ds2/db2 and checked that the results are correct. Note that the new serialiser is based on pyarrow, so we also need to update PyWren's Dockerfile for these changes to work.

omerb01 commented 4 years ago

@LachlanStuart I added serialise() and deserialise() for these reasons:

With pandas version 0.25.1, I always get a warning that the read_msgpack() method is deprecated whenever it is executed. When we ran the full serverless workload this wasn't so bad, because it only affected the logs inside CF. Now we have a hybrid approach, and it outputs so many of these warnings that the logs become too messy. So I decided to move to what pandas suggests for serialisation: the pyarrow module. I read a bit about it in the docs. pyarrow claims that its serialiser can do better than pickle in terms of memory compression and time, and the generic approach of pickle makes it basically the worst serialiser when considering memory and time. Note that serialise() and deserialise() also contain logic to fall back to pickle when we can't use pyarrow, for example when trying to serialise internal metaspace classes. In addition, having one serialiser and one deserialiser makes the code easy to maintain: with a simple change we can switch the serialisation technique of the whole project.
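
A minimal sketch of the "pyarrow first, pickle as fallback" idea follows. It is not the implementation from this PR, which is not shown in the thread. The one-byte format tag, the use of the Arrow IPC stream format, and the restriction to pandas DataFrames are assumptions of this sketch, and both pandas and pyarrow are treated as optional so that the fallback path still works without them.

```python
import io
import pickle

try:  # pandas and pyarrow are optional in this sketch
    import pandas as pd
    import pyarrow as pa
except ImportError:
    pd = pa = None

# Hypothetical one-byte tags that record which encoder produced the payload.
_ARROW, _PICKLE = b"A", b"P"


def serialise(obj):
    """Encode obj to bytes, trying Arrow for DataFrames and pickle otherwise."""
    if pa is not None and isinstance(obj, pd.DataFrame):
        try:
            table = pa.Table.from_pandas(obj)
            sink = io.BytesIO()
            with pa.ipc.new_stream(sink, table.schema) as writer:
                writer.write_table(table)
            return _ARROW + sink.getvalue()
        except pa.ArrowException:
            pass  # Arrow cannot represent this frame, fall back to pickle
    return _PICKLE + pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


def deserialise(data):
    """Decode bytes produced by serialise(), dispatching on the tag byte."""
    tag, payload = data[:1], data[1:]
    if tag == _ARROW:
        return pa.ipc.open_stream(payload).read_all().to_pandas()
    return pickle.loads(payload)
```

Because every call site goes through these two functions, swapping the encoding later only requires changing them, which is the maintainability point made above. As with any pickle-based code, deserialise() should only be given data from a trusted source.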

LachlanStuart commented 4 years ago

@omerb01 The serialize/deserialize functions make sense and tidy the code a lot. Thanks for implementing them.

metaspace2020 / Lithops-METASPACE

General changes and fixes #78