Closed omerb01 closed 4 years ago
@LachlanStuart I added `serialise()` and `deserialise()` for these reasons:
With pandas version 0.25.1, I get a deprecation warning for `read_msgpack()` every time it is executed. When we ran the fully serverless workload this wasn't so bad, because it only affected the logs inside CF. Now that we have a hybrid approach, it outputs so many of these warnings that the logs become too messy. So I decided to move to what pandas suggests for serialisation: the `pyarrow` module.
I read a bit about it in the docs. pyarrow claims that its serialiser can beat `pickle` in terms of memory and time. The generic approach of `pickle` makes it basically the worst serialiser when memory and time are considered. Note that `serialise()` and `deserialise()` also contain logic to fall back to `pickle` when we can't use pyarrow, for example when trying to serialise internal metaspace classes.
In addition, having a single serialiser and a single deserialiser makes the code easier to maintain: with one simple change we can switch the serialisation technique of the whole project.
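To illustrate the fallback idea, here is a minimal sketch, not the code from this PR. It assumes the legacy `pyarrow.serialize`/`pyarrow.deserialize` calls, which were deprecated in pyarrow 2.0 and may be absent from newer releases, so the sketch checks for them at runtime. The one-byte tag that records which serialiser was used is an assumption of this example only.

```python
import pickle

try:
    import pyarrow as pa
except ImportError:  # pyarrow is treated as optional in this sketch
    pa = None

# One-byte tags (an assumption of this sketch) record which serialiser produced the payload.
_ARROW, _PICKLE = b"A", b"P"


def _arrow_available():
    # The legacy serialize/deserialize functions may be missing in newer pyarrow releases.
    return pa is not None and hasattr(pa, "serialize") and hasattr(pa, "deserialize")


def serialise(obj):
    """Serialise obj with pyarrow when possible, otherwise with pickle."""
    if _arrow_available():
        try:
            return _ARROW + pa.serialize(obj).to_buffer().to_pybytes()
        except Exception:
            # Deliberately broad for the sketch: anything pyarrow cannot
            # handle (e.g. a custom class) falls through to pickle.
            pass
    return _PICKLE + pickle.dumps(obj)


def deserialise(data):
    """Inverse of serialise(): dispatch on the leading tag byte."""
    tag, payload = data[:1], data[1:]
    if tag == _ARROW:
        return pa.deserialize(payload)
    if tag == _PICKLE:
        return pickle.loads(payload)
    raise ValueError(f"unknown serialiser tag: {tag!r}")
```

Because every caller goes through these two functions, swapping the backend later only means changing their bodies.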
@omerb01 The `serialize`/`deserialize` functions make sense and tidy up the code a lot. Thanks for implementing them.
This patch rearranges the project and fixes various parts after the last PR.
Main changes:
- Reworked the `run_pipeline.py` script to be more broadly applicable and removed the `generate_centroids.py` script.
- The `Pipeline` class has two bool parameters, `use_ds_cache` and `use_db_cache`. These parameters give the same behaviour as before, but automatically. For example, we can use a cached db and flush the ds-related stages by passing only `use_ds_cache=False`.
- Replaced the `read_msgpack()` and `to_msgpack()` pandas methods. We now have one serialise and one deserialise function that handle every serialisation the project needs.

I tested everything with the serverless and VM approaches over ds2/db2 and checked that the results are correct. Note that the new serialiser is based on `pyarrow`, so we also need to update PyWren's Dockerfile for these changes to work.
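As a rough illustration of how the two cache flags could behave, here is a toy sketch. It is not the real `Pipeline` class: the in-memory dict cache, the `ds/` and `db/` key prefixes, the stage names, and the `run_stage` helper are all hypothetical and only model the flag logic described above.

```python
class Pipeline:
    """Toy stand-in for the real Pipeline; only the cache flags are modelled."""

    def __init__(self, cache, use_ds_cache=True, use_db_cache=True):
        self.cache = cache  # hypothetical: stage key -> cached result
        self.use_ds_cache = use_ds_cache
        self.use_db_cache = use_db_cache
        # Flush cached results for whichever side may not reuse them.
        if not use_ds_cache:
            self._flush("ds/")
        if not use_db_cache:
            self._flush("db/")

    def _flush(self, prefix):
        # Collect the keys first so the dict is not mutated while iterating.
        for key in [k for k in self.cache if k.startswith(prefix)]:
            del self.cache[key]

    def run_stage(self, key, compute):
        # Recompute only when no cached result survived the flush.
        if key not in self.cache:
            self.cache[key] = compute()
        return self.cache[key]


# Reuse the cached db stage, recompute the ds stage.
cache = {"ds/segments": "old", "db/centroids": "cached"}
pipeline = Pipeline(cache, use_ds_cache=False)
ds_result = pipeline.run_stage("ds/segments", lambda: "fresh")
db_result = pipeline.run_stage("db/centroids", lambda: "fresh")
```

With `use_ds_cache=False` the ds entry is dropped and recomputed, while the db entry is served from the cache.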