metaspace2020 / Lithops-METASPACE

Lithops-based Serverless implementation of the METASPACE spatial metabolomics annotation pipeline

General changes and fixes #78

Closed omerb01 closed 4 years ago

omerb01 commented 4 years ago

This patch rearranges the project and fixes various parts after the last PR.

main changes:

I tested everything with both the serverless and VM approaches on ds2/db2 and checked that the results are correct. Note that the new serialiser is based on pyarrow, so we also need to update PyWren's Dockerfile for these changes to work.

omerb01 commented 4 years ago

@LachlanStuart I added serialise() and deserialise() for these reasons:

With pandas 0.25.1, every call to read_msgpack() emits a warning saying that the method is deprecated. With the fully serverless workload this wasn't so bad, because it only affected the logs inside CF. Now that we have a hybrid approach, it outputs so many of these warnings that the logs become too messy. So I decided to move to what pandas suggests for serialisation: the pyarrow module.

I read a bit about it in the docs. pyarrow claims that its serialiser can do better than pickle in terms of memory usage and time. Pickle's generic approach makes it basically the worst serialiser when considering memory and time. Note that serialise() and deserialise() also contain logic to fall back to pickle when pyarrow can't be used, for example when serialising internal metaspace classes.

In addition, having a single serialiser and a single deserialiser makes the code easier to maintain: with one simple change we can switch the serialisation technique for the whole project.
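For readers following the thread, here is a minimal sketch of the "pyarrow first, pickle as fallback" idea described above. It is not the code from this patch: the one-byte tag, the choice of the Arrow IPC stream format for DataFrames, and the pickle-only mode when pyarrow or pandas is not installed are all assumptions made for illustration. Only the names serialise() and deserialise() come from the comment.

```python
import pickle

try:
    import pandas as pd
    import pyarrow as pa
except ImportError:
    # Assumption for this sketch: run in pickle-only mode if either is missing.
    pd = pa = None

# Hypothetical one-byte tags recording which serialiser produced the payload.
_ARROW = b"A"
_PICKLE = b"P"


def serialise(obj):
    """Return bytes for obj: Arrow IPC for DataFrames, pickle for everything else."""
    if pa is not None and isinstance(obj, pd.DataFrame):
        try:
            table = pa.Table.from_pandas(obj)
            sink = pa.BufferOutputStream()
            with pa.ipc.new_stream(sink, table.schema) as writer:
                writer.write_table(table)
            return _ARROW + sink.getvalue().to_pybytes()
        except pa.ArrowException:
            # Arrow could not represent this frame; fall through to pickle.
            pass
    return _PICKLE + pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


def deserialise(data):
    """Inverse of serialise(): dispatch on the leading tag byte."""
    tag, payload = data[:1], data[1:]
    if tag == _ARROW:
        return pa.ipc.open_stream(payload).read_all().to_pandas()
    return pickle.loads(payload)
```

Because every call site goes through these two functions, swapping the serialisation technique for the whole project means changing only this one module, which is the maintainability point made above.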

LachlanStuart commented 4 years ago

@omerb01 The serialize/deserialize functions make sense and tidy the code a lot. Thanks for implementing them.