ipython / ipyparallel

IPython Parallel: Interactive Parallel Computing in Python
https://ipyparallel.readthedocs.io/

Question: Storing custom data beside tasks in the central task database #914

Open ottointhesky opened 1 day ago

ottointhesky commented 1 day ago

We use ipp for distributed processing of large topographic data sets. We typically split each data set into spatial tiles, and each tile is then distributed as a separate task through ipp. To give the user detailed feedback on the ipp cluster and the processing status, we are planning to create a kind of dashboard application that graphically shows tiles that have been processed (successfully or not), tiles that are currently being computed, and tiles that are waiting in the queue. So we need some sort of mapping between the task message id (_msgid) and our tile identifier. Of course, it would be possible to store this mapping outside the ipp task database, but it would be a hassle to keep everything in sync. Things would be much easier if the task database/interface allowed storing a custom data field (e.g. a comment string or something similar) that could hold our tile id. This way, no external mapping would be needed.
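For illustration, the external mapping described above might look roughly like the following sketch. `TileRegistry`, the tile names, and the message id strings are hypothetical; the only ipyparallel detail assumed is that an `AsyncResult` exposes its message ids as `msg_ids`. The example uses plain strings in place of a running cluster.

```python
from dataclasses import dataclass, field


@dataclass
class TileRegistry:
    """Hypothetical client-side mapping between task message ids and tile ids."""

    tile_by_msg_id: dict = field(default_factory=dict)

    def register(self, msg_ids, tile_id):
        # One submission can produce several message ids (e.g. on a DirectView),
        # so every id is mapped to the same tile.
        for msg_id in msg_ids:
            self.tile_by_msg_id[msg_id] = tile_id

    def tile_for(self, msg_id):
        # Returns None for tasks that were not submitted through this registry.
        return self.tile_by_msg_id.get(msg_id)


registry = TileRegistry()
# With a real cluster this would follow a call such as
#   ar = view.apply_async(process_tile, tile)
#   registry.register(ar.msg_ids, tile_id)
# Plain strings stand in for the message ids here.
registry.register(["msg-0001"], "tile_12_07")
registry.register(["msg-0002", "msg-0003"], "tile_12_08")
```

Because the registry lives only in the submitting process, a separate dashboard process would need its own copy or a shared store, which is the synchronisation burden the request is trying to avoid.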

minrk commented 13 hours ago

I think storing arbitrary custom data is probably not something we should support, but a simple string task label seems sensible enough, and seems like it would work for what you are describing. Does that sound right?

ottointhesky commented 13 hours ago

You are perfectly right. A task label or task comment would do the trick.

minrk commented 13 hours ago

Yes, I think that's doable. We would need to come up with the APIs for setting these and retrieving tasks based on them.

ottointhesky commented 11 hours ago

Well, I have a narrow view on things, but I would suggest something like:

ar = view.apply_async(task, label='my task label')

In case apply_async is sent to multiple engines (dview[:]), I would use the same label for all tasks.

I'm not sure if AsyncResult should provide (read-only) access to the label, but I guess it would be nice. Since I haven't looked into the task database code yet, I cannot comment on how this should be handled internally.
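As a rough sketch of the read-only label idea, the wrapper below attaches a label on the client side only. It is not the proposed ipyparallel API: `LabeledResult`, `apply_async_labeled`, and the stand-in `_FakeView` / `_FakeResult` classes are all hypothetical, and nothing here writes the label to the task database. The only assumption about a real view is that it has an `apply_async(func, *args, **kwargs)` method.

```python
class LabeledResult:
    """Hypothetical wrapper pairing an AsyncResult-like object with a label."""

    def __init__(self, result, label):
        self._result = result
        self._label = label

    @property
    def label(self):
        # Read-only: no setter is defined, so assignment raises AttributeError.
        return self._label

    def __getattr__(self, name):
        # Delegate everything else (get(), msg_ids, ...) to the wrapped result.
        return getattr(self._result, name)


def apply_async_labeled(view, func, *args, label=None, **kwargs):
    """Submit func through view.apply_async and attach the label client-side."""
    return LabeledResult(view.apply_async(func, *args, **kwargs), label)


class _FakeResult:
    """Stand-in for an AsyncResult so the sketch runs without a cluster."""

    def __init__(self, value):
        self.value = value
        self.msg_ids = ["msg-0001"]

    def get(self):
        return self.value


class _FakeView:
    """Stand-in for a view; runs the function immediately in-process."""

    def apply_async(self, func, *args, **kwargs):
        return _FakeResult(func(*args, **kwargs))


ar = apply_async_labeled(_FakeView(), lambda x: x * 2, 21, label="tile_12_07")
```

A label held this way is lost as soon as the client object goes away, so retrieving tasks by label from another process would still need support in the task database itself.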

Just for completeness: I'm willing to help with the implementation if needed or wanted.