materialsproject / fireworks

The Fireworks Workflow Management Repo.
https://materialsproject.github.io/fireworks

Recover lost runs due to network problem #396

Open davidwaroquiers opened 4 years ago

davidwaroquiers commented 4 years ago

After discussing with @gpetretto, we thought the following feature would be useful:

When a Firework job in normal mode (as opposed to offline mode) completes successfully (i.e. the job has no errors) but cannot update its status in the database (e.g. because the network is down), a FW.json file would be written to disk so that the run can be recovered later (e.g. by detect_lostruns).

This would avoid rerunning jobs that actually finished but simply could not update the database.
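
To make the idea concrete, here is a minimal sketch of the fallback, not an actual FireWorks implementation. The function `report_completion`, the `update_db` callable and the layout of the JSON record are all hypothetical; only the FW.json file name comes from the proposal above.

```python
import json
import os
import tempfile


def report_completion(update_db, launch_dir, launch_id, result, filename="FW.json"):
    """Try to record a finished job in the database; on failure, save it to disk.

    `update_db` is a hypothetical callable (launch_id, result) that talks to the
    database. Returns True if the database was updated, False if the fallback
    file was written instead.
    """
    try:
        update_db(launch_id, result)
        return True
    except Exception as exc:  # e.g. a network or connection error
        # Assumed record layout, for illustration only.
        record = {
            "launch_id": launch_id,
            "state": "COMPLETED",
            "result": result,
            "db_error": repr(exc),
        }
        # Write to a temporary file and rename it afterwards, so that a
        # recovery pass never reads a half-written file.
        fd, tmp_path = tempfile.mkstemp(dir=launch_dir, suffix=".tmp")
        with os.fdopen(fd, "w") as handle:
            json.dump(record, handle)
        os.replace(tmp_path, os.path.join(launch_dir, filename))
        return False
```

The job would call something like this as its last step; a later recovery pass could then pick up the file and replay the update once the database is reachable again.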

I don't think this would be a big deal to implement, but there are probably a few things to take into account, so feel free to comment on this proposal.

computron commented 4 years ago

As long as it can be done cleanly in the code without adding too much complexity, I think this could be fine. I guess the mechanism for doing this could be very similar to how offline mode already operates.
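
For illustration only, a recovery pass in that spirit might look like the sketch below. It does not reuse or reproduce the actual offline-mode code: `recover_completed`, the `update_db` callable and the record keys (`launch_id`, `result`) are hypothetical, and it assumes each finished job left a small JSON file in its launch directory.

```python
import json
import os


def recover_completed(root_dir, update_db, filename="FW.json"):
    """Replay saved completion records into the database.

    Walks `root_dir`, and for every `filename` found calls the hypothetical
    `update_db(launch_id, result)`. Returns the list of recovered launch ids.
    """
    recovered = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        if filename not in filenames:
            continue
        path = os.path.join(dirpath, filename)
        with open(path) as handle:
            record = json.load(handle)
        try:
            update_db(record["launch_id"], record["result"])
        except Exception:
            # Database still unreachable: keep the file for the next pass.
            continue
        # Rename the file so the same record is not replayed twice.
        os.rename(path, path + ".recovered")
        recovered.append(record["launch_id"])
    return recovered
```

Renaming instead of deleting keeps a trace on disk of which runs were recovered this way, which may help when debugging.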

How often do you see this actually happen?

davidwaroquiers commented 4 years ago

Hi,

Not so often, but when it happens it can be really annoying. It can happen, for example, when you have a very large number of small jobs. Say you have 100000 jobs, each of which takes less than a minute (in principle another solution, such as packing those very small jobs together, might be better, but anyway). Access to the database can then become difficult. For the small jobs themselves this is not a real problem, because rerunning even 10000 of them costs less than a minute each. But imagine that, in parallel, you (or a colleague sharing the same db) also have 100 very large jobs: you may then lose some of these large jobs if they fail to update the db. In the past I have run into this kind of situation and "hacked" things a bit to continue without rerunning everything, but it would be beneficial to have this supported directly.

I will look with @gpetretto at how this could be done cleanly. It would indeed probably be very similar to how offline mode works, but we have to check that.

Best,

David