Improve how our Fly deployment via Wasp CLI handles errors

Martinsos commented 1 year ago

wasp deploy fly launch is great if everything goes well, but if something goes wrong (and it often does on Fly lately), it is really hard for a user to figure out what to do next. Where should they continue from? What is already set up?

I would suggest we do two things:

1. Make our script a bit more robust to errors from the Fly side: if an error happens, it could retry the command, in case Fly is just being flaky.
2. Make it clearer to the user how to continue after the error.
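
To make the retry idea a bit more concrete, here is a minimal sketch of a retry helper with exponential backoff. Everything in it (retry, RetryOptions, the injected sleep) is hypothetical and not part of the Wasp CLI; it is kept synchronous only so that it is easy to run and test, while a real version would most likely be async.

```typescript
// Hypothetical helper, not an existing Wasp CLI function.
type RetryOptions = {
  attempts: number;            // total tries, including the first one
  baseDelayMs: number;         // wait before the second try; doubles after each failure
  sleep: (ms: number) => void; // injected so the sketch stays synchronous and testable
};

function retry<T>(run: () => T, opts: RetryOptions): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return run();
    } catch (error) {
      lastError = error;
      // Only wait if another attempt is coming.
      if (attempt < opts.attempts) {
        opts.sleep(opts.baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}

// Usage: a fake command that times out twice and then succeeds.
let calls = 0;
const delays: number[] = [];
const result = retry(
  () => {
    calls += 1;
    if (calls < 3) throw new Error("operation timed out");
    return "attached";
  },
  { attempts: 4, baseDelayMs: 1000, sleep: (ms) => delays.push(ms) },
);
```

Retrying everything blindly is probably not what we want, so a real version would likely retry only errors that look transient, such as timeouts.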

For (2), one idea is to make a plan before we start executing commands. Our wasp deploy fly launch command would first print the plan: the list of commands it is going to run, in order, with some explanations for the user. Then, if the whole thing breaks at some point, the user can in theory pick up on their own from that point and run the rest of the commands from the plan.
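
A rough sketch of what "plan first, then execute" could look like. The types and function names are hypothetical, and "fly ..." is only a placeholder for whatever commands the CLI really runs.

```typescript
// All names here are hypothetical; "fly ..." stands in for the real commands.
type PlanStep = {
  title: string;   // explanation shown to the user
  command: string; // command the user could run by hand
  run: () => void; // throws if the step fails
};

// Prints steps starting at index `from`, keeping their original numbering.
function formatPlan(steps: PlanStep[], from = 0): string {
  return steps
    .slice(from)
    .map((step, i) => `${from + i + 1}. ${step.title}\n   $ ${step.command}`)
    .join("\n");
}

// Runs steps in order; returns the index of the failed step, or -1 if all passed.
function executePlan(steps: PlanStep[]): number {
  for (let i = 0; i < steps.length; i++) {
    try {
      steps[i].run();
    } catch {
      return i;
    }
  }
  return -1;
}

const plan: PlanStep[] = [
  { title: "Create the server app", command: "fly ...", run: () => {} },
  {
    title: "Attach the database to the server",
    command: "fly ...",
    run: () => {
      throw new Error("operation timed out");
    },
  },
  { title: "Deploy", command: "wasp deploy fly deploy", run: () => {} },
];

console.log(formatPlan(plan));
const failedIndex = executePlan(plan);
if (failedIndex >= 0) {
  console.log(`\nStep ${failedIndex + 1} failed. To continue by hand, run:`);
  console.log(formatPlan(plan, failedIndex));
}
```

On failure the user sees the failed step and everything after it, with the same numbers as in the plan printed up front.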

Alternatively, Wasp CLI could itself be smart enough to recognize where the whole process broke and continue from there -> either by learning from Fly which resources are already allocated and skipping those, or maybe even by recording its progress locally on disk.
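
The "remember progress locally" variant could be sketched roughly like this. ProgressStore, runResumable, and the step ids are all made up for illustration. The store is in-memory here so the example runs as-is; a real one would presumably read and write a small file on disk.

```typescript
// Hypothetical sketch; the Wasp CLI does not necessarily work this way.
type ResumableStep = { id: string; run: () => void };

// Where progress is remembered between runs.
interface ProgressStore {
  load(): string[];
  save(completedIds: string[]): void;
}

function inMemoryStore(): ProgressStore {
  let saved: string[] = [];
  return {
    load: () => saved.slice(),
    save: (ids) => {
      saved = ids.slice();
    },
  };
}

// Runs every step not yet recorded as completed. A throwing step aborts the
// run, but progress up to that point has already been saved.
function runResumable(steps: ResumableStep[], store: ProgressStore): void {
  const completed = store.load();
  for (const step of steps) {
    if (completed.indexOf(step.id) !== -1) continue;
    step.run();
    completed.push(step.id);
    store.save(completed);
  }
}

// Usage: the first run dies while attaching the db, the second run resumes.
const store = inMemoryStore();
const log: string[] = [];
let dbReady = false;
const steps: ResumableStep[] = [
  { id: "create-server", run: () => { log.push("create-server"); } },
  {
    id: "attach-db",
    run: () => {
      if (!dbReady) throw new Error("operation timed out");
      log.push("attach-db");
    },
  },
  { id: "deploy", run: () => { log.push("deploy"); } },
];

try {
  runResumable(steps, store);
} catch {
  // First run fails at attach-db; create-server is already recorded.
}
dbReady = true;
runResumable(steps, store); // skips create-server, continues from attach-db
```

The downside of local state is that it can drift from what actually exists on Fly, which is an argument for asking Fly about allocated resources instead.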

A third option is to roll back: if wasp deploy fly launch fails, we would remove all the resources allocated so far, so that the whole command becomes transactional and the next run starts with a clean slate. That said, this sounds tricky: what if deleting resources also fails? So I am not so sure about this approach.
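
For the roll-back option, a minimal sketch of how the "what if deleting also fails?" case might be handled: undo in reverse order, keep going when an undo fails, and report whatever could not be removed. All names are hypothetical.

```typescript
// Hypothetical sketch of a transactional launch; names are illustrative only.
type TxStep = {
  name: string;
  apply: () => void; // allocates a resource; throws on failure
  undo: () => void;  // removes that resource; may itself throw
};

type TxResult = {
  ok: boolean;
  rolledBack: string[]; // resources that were removed again
  leftovers: string[];  // resources whose removal failed; the user must clean these up
};

function runTransactional(steps: TxStep[]): TxResult {
  const applied: TxStep[] = [];
  for (const step of steps) {
    try {
      step.apply();
      applied.push(step);
    } catch {
      const rolledBack: string[] = [];
      const leftovers: string[] = [];
      // Undo in reverse order; keep going even if one undo fails.
      for (const done of applied.slice().reverse()) {
        try {
          done.undo();
          rolledBack.push(done.name);
        } catch {
          leftovers.push(done.name);
        }
      }
      return { ok: false, rolledBack, leftovers };
    }
  }
  return { ok: true, rolledBack: [], leftovers: [] };
}

// Usage: attaching the db fails, and deleting the database fails too.
const outcome = runTransactional([
  { name: "server app", apply: () => {}, undo: () => {} },
  {
    name: "database",
    apply: () => {},
    undo: () => {
      throw new Error("delete failed");
    },
  },
  {
    name: "db attachment",
    apply: () => {
      throw new Error("operation timed out");
    },
    undo: () => {},
  },
]);
```

Note that this sketch does not undo the step that failed, so anything that step created before throwing would still need separate handling.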

Martinsos commented 1 year ago

For me, for example, it died while trying to attach db to the server. Connection timed out. Who knows why -> maybe it took longer for db to initialize? That is an operation we might benefit from retrying.

Error: Get "http://fdaa:2:6dba:a7b:2cc3:401d:954a:2:5500/commands/databases/wasp_thoughts_server": connect tcp [fdaa:2:6dba:a7b:2cc3:401d:954a:2]:5500: operation timed out

However, the result of that was that I had no idea what to do next.

I did manage to figure it out, but it was not easy: I had to manually run the fly ... command to attach that db (I could find it in the stdout of the failed command, so that was a plus), and then I ran wasp deploy fly deploy to finally do the deployment -> that was a bit harder to figure out.

It would be great if I knew exactly which steps worked out and at which step it failed, and I was told that I needed to somehow attach the db myself, and then I need to continue with wasp deploy fly deploy. Or, that I can delete everything and try from scratch. Something like that.

EDIT: I ended up deleting all the machines Fly created and basically started from scratch. That was also tricky, though, because our CLI refused to start over right away: it said we already had toml files on the disk.

Martinsos commented 1 year ago

Another error I got on wasp deploy fly deploy is:

https://api.machines.dev/v1/apps/wasp-thoughts-server/machines/e286512dc75038/wait?instance_id=01HCJ5JYN2Y1B280H6YYZQ28MV&state=started&timeout=60

wasp-lang / wasp

Improve how our Fly deployment via Wasp CLI handles errors #1498