phase2 / rig

Outrigger command line tool
MIT License

[META] Filesync fails silently and breaks watches #147

Open grayside opened 6 years ago

grayside commented 6 years ago

Problem

@illepic reports that filesync's silent failures leave developers working for indefinite stretches of time, uncertain whether their troubleshooting fixes are failing because the fixes are wrong or because filesync isn't carrying the changes into the container.

The result is time wasted troubleshooting after a working fix has already been found, as well as growing distrust of the filesync system.

A number of other developers have seconded this, raising the profile of this issue to seemingly the largest source of trouble for Outrigger users. Thank you to everyone who has spoken up about this problem.

Solution

With such a sweeping problem statement, it is impossible to declare a single solution. Rather, we will treat this like a "meta" issue, a bug that will require multiple changes to fully address. The definition of done should be that this problem stops being encountered for a reasonable length of time.

Related Issues

Here are the issues identified so far in support of this goal:

Use of This Issue

  1. Report specific reproduction steps that cause Unison to crash
  2. Report any steps/upgrades to rig that make your problem go away.
  3. Suggest changes to rig or the documentation here so they can be coordinated with efforts underway.
grayside commented 6 years ago

As an alternative to building a restart of the unison process inside the container (assuming it's the crashing of the server, not the client, we need to be concerned with), we could build a smarter healthcheck into our unison container image and set something up to auto-restart the container when it reports unhealthy. This Stack Overflow question has some answers on how we might do that: https://stackoverflow.com/questions/47088261/restarting-an-unhealthy-docker-container-based-on-healthcheck
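A rough sketch of that approach, assuming the image ships pgrep and following the autoheal sidecar pattern from the linked Stack Overflow answer. The image name, container name, and sync details here are placeholders, not rig's actual configuration:

```shell
# Hypothetical: run the sync container with a health check that fails
# whenever the unison server process has exited.
docker run -d --name projectsync \
  --label autoheal=true \
  --health-cmd 'pgrep -x unison || exit 1' \
  --health-interval 30s \
  --health-retries 3 \
  outrigger/unison

# The willfarrell/autoheal sidecar watches Docker events and restarts any
# container labeled autoheal=true that turns unhealthy.
docker run -d --name autoheal \
  -e AUTOHEAL_CONTAINER_LABEL=autoheal \
  -v /var/run/docker.sock:/var/run/docker.sock \
  willfarrell/autoheal
```

One caveat with this design: the health check only catches a dead server process, not a sync that is wedged while the process is still alive.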

grayside commented 6 years ago

I'm working on a rig project sync:check command to operate as a sort of doctor check for the unison process.
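A minimal sketch of what such a check could build on: retry a probe command a few times before declaring the sync broken. The wait_for helper is hypothetical, and the container name and volume path in the usage comment are placeholders:

```shell
#!/bin/sh
# Hypothetical probe helper for a sync doctor check: retry a command a
# limited number of times, returning success as soon as it passes.
wait_for() {
  # usage: wait_for <attempts> <delay_seconds> <command...>
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example doctor-style use: drop a canary file on the host and confirm it
# arrives inside the sync container (names/paths are placeholders):
#   canary=".sync-check-$$"; touch "$canary"
#   wait_for 5 2 docker exec projectsync test -e "/var/www/$canary" \
#     && echo "sync OK" || echo "sync appears broken: canary never arrived"
#   rm -f "$canary"
```

The canary-file check has the advantage of testing the sync end to end, rather than just whether a unison process exists.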

grayside commented 6 years ago

Collecting some research avenues:

mkochendorfer commented 6 years ago

This has happened to me and other developers countless times. It is incredibly frustrating and wastes untold hours going down the wrong debugging paths, when the real problem is simply that your code changes are not making it into the container. This is by far the highest-priority issue with rig today.

srjosh commented 6 years ago

I've run into this quite frequently on client work; in my case it definitely seems tied to my host machine going to sleep and waking up. It is definitely frustrating.

grayside commented 6 years ago

Note: This issue is now a mix of support request, problem research, "doctor" research, and autoheal research. I will probably split this apart in the next few days. I'm breaking the "doctor" angle out to #163.

grayside commented 6 years ago

I have converted this issue to a METABUG, please re-read the issue summary for details on what we are doing so far and what this issue should continue to be used for.

grayside commented 6 years ago

Further discussion with afflicted users has pointed out one of the major error cases is resume-from-sleep. Improved handling of sleep/suspend/hibernation operations may go a long way to address this problem.

crittermike commented 6 years ago

Some of us have gotten in the habit of just assuming it's broken, both when starting dev (for the day or after a break) and whenever something unexpected happens, and running sync:start proactively before doing anything else.

febbraro commented 6 years ago

@mikecrittenden Does that approach of always running sync:start more or less alleviate any of the unison problems?

potterme commented 6 years ago

I don't run into this with sleep alone, but I do run into it when sleeping plus changing networks, such as going from the office to home and back. In my experience, running sync:start always fixes it.

This is different from unison quitting because there are too many file changes. Raising max_user_watches might help there. It often happens when doing something that seems simple, like mv vendor vendor_old or rm -rf node_modules. Deletions seem to cause the most issues; unison sees a mv as both a file deletion and a file addition.
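For the too-many-watches case, raising the inotify limit on the Linux side (the docker-machine VM or a Linux host) might look like the following. The 524288 value is a commonly used example, not a rig-recommended setting:

```shell
# Check the current inotify watch limit (Linux only)
cat /proc/sys/fs/inotify/max_user_watches

# Raise it for the running system (value is an example, not a recommendation)
sudo sysctl fs.inotify.max_user_watches=524288

# Persist the change across reboots
echo 'fs.inotify.max_user_watches=524288' | sudo tee -a /etc/sysctl.conf
```

Note this only addresses watch exhaustion on the inotify side; it won't help with the resume-from-sleep failures discussed above.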

I'm not sure I'm in favor of something trying to auto-restart unison processes, since there have been cases where I've shut down unison on purpose. But a tool to detect a problem and notify would be useful.

Education and docs on this are definitely the most useful. Once this has happened to somebody a few times, they stop going down hour-long debugging rabbit holes and start checking unison more often, so even making it part of "rig doctor" would be helpful. It would also help for devs to think more about what is happening when they do something like mv vendor vendor_old, and why they might be doing it.

crittermike commented 6 years ago

@febbraro yeah that seems to handle it for me. Typically if I see issues now it's because I just forgot to run that command. I don't usually see it crash in the middle of doing something, but I might just be lucky.