Closed hahn-kev closed 9 months ago
This should also let us verify that #515 is fixed.
bumping up the priority of this as we had an issue where it was broken and an admin couldn't recover a user issue due to this. So while this is not frequently used it is an escape hatch so we want to be sure it's working properly.
@hahn-kev - I currently have one working test in #550 (which I'd welcome feedback on if you have any), but I'm still figuring out how I'm going to integrate the Playwright Node tests with a dotnet Send/Receive test in order to test both sides of the equation (i.e., do a Send/Receive after a project reset, all as part of the test). I think I'll end up writing a small command-line tool in C# that can trigger a Send/Receive of the Sena-3 project, and call that command-line tool from Node. Then the Playwright tests are, conceptually, the E2E integration tests, and the C# code acts as a service.
Rethinking that approach. It might be better to create a new .NET test in SendReceiveServiceTests, which drives the API via ApiTestBase (similar to how the SendNewProject test works). The main thing we're concerned about here is that the reset project ends up in a good state (file permissions, etc) after resetting, and driving it via the UI in a Playwright test adds no benefit over driving it via calling the API.
So I'll leave #550 alone for now and start a new PR to create a .NET test or two in SendReceiveServiceTests.
Yeah I think that would be a better approach than creating a tool. It might be difficult to drive the test due to using tus for the upload, but that could be covered in another tests.
I'm thinking we have a playwright test that makes sure we can do a project reset, maybe query /hg/project-code/browse?style=json-lex
to determine that an uploaded file in the zip made it where we expect.
Then a S&R based reset test that ensures we can S&R to projects after a reset, but this doesn't need to go through the UI, we could even make a dedicated endpoint for this test to simplify the reset process (as long as it's consistent with the real reset process).
A quick record of my findings so far:
If the /var/vcs/public/sena-3
repo is empty in hgresumable when I try to do a Send/Receive to it, it goes into a nasty looping failure mode where NGINX returns a 499 (client closed connection early), then Chorus Resumable keeps retrying the connection because it doesn't know how to handle a 499. (Since the LexSyncReverseProxy is involved, the "client" that closed the connection might be the other end, i.e. hgweb; I haven't managed to track this down yet). Then you end up with hgresumable consumng all available CPU until I forcefully kill it off.
Logging into hgresumable and doing ps auxw
to see what's going on reveals a bunch of processes, one per worker thread, trying to run sh -c hg tip --template "{rev}:{branches}\n"
or sh -c hg log --template "{node|short}:{branches}\n"
and each using 100% CPU. That looks like something that the resumable PHP code is trying to run, but I haven't yet looked into the PHP code to see what code path tries to run those commands, so it might be something else.
I don't know yet why NGINX is seeing a closed connection on one end, while hgresumable is seeing hg processes (which should complete very quickly) stuck at 100% CPU. I'll continue investigating, but by the nature of the problem this is going to take a little time as every time I trigger this behavior my computer slows down quite a lot.
The root cause (well, ONE of the root causes) appears to be that Mercurial's behavior on an empty project is not what the PHP code expects to receive. When a project is empty, hg log --template "{node|short}:{branches}\n"
returns an empty string, while hg tip --template "{rev}:{branches}\n"
returns the string -1:\n
, i.e. there's nothing after the colon. This triggers a bug in the isValidBase
PHP code, which is fixed by the following patch to hgresumable:
diff --git a/api/src/HgRunner.php b/api/src/HgRunner.php
index 4b186d7..ac9d6bb 100644
--- a/api/src/HgRunner.php
+++ b/api/src/HgRunner.php
@@ -204,7 +204,7 @@ class HgRunner {
$i = 0;
while($foundHash < count($hashes)) {
$revisions = $this->getRevisions($i, $q);
- if (count($revisions) == 0) {
+ if (count($revisions) == 0 || count($revisions) == 1 && $revisions[0] == "0:") {
return false;
}
foreach($revisions as $hashandbranch) {
That's not enough to get us out of the woods, though. It solves the issue of hgresumable getting stuck in a loop, but Chorus is still stuck in a loop. Because once that patch is applied to the PHP code, Chorus now goes into a loop of receiving a 400 error and retrying it, even though the 400 error is a message from Mercurial saying "Hey, your request is invalid". Specifically, inside hgresume's pullBundleChunkInternal function, there's the following snippet:
// TODO This might be bogus, the given baseHash may well be a baseHash that exists in a future push, and we don't have it right now. CP 2012-08
if (!$hg->isValidBase($baseHashes)) {
return new HgResumeResponse(HgResumeResponse::FAIL, array('Error' => 'invalid baseHash'));
}
And Chorus is supposed to see the "invalid baseHash" message in the hgresume error and give up. However, it doesn't.
We can't patch Chorus. But we can patch our deployed version of hgresume. I suspect that if you have an empty repo, ALL base hashes are valid, because they are a baseHash that exists in a future push, as Cambell's comment from 2012 mentions. I'll try that and see what happens.
I added a check for an empty repo and had pullBundleChunkInternal
return HgResumeResponse::NOCHANGE if the repo is empty. That got Chorus to proceed with pushing the bundle. However, that then made Chorus think that the resumable server had revision 2, so it didn't push a bundle with all the changes, it just pushed a bundle with the changes since revision 2. This resulted in Mercurial erroring out with:
abort: 00changelog@6f010b976aaaab1561a29893bf9820c68af47794: unknown parent
Command exited with non-zero status 255
And that, in turn, produced a nasty retry cycle between Resumable and Chorus where Resumable was repeatedly throwing a PHP exception from AsyncRunner: "PHP Fatal error: Uncaught AsyncRunnerException: Lock file '/var/cache/hgresume/8ac501c2-225b-4d6c-b365-5df157d30502.bundle.async_run' not found, process is not running". But Chorus just kept retrying that, even though it was never going to get better on the server end.
I have one more thing to try. So far I've been using our Send/Receive tests (the ModifyProjectData test, specifically) to do this testing. That uses the Send/Receive process in LfMergeBridge. But there's one more Chorus code path: the "Send this project for the first time" option in FLEx. I will have to try out that code path on a project that's been reset.
Manual testing on FLEx shows that the "Send this project for the first time" option only appears when there is no .hg
folder in the project storage, which is not what we want.
However, I did find a way to work around the issue. If you're using resumable to send/receive the project in FLEx, the project folder (default location: C:\ProgramData\SIL\FieldWorks\Projects\(project name)
) will contain a "Chorus" folder. Delete the "Chorus" folder and then everything will work. I'll do one more experiment where the only thing I delete from the Chorus folder is the "revisioncache.db" file (named "revisioncache.json" on Linux), and see if that works.
That gives us a working procedure for a .NET test of sending projects via resumable after a project reset. First create a new project (with a tiny .fwdata file, see the SendNewProject test). Then do the following steps in the test:
For the Playwright reset-project test, create a new project (with a random project code so the test can run in parallel with itself) and commit a single file into it. Verify that we can see that file via /hg/browse in hgweb. Reset project to empty. Verify the file is gone. Reset project again, but this time upload the .zip file. Verify the file is visible again in hgweb.
since this is rarely used we might break it without realizing it, so we should have a test for it.