sartography / spiff-arena

SpiffWorkflow is a software development platform for building, running, and monitoring executable diagrams
https://www.spiffworkflow.org/
GNU Lesser General Public License v2.1

PP2 - Message related error #1239

Closed: madhurrya closed this issue 3 weeks ago

madhurrya commented 2 months ago

I have noticed this error 3 times during my testing of PP2.

These are the instances:

  1. https://test.app.spiff.status.im/i/672
  2. https://test.app.spiff.status.im/i/125
  3. https://test.app.spiff.status.im/i/145

I feel it might be happening when I test with 2 instances in parallel, but I am not sure; maybe it's totally random, since I couldn't recreate it consistently.

dinithihj commented 2 months ago

Noticed the same error on dev.app.

sashayar13 commented 2 months ago

Noticed the error with the prod instances. @danfunk, can you please advise if there is a plan for how to fix the issue and how the admin can recover the Request Travel instance?

harmeet-status commented 2 months ago

@calexh-sar how many days of work is this, and what can we do to have this completed by 1st April? Perhaps we can pause something else, as we need this for the Implicit Go-Live.

burnettk commented 2 months ago

ok, the linked pull request has the fix. sasha, we think this database update query would allow existing instances to get their messages matched up so they can proceed. ideally it would be run once this fix has been deployed, so no new errors are generated:

update message_instance set status = "ready", failure_cause = NULL where status = "failed" and failure_cause like "%already been locked by%";

madhurrya commented 2 months ago

@burnettk Has the above query already been run on dev.app? Does it go in a migration script? Is it already there in the PRs you merged?

burnettk commented 2 months ago

we ran it on test.app but not dev.app. we didn't want to blow away all of the test data / broken instances. we actually didn't consider running it as a data migration automatically. that is an interesting idea.

madhurrya commented 2 months ago

Ok, thanks. Yes, I noticed that it has been updated on test.app. So the 3 instances I mentioned above have been changed from the waiting state to the user input required state.

Then I ran one of the above instances again, and when the upload evidence process was completed, the PP2 instance didn't change its state from waiting to user input required. We'll have to look into that.

https://test.app.spiff.status.im/i/145 https://test.app.spiff.status.im/i/1626

madhurrya commented 2 months ago

@burnettk If we are going to run this query in Prod, we'll have to first check the above issue.

jasquat commented 2 months ago

@madhurrya I think this is more or less working as expected.

pi 1626 fired off its last send and then completed; however, its final send did not find a corresponding message receive.

pi 145 is a little trickier. It mostly worked as expected; however, send message instance 205 found message instance 206 instead of 207, which it probably should have found. I think this is due to mi 206 having a blank dict as its correlation key. You can see some of this in the queries below.

MySQL [spiff-workflow-backend]> select counterpart_id, id, process_instance_id, message_type, status, name, correlation_keys from message_instance where JSON_EXTRACT(correlation_keys, '$.trvl_details.po_id') = "PO1453";                                                                                                                                                               
+----------------+-----+---------------------+--------------+-----------+----------------+---------------------------------------+
| counterpart_id | id  | process_instance_id | message_type | status    | name           | correlation_keys                      |
+----------------+-----+---------------------+--------------+-----------+----------------+---------------------------------------+
|            206 | 205 |                 145 | send         | completed | start_trvl_po  | {"trvl_details": {"po_id": "PO1453"}} |
|           NULL | 207 |                 145 | receive      | ready     | status_trvl_po | {"trvl_details": {"po_id": "PO1453"}} |
|            210 | 209 |                 947 | send         | completed | status_trvl_po | {"trvl_details": {"po_id": "PO1453"}} |
|            209 | 210 |                 145 | receive      | completed | status_trvl_po | {"trvl_details": {"po_id": "PO1453"}} |
+----------------+-----+---------------------+--------------+-----------+----------------+---------------------------------------+
MySQL [spiff-workflow-backend]> select * from message_instance where id = 206;
+-----+---------------------+---------------+--------------+---------+------------------+-----------+---------+----------------+---------------+-----------------------+-----------------------+
| id  | process_instance_id | name          | message_type | payload | correlation_keys | status    | user_id | counterpart_id | failure_cause | updated_at_in_seconds | created_at_in_seconds |
+-----+---------------------+---------------+--------------+---------+------------------+-----------+---------+----------------+---------------+-----------------------+-----------------------+
| 206 |                 947 | start_trvl_po | receive      | NULL    | {}               | completed |       3 |            205 | NULL          |            1710858795 |            1710858795 |
+-----+---------------------+---------------+--------------+---------+------------------+-----------+---------+----------------+---------------+-----------------------+-----------------------+

There seems to be a bug where blank dicts match too much. I think this is a known issue and may be part of other message work already in flight.
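If it helps with debugging, a query along these lines (purely illustrative, and assuming correlation_keys always holds valid JSON) should surface receive messages whose correlation key is an empty dict, i.e. the ones that can over-match the way mi 206 matched send 205:

  -- illustrative only: find receive messages with an empty correlation_keys dict
  SELECT id, process_instance_id, name, message_type, status, counterpart_id
    FROM message_instance
    WHERE message_type = "receive"
    AND JSON_LENGTH(correlation_keys) = 0;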

I recommend not letting that issue stop this ticket from moving forward since more refactors for messages should be coming that will most likely address this.

madhurrya commented 2 months ago

@jasquat Yes I think it shouldn't block the ticket.

But that query will not fix the existing instance issue if the instance is not going to work in the next iteration.

I tried another instance that got fixed by the query, https://test.app.spiff.status.im/i/672. Even though https://test.app.spiff.status.im/i/2075 completed, the 672 instance is still in the waiting state.

So even if the query changes the state of the instance with the issue, it will not work in the next round, so the user will not actually be able to upload evidence and complete the instance.

Sasha was updating Prod instances yesterday to support the new model updates we are going to release in about a week. As far as I can remember, she mentioned that there is currently one instance in the above waiting state. Please discuss with her how we are going to handle that instance in Prod.

jasquat commented 2 months ago

It looks like pi 2075 and pi 672 have different po numbers. From the pi show page, look at the message tab. If you click on the "View" link for the message instances marked "ready", you will see the numbers are different. Looking at the task data, it looks like 672 started with po PO6721 and then changed to PO6722 on the loop; however, the receive message is still waiting for PO6721. That's probably another issue within the current message system that needs to be fixed, i.e. it can't update the correlation keys for receive messages in the same instance.
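To double-check from the database side, a query in the same style as the ones above (with the two instance ids in question) should show the ready receive on 672 still carrying the old PO6721 correlation key alongside 2075's messages:

  -- illustrative only: compare correlation keys across the two instances
  SELECT id, process_instance_id, message_type, status, name, correlation_keys
    FROM message_instance
    WHERE process_instance_id IN (672, 2075)
    ORDER BY process_instance_id, id;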

madhurrya commented 2 months ago

But in the task data for both instances I see "po_id": "PO6722". I really don't have much of an idea about the messages yet. :-) But from a user's point of view, 2075 was created when I ran 672, and I would expect 672 to change state when 2075 was closed. Like you have said, it must be due to the issue you mentioned.

So due to those existing issues, I think running that query will not fix the problem, as we can't get the instance to continue after that.

I think we had a similar issue when we tried to reset to the start of that loop and execute it again with a new message. The same thing happened.

jasquat commented 2 months ago

That's the po_id at the end of the process instance. If you click on the manual task before the receive message in 672 - "Starting a new process for you" - for task guid 0db1791f-b2e1-42f6-b610-10422e2c549e, po_id will show PO6722. However, in the task modal, if you scroll down to "Task Instances" and click the view icon for the other task, it should bring up 02d15324-64aa-4c0e-9c0e-00b512daaa7d. Viewing that task's data, you will see po_id is PO6721.

So the query may fix issues if the po id doesn't change in the loop. Have you run other tests with fresh instances where the po id changes in that loop and everything works?

madhurrya commented 2 months ago

Maybe Sasha is increasing it in the loop? I really don't have much of an idea about that process implementation. Yes, the normal flow works; it has always worked, even with 3-4 iterations, except when we got the above bg lock error.

jasquat commented 2 months ago

Okay, so it looks like for every failed receive message, it created a new receive message that hasn't errored. Maybe if the query only updates send messages, it'll work better. The send messages have not been duplicated like the receives.

sashayar13 commented 2 months ago

@jasquat, the issue I have in production, which was noticed in dev and tested a few times, is unrelated to the po_id. PO_id is increased after each cycle is successfully completed.

My example in prod is that both the failed "send" and "receive" messages use the same correlation key. See the recording attached. And this is the 2nd loop, so the po_id has "2" at the end of the number.

https://github.com/sartography/spiff-arena/assets/104367255/0035eaa3-6f3c-4f1b-b3bd-6085937a3a34

Worth mentioning that at this moment this is the only case where I found the errors - so maybe there is an elegant way to handle this one case?

jasquat commented 2 months ago

@sashayar13 @madhurrya

@burnettk and I tested my theory with process instances 943 and 941 on dev.app. 943 had the send message and 941 had the receive. We updated 943's send message so it would run again, and that forced 941 to create a new receive message instance which the payload could be passed to. 941 was able to complete after that.

Note that the fact this works is itself a bug we should fix. Our initial db query should have kicked everything off properly, and the trick we just attempted should not have worked, but it did. A test for this could be added to the message-improvements-2 branch as well and fixed there.
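For reference, the manual nudge was an update along these lines (a sketch only; the id is a placeholder and the exact columns we touched may have differed):

  -- sketch: reset 943's completed send so the background processor picks it up again;
  -- clearing its counterpart_id so it can rematch may also be needed
  UPDATE message_instance
    SET status = "ready"
    WHERE id = <id of 943's send message instance>
    AND message_type = "send";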

sashayar13 commented 2 months ago

@sashayar13 @madhurrya

@burnettk and I tested my theory with process instance 943 and 941 on dev.app. 943 had the send the message and 941 had the receiving. We updated 943's send message so it would run again and it forced 941 to create a new receiving message instance which it was able to pass the payload to. 941 was able to complete after that.

Note, the fact that this works is a bug that we should fix. Our initial db query should have kicked everything off properly and the trick we just attempted should not have worked but it did. A test for this could be added to the message-improvements-2 branch as well and fixed there.

@jasquat, thanks for the update! Could you please help me understand what the plan is for a) the failed instance in prod and b) the code release with the fix?

madhurrya commented 2 months ago

@jasquat it seems you completed the instance immediately; I'm not sure whether you tried another iteration. In earlier cases, the issue appeared when I tried the next round.

Can you please apply the fix for https://dev.app.spiff.status.im/i/2116 so I can check it there? But please don't complete the instance :-) Just run your fix and let me know.

madhurrya commented 1 month ago

When you fix the above please let me know.

jasquat commented 1 month ago

@madhurrya I ran the update for message instance 1026 on https://dev.app.spiff.status.im/process-instances/manage-finance:accounts-payable:process-expense-reports:upload-expense-evidence-travel/2132?tab=3. That should push pi 2116 along.

jasquat commented 1 month ago

@sashayar13 I believe at any time now (or whenever @madhurrya okays it) someone who has access (I do not but @jakubgs should) can run this update on the prod db and it should fix the issue:

UPDATE message_instance SET status = "ready", failure_cause = NULL
  WHERE status = "failed"
  AND failure_cause LIKE "%already been locked by%"
  AND message_type = "send" ;

This can also be run to see what message instances will be updated:

SELECT * from message_instance
  WHERE status = "failed"
  AND failure_cause LIKE "%already been locked by%"
  AND message_type = "send" ;

madhurrya commented 1 month ago

@jasquat It worked this time :-)

sashayar13 commented 1 month ago

@jasquat, so we do not need to deploy any code changes to prod, and we can just run the query?

burnettk commented 1 month ago

@sashayar13 that's right. we have already released the fix that should prevent future occurrences of this issue.

jakubgs commented 1 month ago

Done on prod.

mysql> UPDATE message_instance SET status = "ready", failure_cause = NULL   WHERE status = "failed"   AND failure_cause LIKE "%already been locked by%"   AND message_type = "send" ;
Query OK, 2 rows affected (0.00 sec)
Rows matched: 2  Changed: 2  Warnings: 0

madhurrya commented 1 month ago

Didn't notice the original issue during the latest testing in test.app. That issue seems to be fixed now.