Closed: koonchaya closed this issue 5 months ago
Older related issue https://rfcx.slack.com/archives/G016Y0627QW/p1640009028003000
Check why this happens. For further investigation:
@rassokhina-e: Run a script to check for duplicate results across all projects. This can tell us how much it is impacting users.
@koonchaya: Rerun the job. This will show whether the duplicates happen again.
Test job created - job name: Test duplicate job https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/58470
I checked 3 pages of the results and did not see any duplicates.
I found the root cause for the job https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/57840: there was a connection problem with the MySQL db, and the lambda retried putting the ROIs into the db again and again. My point is that some of those insert requests had actually succeeded, and for this reason the job has duplicates.
Job ID: 84670
The original error: Lock wait timeout exceeded; try restarting transaction
The lambda then logs a reconnection attempt: Trying MySQL reconnection
The ROI insert batch size is 40 rows, which is already small, so we do not need to decrease it.
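A minimal sketch of this failure mode (the recording_id and x1 values are taken from the case discussed later in this thread; the column list and score value are assumptions):

-- the lambda inserts a batch of ROIs and one statement hits a lock wait timeout
INSERT INTO arbimon2.pattern_matching_rois (pattern_matching_id, recording_id, x1, score)
VALUES (57840, 121015041, 26.368, 0.97);
-- ERROR 1205: Lock wait timeout exceeded; try restarting transaction
-- By default MySQL rolls back only the timed-out statement, so inserts that
-- succeeded earlier in the batch may already be committed. When the lambda
-- reconnects and replays the whole batch, those rows are inserted a second
-- time, which produces the duplicates.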
Run a script to check for duplicate results across all projects. This can tell us how much it is impacting users.
I've run this query, and I guess we can check this year, but it takes a day to do it: there are a lot of jobs, and a single query cannot cover them all.
I found 32 jobs (from 7 to 12 March) where ROIs are duplicated. Some of the jobs have only a small number of duplicates (2 ROIs, 6 ROIs), while the PM ROI insert batch size is 40 rows, which points to a problem in the analysis code rather than the MySQL connection errors.
script:
SELECT pmr.pattern_matching_id, pmr.recording_id, pmr.x1, pmr.score, COUNT(*) AS dup_count
FROM arbimon2.pattern_matching_rois pmr
WHERE pmr.pattern_matching_id IN (....)
GROUP BY pmr.pattern_matching_id, pmr.recording_id, pmr.x1, pmr.score
HAVING COUNT(*) >= 2;
The duplicated jobs: 58081, 58084, 58080, 58020, 58021, 58028, 58038, 58042, 58043, 58044, 58045, 58050, 58051, 58055, 58056, 58057, 58058, 58060, 58068, 58069, 58071, 57987, 57991, 57992, 57990, 57993, 57994, 57995, 57996, 57997, 57999, 58001, 58005, 57998, 57973, 57974, 57975, 57976, 57977, 57979, 57980, 57983, 57986
All of these PM jobs have the same MySQL connection problem.
Nelson will check the analysis code today
We can see that there was a "Lock wait timeout exceeded" for the insert with recording_id = 121015041 and x1 = 26.368. If we check the database, there are 2 rows for this entry:
SELECT * FROM pattern_matching_rois WHERE recording_id = 121015041 AND pattern_matching_id = 57840 AND x1 = 26.368;
@rassokhina-e, can you please check 5 other cases with "Lock wait timeout exceeded": do they have duplicated rows? If so, then we can confidently assume that the duplicate rows occur when "Lock wait timeout exceeded" is the response to the insert, and I can think of a few possible fixes.
I think the best solution is to use the INSERT IGNORE feature of MySQL, which will ignore any errors caused by duplicate keys. To get this to work we need a UNIQUE constraint on (pattern_matching_id, recording_id, x1).
Proposed resolution steps (sketched below):
1. Delete all duplicate rows from the table
2. Add the unique constraint
3. Update the code to use INSERT IGNORE
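A hedged sketch of these steps in SQL; the id column name pattern_matching_roi_id and the index name are assumptions, not confirmed in this thread:

-- 1. delete duplicate rows, keeping the lowest-id row of each group
DELETE pmr
FROM arbimon2.pattern_matching_rois pmr
JOIN arbimon2.pattern_matching_rois dup
  ON dup.pattern_matching_id = pmr.pattern_matching_id
  AND dup.recording_id = pmr.recording_id
  AND dup.x1 = pmr.x1
  AND dup.pattern_matching_roi_id < pmr.pattern_matching_roi_id; -- assumed id column
-- 2. add the unique constraint so the same ROI cannot be stored twice
ALTER TABLE arbimon2.pattern_matching_rois
  ADD UNIQUE INDEX uq_pm_recording_x1 (pattern_matching_id, recording_id, x1);
-- 3. switch the lambda's inserts to INSERT IGNORE so a replayed batch
--    skips rows that already exist instead of duplicating them
INSERT IGNORE INTO arbimon2.pattern_matching_rois
  (pattern_matching_id, recording_id, x1, score)
VALUES (57840, 121015041, 26.368, 0.97);

One caveat on the design: INSERT IGNORE downgrades all ignorable errors to warnings, not only duplicate-key errors, so the big test jobs should also confirm that no legitimate rows get silently dropped.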
FYI @rassokhina-e @koonchaya, I think this will need at least a day or two to carefully update the database, and then some more days of testing the new code by running some big jobs.
We should also test all of this on staging before deploying to production.
can you please check 5 other cases with "Lock wait timeout exceeded": do they have duplicated rows? If so, then we can confidently assume that the duplicate rows occur when "Lock wait timeout exceeded" is the response to the insert, and I can think of a few possible fixes.
I'm trying to find the relevant lambda logs and see this issue in a lot of places, also related to the MySQL connection.
Unfortunately, the CloudWatch reports have time gaps, and the reports are split second by second with the logs inside, so it's really complicated to find the duplicates there. I traced the 2 latest jobs, but the result wasn't successful.
I think the best solution is to use the INSERT IGNORE feature of MySQL, which will ignore any errors caused by duplicate keys. To get this to work we need a UNIQUE constraint on (pattern_matching_id, recording_id, x1).
Proposed resolution steps:
1. Delete all duplicate rows from the table
2. Add the unique constraint
3. Update the code to use INSERT IGNORE
Other possible solutions:
@antonyharfield Can we share your latest proposal with Stas? Can he support us with this logic?
Stas's suggestion from the eng connect call today:
Before sending the completed status to the MySQL db from the PM lambda, check for duplicates and remove them from the db if they exist for this particular job.
@antonyharfield
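A sketch of Stas's suggestion as a pre-completion cleanup query; :job_id is a placeholder for the job being finalized, and the id column pattern_matching_roi_id is an assumed name:

-- before the PM lambda marks the job completed, remove duplicate ROIs for that job
DELETE pmr
FROM arbimon2.pattern_matching_rois pmr
JOIN (
  -- keep one row (the lowest assumed id) per (recording_id, x1) group
  SELECT MIN(pattern_matching_roi_id) AS keep_id, recording_id, x1
  FROM arbimon2.pattern_matching_rois
  WHERE pattern_matching_id = :job_id
  GROUP BY recording_id, x1
) keep
  ON keep.recording_id = pmr.recording_id
  AND keep.x1 = pmr.x1
WHERE pmr.pattern_matching_id = :job_id
  AND pmr.pattern_matching_roi_id <> keep.keep_id;

Unlike the unique-constraint approach, this only cleans up duplicates at the end of each job, so the INSERT IGNORE proposal above would still be the more robust long-term fix.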
@antonyharfield @rassokhina-e Do we have any further updates on the fix for this issue?
I guess we haven't planned to fix it yet
Original post: https://rfcx.slack.com/archives/C03FD1WD02J/p1709734386019869 From: @tomaznmelo
Issue description: I just noticed that the matches resulting from the PM jobs in the Colombia Humboldt project (https://arbimon.org/project/colombia-audiomoth-humboldt-inst) are duplicated, with the exact same score and timeframe (see image in the comment). This is not good because, for training the CNN, we need to validate distinct ROIs/matches as present to feed the model.
Steps to reproduce: Go to this PM job, for instance, and check that some matches have the same image, score, and timeframe: https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/57840
Expected result: All matches are different
Actual result: Some matches are duplicated.
Additional details: https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/57840