rfcx / arbimon-legacy

https://arbimon.org
Apache License 2.0

Duplicate results in the PM job #1469

Closed by koonchaya 5 months ago

koonchaya commented 8 months ago

Original post: https://rfcx.slack.com/archives/C03FD1WD02J/p1709734386019869 From: @tomaznmelo

Issue description: I just noticed that the matches resulting from the PM jobs in the Colombia Humboldt project (https://arbimon.org/project/colombia-audiomoth-humboldt-inst) are duplicated, with the exact same score and timeframe (see image in the comment). This is a problem because, to train the CNN, we need to validate different ROIs/matches as present to feed the model.

Steps to reproduce: Go to this PM job for instance and check that some matches have the same image, score and timeframe: https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/57840

Expected result: All matches are different

Actual result: Some matches are duplicated.

Additional details: https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/57840

koonchaya commented 7 months ago

Older related issue https://rfcx.slack.com/archives/G016Y0627QW/p1640009028003000

koonchaya commented 7 months ago

Check why this happens. For further investigation:

- @rassokhina-e: Run a script to check for duplicate results across all projects. This will tell us how much it is impacting users.
- @koonchaya: Rerun the job. This will show whether the duplicates happen again.

koonchaya commented 7 months ago

Test job created - job name: Test duplicate job https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/58470

I checked 3 pages of the results and did not see any duplicates.

rassokhina-e commented 7 months ago

I found the root cause for job https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/57840: there was a connection problem to the MySQL db, and the lambda kept retrying the inserts of the ROIs. Some of those insert requests did succeed, which is why the job has duplicates.

Job ID: 84670. The original error: "Lock wait timeout exceeded; try restarting transaction". The lambda then logs that it is reconnecting to the db: "Trying MySQL reconnection".

[screenshot]

rassokhina-e commented 7 months ago

The ROI insert batch size is 40 rows; this is a small count, so we do not need to decrease it.

[screenshot]

rassokhina-e commented 7 months ago

Run a script to check for duplicate results across all projects. This will tell us how much it is impacting users.

I've run this query, and I think we could check this year's jobs, but it takes about a day; there are a lot of jobs, and a single query cannot cover them all.

rassokhina-e commented 7 months ago

I found 32 jobs (from 7 to 12 March) where ROIs are duplicated. Some of the jobs have a small number of duplicates (2 ROIs, 6 ROIs), while the PM ROI insert batch size is 40 rows, which suggests the problem is in the analysis code, not in the MySQL connection error.

script:

-- list ROI groups (same job, recording, x1 and score) that appear more than once
SELECT pmr.pattern_matching_id, pmr.recording_id, pmr.x1, pmr.score, COUNT(*) AS dup_count
FROM arbimon2.pattern_matching_rois pmr
WHERE pmr.pattern_matching_id IN (....)
GROUP BY pmr.pattern_matching_id, pmr.recording_id, pmr.x1, pmr.score
HAVING COUNT(*) >= 2;

The duplicated jobs: 58081, 58084, 58080, 58020, 58021, 58028, 58038, 58042, 58043, 58044, 58045, 58050, 58051, 58055, 58056, 58057, 58058, 58060, 58068, 58069, 58071, 57987, 57991, 57992, 57990, 57993, 57994, 57995, 57996, 57997, 57999, 58001, 58005, 57998, 57973, 57974, 57975, 57976, 57977, 57979, 57980, 57983, 57986

[screenshot]

rassokhina-e commented 7 months ago

All of the PM jobs above have the same problem with the connection to MySQL.

[screenshots]

rassokhina-e commented 7 months ago

Nelson will check the analysis code today

[screenshot]

antonyharfield commented 7 months ago

We can see that there was a "lock wait timeout exceeded" for the insert with recording_id = 121015041 and x1 = 26.368. If we check the database then there are 2 rows for this entry:

select * from pattern_matching_rois where recording_id = 121015041 and pattern_matching_id = 57840 and x1 = 26.368;

@rassokhina-e, can you please check 5 other cases with "lock wait timeout exceeded" and see whether they have duplicated rows? If so, then we can confidently assume that the duplicate rows occur when "lock wait timeout exceeded" is the response to the insert, and I can think of a few possible fixes.
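
One way to surface those cases (a sketch only, using the same table and columns as the query above, not necessarily the exact check that was run): list every (recording_id, x1) in job 57840 that appears more than once, then compare that list against the "lock wait timeout exceeded" log entries.

-- duplicates within job 57840: same recording and x1 stored more than once
SELECT recording_id, x1, COUNT(*) AS copies
FROM pattern_matching_rois
WHERE pattern_matching_id = 57840
GROUP BY recording_id, x1
HAVING COUNT(*) > 1
ORDER BY copies DESC;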

antonyharfield commented 7 months ago

I think the best solution is to use the INSERT IGNORE feature of MySQL, which will ignore any errors caused by duplicate keys. To get this to work we need a UNIQUE constraint on (pattern_matching_id, recording_id, x1).

Proposed resolution steps:

1. Delete all duplicate rows from the table
2. Add a unique constraint
3. Update code to use INSERT IGNORE
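
A minimal sketch of steps 2 and 3, assuming the table and columns named above; the index name uq_pm_roi, the ? bind parameters, and the shortened column list are illustrative, not the actual schema or code:

-- 2. unique constraint so a retried insert cannot create a second row
ALTER TABLE pattern_matching_rois
  ADD UNIQUE KEY uq_pm_roi (pattern_matching_id, recording_id, x1);

-- 3. with the constraint in place, retries after "lock wait timeout"
--    become no-ops instead of duplicates
INSERT IGNORE INTO pattern_matching_rois
  (pattern_matching_id, recording_id, x1, score)  -- other columns omitted
VALUES (?, ?, ?, ?);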

antonyharfield commented 7 months ago

FYI @rassokhina-e @koonchaya, I think this will need at least a day or two to carefully update the database, and then some more days of testing the new code by running some big jobs.

We should also test all of this on staging before deploying to production.

rassokhina-e commented 7 months ago

can you please check 5 other cases with "lock wait timeout exceeded" and see whether they have duplicated rows? If so, then we can confidently assume that the duplicate rows occur when "lock wait timeout exceeded" is the response to the insert, and I can think of a few possible fixes.

I'm looking through the lambda logs and I see a lot of places with this issue, also related to the MySQL connection.

[screenshot]

rassokhina-e commented 7 months ago

Unfortunately, the CloudWatch reports have time gaps, and they are split second by second with the logs inside, so it's really complicated to find the duplicates there. I followed the 2 latest jobs, but without success.

[screenshot]

rassokhina-e commented 7 months ago

I think the best solution is to use the INSERT IGNORE feature of MySQL, which will ignore any errors caused by duplicate keys. To get this to work we need a UNIQUE constraint on (pattern_matching_id, recording_id, x1).

Proposed resolution steps:

1. Delete all duplicate rows from the table
2. Add a unique constraint
3. Update code to use INSERT IGNORE
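
Step 1 (removing the existing duplicates while keeping one row per group) could look roughly like this; the primary-key column name pattern_matching_roi_id is an assumption about the schema:

-- keep the lowest-id row in each (job, recording, x1) group, delete the rest
DELETE older
FROM pattern_matching_rois older
JOIN pattern_matching_rois keeper
  ON  keeper.pattern_matching_id = older.pattern_matching_id
  AND keeper.recording_id        = older.recording_id
  AND keeper.x1                  = older.x1
  AND keeper.pattern_matching_roi_id < older.pattern_matching_roi_id;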

antonyharfield commented 7 months ago

Other possible solutions:

  1. Clean up duplicates at the end of the job. We could have an additional lambda that queries for duplicates and then deletes them.
  2. Don't insert results into the db during processing; save them to CSV files in S3 instead. Then an additional lambda at the end of the job loops over the CSV files and inserts them into the db in batches (retrying if the db is unavailable/overloaded), as in the sketch below.
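
For option 2, the final load could be a bulk insert per CSV file, roughly as sketched below; the file path and column list are placeholders, and the IGNORE keyword assumes the unique constraint proposed above is in place so re-loaded rows are skipped rather than duplicated:

-- bulk-load one results file; rows hitting the unique key are ignored
LOAD DATA LOCAL INFILE '/tmp/pm_results_chunk.csv'
IGNORE INTO TABLE pattern_matching_rois
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(pattern_matching_id, recording_id, x1, score);
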
rassokhina-e commented 7 months ago

@antonyharfield Can we share your latest proposal with Stas? Can he support us with this logic?

rassokhina-e commented 7 months ago

Stas's suggestion from the eng connect call today:

Before the PM lambda sends the completed status to the MySQL db, check for duplicates and remove them if they exist in the db for this particular job.

@antonyharfield
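
Scoped to a single job, that check-and-remove step would be the same dedupe pattern sketched above, run just before the job is marked complete; @job_id and the primary-key column name are placeholders:

-- remove extra copies for one job only, keeping the lowest-id row per group
DELETE older
FROM pattern_matching_rois older
JOIN pattern_matching_rois keeper
  ON  keeper.pattern_matching_id = older.pattern_matching_id
  AND keeper.recording_id        = older.recording_id
  AND keeper.x1                  = older.x1
  AND keeper.pattern_matching_roi_id < older.pattern_matching_roi_id
WHERE older.pattern_matching_id = @job_id;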

koonchaya commented 7 months ago

@antonyharfield @rassokhina-e Do we have any further updates on the fix for this issue?

rassokhina-e commented 6 months ago

I guess we haven't planned to fix it yet