rfcx / arbimon-legacy

https://arbimon.org
Apache License 2.0

Duplicate results in the PM job #1469

Closed by koonchaya 5 months ago

koonchaya commented 8 months ago

Original post: https://rfcx.slack.com/archives/C03FD1WD02J/p1709734386019869 From: @tomaznmelo

Issue description: I just noticed that the matches resulting from the PM jobs in the Colombia Humboldt project (https://arbimon.org/project/colombia-audiomoth-humboldt-inst) are duplicated, with the exact same score and timeframe (see image in the comment). This is a problem because, to train the CNN, we need to validate different ROIs/matches as present to feed the model.

Steps to reproduce: Go to this PM job for instance and check that some matches have the same image, score and timeframe: https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/57840

Expected result: All matches are different

Actual result: Some matches are duplicated.

Additional details: https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/57840

koonchaya commented 7 months ago

Older related issue https://rfcx.slack.com/archives/G016Y0627QW/p1640009028003000

koonchaya commented 7 months ago

Check why this happens. For further investigation:

- @rassokhina-e: Run a script to check for duplicate results across all projects. This will tell us how much it is impacting users.
- @koonchaya: Rerun the job. This will show whether the duplicates happen again.

koonchaya commented 7 months ago

Test job created - job name: Test duplicate job https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/58470

I checked 3 pages of the results and did not see any duplicates.

rassokhina-e commented 7 months ago

I found the root cause for job https://arbimon.org/project/colombia-audiomoth-humboldt-inst/analysis/patternmatching/57840: there was a connection problem to the MySQL db, and the lambda kept retrying the inserts of the ROIs. Some of those insert requests did succeed, which is why the job has duplicates.

Job ID: 84670. The original error: "Lock wait timeout exceeded; try restarting transaction". The lambda then logs that it is reconnecting to the db: "Trying MySQL reconnection".

[screenshot]

rassokhina-e commented 7 months ago

The ROI insert batch size is 40 rows; this is a small count, so we do not need to decrease it.

[screenshot]

rassokhina-e commented 7 months ago

Run a script to check for duplicate results across all projects. This will tell us how much it is impacting users.

I've run this query, and I think we could check this year's jobs, but it takes about a day; there are a lot of jobs, and a single query cannot cover them all.

rassokhina-e commented 7 months ago

I found 32 jobs (from 7 to 12 March) where ROIs are duplicated. Some of the jobs have a small number of duplicates (2 ROIs, 6 ROIs), while the PM ROI insert batch size is 40 rows, which suggests the problem is in the analysis code, not in the MySQL connection error.

script:

-- list ROI groups (same job, recording, x1 and score) that appear more than once
SELECT pmr.pattern_matching_id, pmr.recording_id, pmr.x1, pmr.score, COUNT(*) AS dup_count
FROM arbimon2.pattern_matching_rois pmr
WHERE pmr.pattern_matching_id IN (....)
GROUP BY pmr.pattern_matching_id, pmr.recording_id, pmr.x1, pmr.score
HAVING COUNT(*) >= 2;

The duplicated jobs: 58081, 58084, 58080, 58020, 58021, 58028, 58038, 58042, 58043, 58044, 58045, 58050, 58051, 58055, 58056, 58057, 58058, 58060, 58068, 58069, 58071, 57987, 57991, 57992, 57990, 57993, 57994, 57995, 57996, 57997, 57999, 58001, 58005, 57998, 57973, 57974, 57975, 57976, 57977, 57979, 57980, 57983, 57986

[screenshot]

rassokhina-e commented 7 months ago

All of the PM jobs above have the same problem with the connection to MySQL.

[screenshots]

rassokhina-e commented 7 months ago

Nelson will check the analysis code today

[screenshot]

antonyharfield commented 7 months ago

We can see that there was a "lock wait timeout exceeded" for the insert with recording_id = 121015041 and x1 = 26.368. If we check the database then there are 2 rows for this entry:

select * from pattern_matching_rois where recording_id = 121015041 and pattern_matching_id = 57840 and x1 = 26.368;

@rassokhina-e, can you please check 5 other cases with "lock wait timeout exceeded" and see whether they have duplicated rows? If so, then we can confidently assume that the duplicate rows occur when "lock wait timeout exceeded" is the response to the insert, and I can think of a few possible fixes.
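
One way to surface those cases (a sketch only, using the same table and columns as the query above, not necessarily the exact check that was run): list every (recording_id, x1) in job 57840 that appears more than once, then compare that list against the "lock wait timeout exceeded" log entries.

-- duplicates within job 57840: same recording and x1 stored more than once
SELECT recording_id, x1, COUNT(*) AS copies
FROM pattern_matching_rois
WHERE pattern_matching_id = 57840
GROUP BY recording_id, x1
HAVING COUNT(*) > 1
ORDER BY copies DESC;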

antonyharfield commented 7 months ago

I think the best solution is to use the INSERT IGNORE feature of MySQL, which will ignore any errors caused by duplicate keys. To get this to work we need a UNIQUE constraint on (pattern_matching_id, recording_id, x1).

Proposed resolution steps:

1. Delete all duplicate rows from the table
2. Add a unique constraint
3. Update code to use INSERT IGNORE
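
A minimal sketch of steps 2 and 3, assuming the table and columns named above; the index name uq_pm_roi, the ? bind parameters, and the shortened column list are illustrative, not the actual schema or code:

-- 2. unique constraint so a retried insert cannot create a second row
ALTER TABLE pattern_matching_rois
  ADD UNIQUE KEY uq_pm_roi (pattern_matching_id, recording_id, x1);

-- 3. with the constraint in place, retries after "lock wait timeout"
--    become no-ops instead of duplicates
INSERT IGNORE INTO pattern_matching_rois
  (pattern_matching_id, recording_id, x1, score)  -- other columns omitted
VALUES (?, ?, ?, ?);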

antonyharfield commented 7 months ago

FYI @rassokhina-e @koonchaya, I think this will need at least a day or two to carefully update the database, and then some more days of testing the new code by running some big jobs.

We should also test all of this on staging before deploying to production.

rassokhina-e commented 7 months ago

can you please check 5 other cases with "lock wait timeout exceeded" and see whether they have duplicated rows? If so, then we can confidently assume that the duplicate rows occur when "lock wait timeout exceeded" is the response to the insert, and I can think of a few possible fixes.

I'm looking through the lambda logs and I see a lot of places with this issue, also related to the MySQL connection.

[screenshot]

rassokhina-e commented 7 months ago

Unfortunately, the CloudWatch reports have time gaps, and they are split second by second with the logs inside, so it's really complicated to find the duplicates there. I followed the 2 latest jobs, but without success.

[screenshot]

rassokhina-e commented 7 months ago

I think the best solution is to use the INSERT IGNORE feature of MySQL, which will ignore any errors caused by duplicate keys. To get this to work we need a UNIQUE constraint on (pattern_matching_id, recording_id, x1).

Proposed resolution steps:

1. Delete all duplicate rows from the table
2. Add a unique constraint
3. Update code to use INSERT IGNORE
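
Step 1 (removing the existing duplicates while keeping one row per group) could look roughly like this; the primary-key column name pattern_matching_roi_id is an assumption about the schema:

-- keep the lowest-id row in each (job, recording, x1) group, delete the rest
DELETE older
FROM pattern_matching_rois older
JOIN pattern_matching_rois keeper
  ON  keeper.pattern_matching_id = older.pattern_matching_id
  AND keeper.recording_id        = older.recording_id
  AND keeper.x1                  = older.x1
  AND keeper.pattern_matching_roi_id < older.pattern_matching_roi_id;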

antonyharfield commented 7 months ago

Other possible solutions:

  1. Clean up duplicates at the end of the job. We could have an additional lambda that queries for duplicates and then deletes them.
  2. Don't insert results into the db during processing; save them to CSV files in S3 instead. Then an additional lambda at the end of the job loops over the CSV files and inserts them into the db in batches (retrying if the db is unavailable/overloaded), as in the sketch below.
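
For option 2, the final load could be a bulk insert per CSV file, roughly as sketched below; the file path and column list are placeholders, and the IGNORE keyword assumes the unique constraint proposed above is in place so re-loaded rows are skipped rather than duplicated:

-- bulk-load one results file; rows hitting the unique key are ignored
LOAD DATA LOCAL INFILE '/tmp/pm_results_chunk.csv'
IGNORE INTO TABLE pattern_matching_rois
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(pattern_matching_id, recording_id, x1, score);
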
rassokhina-e commented 7 months ago

@antonyharfield Can we share your latest proposal with Stas? Can he support us with this logic?

rassokhina-e commented 7 months ago

Stas's suggestion from the eng connect call today:

Before the PM lambda sends the completed status to the MySQL db, check for duplicates and remove them if they exist in the db for this particular job.

@antonyharfield
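
Scoped to a single job, that check-and-remove step would be the same dedupe pattern sketched above, run just before the job is marked complete; @job_id and the primary-key column name are placeholders:

-- remove extra copies for one job only, keeping the lowest-id row per group
DELETE older
FROM pattern_matching_rois older
JOIN pattern_matching_rois keeper
  ON  keeper.pattern_matching_id = older.pattern_matching_id
  AND keeper.recording_id        = older.recording_id
  AND keeper.x1                  = older.x1
  AND keeper.pattern_matching_roi_id < older.pattern_matching_roi_id
WHERE older.pattern_matching_id = @job_id;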

koonchaya commented 7 months ago

@antonyharfield @rassokhina-e Do we have any further updates on the fix for this issue?

rassokhina-e commented 6 months ago

I guess we haven't planned to fix it yet