rfcx / arbimon

Ecoacoustic analysis platform empowering conservationists to analyze acoustic data and to derive insights about the ecosystem at scale
https://arbimon.org
Apache License 2.0

Investigate project backup export from biosoundscape project #2037

Closed: koonchaya closed this issue 3 hours ago

koonchaya commented 2 weeks ago

Original report: https://rfcx.slack.com/archives/C03FD1WD02J/p1718028385095719 @carlybatist

A user reported that the number of recordings and templates exported from the project didn't match the data in the project. Project: https://arbimon.org/p/biosoundscape/overview

"When I made a backup of our BioSoundSCape project, the zip did not include the recordings.csv that has the AWS links to recordings. This file is invaluable. Also, it doesn't appear that we got a full list of templates in templates.csv. Maybe this is because there is a lot of data in BioSoundSCape?" This project - https://arbimon.org/p/biosoundscape/overview

Additional information: I exported the files from the project and found that the number of recordings and templates didn't match. Export file: https://drive.google.com/file/d/1_l2LmgDrn_CIMDx23Bq5JWIlVhmF15A2/view?usp=sharing

grindarius commented 1 week ago

I dug through the logs and found that the error is at the SQL level. Most statements failed to run, likely because the server was at its maximum load at the time. See the error screenshots below.

[Screenshots of the error logs: 2567-06-14 at 13:03:13, 13:03:36, 13:03:49, 13:04:04]

The error comes from the legacy code. Because our backup system is designed to be fault-tolerant, the export still completes even if every query fails. However, once a batch fails, the following batches are not queried, so if the first batch fails you get nothing.

The underlying problem is that the queries took too long to run. I think there are a couple of places where we can improve this.

carlybatist commented 1 week ago

@grindarius OK, how long do you anticipate it will take to implement these fixes so that we can check whether they resolve the issue? We need large projects to work with the backup. And if there is an error that means not all rows will show up, the user needs to see an error message saying so. The user only realized this was a problem when they double-checked the CSVs against the project data.

antonyharfield commented 1 week ago

Reduce database chunk query to something like 50k

That sounds reasonable.

This query is very light: the only ordering is by PK, so it shouldn't need to do much work to read this data. It more likely failed because there was a lot of db activity at the same time. We could try exponential backoff: if a query fails, retry after 10 sec, then 20 sec, then 40 sec, then 80 sec, and only then fail completely.
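A minimal sketch of that retry schedule, in Python for illustration only (it is not the Arbimon implementation). The name `run_with_backoff` and the `query` callable are hypothetical; a real version would probably catch only the database driver's transient errors rather than every exception.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_backoff(
    query: Callable[[], T],
    delays: Sequence[float] = (10, 20, 40, 80),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `query`; after each failure wait 10, 20, 40, then 80 seconds and retry.

    If the attempt after the last delay also fails, its exception propagates,
    so the caller can fail the job.
    """
    for delay in delays:
        try:
            return query()
        except Exception:  # assumption: any error is treated as transient
            sleep(delay)
    # Final attempt after the last wait; an error here is not caught.
    return query()
```

With the default schedule a query gets five attempts in total, and the worst case adds 150 seconds of waiting before the failure is reported. The `sleep` parameter exists only so the waits can be replaced in tests.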

If one of the queries fails then I think the whole job should fail -- we don't want to continue and send the user incomplete data.
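A rough sketch of a chunked export that aborts on the first failed query instead of returning a partial file. It uses Python's `sqlite3` as a stand-in for the real database, and `export_table`, `ExportFailed`, and paging by "PK greater than the last one seen" are assumptions made for illustration, not the existing export code.

```python
import csv
import io
import sqlite3


class ExportFailed(Exception):
    """Raised when any chunk query fails, so no partial CSV is produced."""


def export_table(conn: sqlite3.Connection, table: str, pk: str,
                 chunk_size: int = 50_000) -> str:
    """Read `table` in PK order, `chunk_size` rows at a time, and return CSV text.

    `table` and `pk` are interpolated into the SQL, so they must be trusted
    internal identifiers, never user input.
    """
    out = io.StringIO()
    writer = None
    pk_index = 0
    last_pk = None
    while True:
        try:
            if last_pk is None:
                cur = conn.execute(
                    f"SELECT * FROM {table} ORDER BY {pk} LIMIT ?",
                    (chunk_size,),
                )
            else:
                cur = conn.execute(
                    f"SELECT * FROM {table} WHERE {pk} > ? ORDER BY {pk} LIMIT ?",
                    (last_pk, chunk_size),
                )
            rows = cur.fetchall()
        except sqlite3.Error as err:
            # Abort the whole export rather than continue with missing rows.
            raise ExportFailed(f"export of {table} failed") from err
        if writer is None:
            # Write the header once, from the first result set's column names.
            columns = [col[0] for col in cur.description]
            pk_index = columns.index(pk)
            writer = csv.writer(out)
            writer.writerow(columns)
        writer.writerows(rows)
        if len(rows) < chunk_size:
            break  # a short (or empty) chunk means the table is exhausted
        last_pk = rows[-1][pk_index]
    return out.getvalue()
```

The caller can catch `ExportFailed`, discard whatever files were already written, and send the failure notification instead of delivering an incomplete zip.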

@carlybatist We are going to need this week to work on some improvements.

koonchaya commented 1 week ago

@antonyharfield @grindarius Action item: find a solution for the case where the job fails.

koonchaya commented 1 week ago

Email the failure status to the user and to support@rfcx.org / Slack.

koonchaya commented 1 week ago

Draft email to notify users of a failed export:

Subject: Arbimon project export failed

Hello,

Thanks so much for using Arbimon! We encountered an issue while backing up your project '...'. Our apologies for the inconvenience. Please contact our support team at [contact@arbimon.org] for assistance.

@antonyharfield @carlybatist Can you check whether this message needs any changes?

carlybatist commented 1 week ago

@koonchaya Tech team would be getting this error notification too right? They should then immediately start looking into it as a support ticket. So I would think the email to the user should be informing them that there was an error and that our team is looking into it and will update them. Noon and I should be auto-cc'd on these emails to users too. So it would be --

Hello,

There was an issue with your project backup of '...'. Our engineering team is looking into this and will update you when we have resolved it. We apologize for the inconvenience and thank you for your patience!

All the best, Arbimon team

koonchaya commented 1 week ago

Ideally, @carlybatist and I would get the email forwarded from support@rfcx.org. I am not sure whether the engineering team will be alerted elsewhere.

carlybatist commented 5 days ago

@koonchaya @grindarius what do you expect the timeline for fixing the underlying issue will be?

koonchaya commented 4 days ago

I tested the backup export from project https://staging.arbimon.org/p/bci-panama-2018/overview. @grindarius, here is some feedback:

playlists.csv

pattern_matchings.csv

pattern_matching_rois.csv

recordings.csv

recording_validations.csv

rfm_models.csv

sites.csv

soundscapes.csv

species.csv

templates.csv

koonchaya commented 2 days ago

@grindarius

pattern_matchings.csv

pattern_matching_rois.csv

playlists.csv

rfm_models.csv

rfm_classifications_001.csv rfm_classifications_002.csv rfm_classifications_003.csv rfm_classifications_004.csv rfm_classifications_005.csv rfm_classifications_006.csv rfm_classifications_007.csv rfm_classifications_008.csv

templates.csv

grindarius commented 1 day ago

For the pattern matchings export, we read data directly from the pattern_matchings table. What you see in the UI are the pattern matching jobs joined with the jobs table. The pattern_matchings table, however, returns every row, including ones that have no related row in the jobs table. The same goes for pattern_matching_rois.
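To illustrate why the two counts can differ, here is a small sketch using an in-memory SQLite database. The column names (`pattern_matching_id`, `job_id`, `name`) and the sample rows are made up for the example; only the table names come from the explanation above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (job_id INTEGER PRIMARY KEY);
    CREATE TABLE pattern_matchings (
        pattern_matching_id INTEGER PRIMARY KEY,
        job_id INTEGER,              -- hypothetical link to jobs
        name TEXT
    );
    INSERT INTO jobs (job_id) VALUES (1), (2);
    INSERT INTO pattern_matchings (job_id, name)
    VALUES (1, 'pm-a'), (2, 'pm-b'), (99, 'pm-orphan');
    """
)


def count_pattern_matchings(conn: sqlite3.Connection) -> tuple:
    """Return (rows read directly, rows that also have a matching job)."""
    direct = conn.execute("SELECT COUNT(*) FROM pattern_matchings").fetchone()[0]
    joined = conn.execute(
        "SELECT COUNT(*) FROM pattern_matchings pm "
        "JOIN jobs j ON j.job_id = pm.job_id"
    ).fetchone()[0]
    return direct, joined


print(count_pattern_matchings(conn))  # (3, 2)
```

The direct read corresponds to the export and the joined read to what the UI shows, so a row whose job is missing appears only in the export.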

For the playlists export, I did find both playlists inside the export file, so it's all good.

For rfm models, deleted models are being exported into the file.

The same goes for rfm classifications: we did not have a condition to filter out deleted classifications.
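A minimal sketch of the kind of condition that was missing, again with an in-memory SQLite database. The `models` table layout and the `deleted` flag column are assumptions for illustration; the real schema may mark deletion differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE models (
        model_id INTEGER PRIMARY KEY,
        name TEXT,
        deleted INTEGER NOT NULL DEFAULT 0   -- hypothetical soft-delete flag
    );
    INSERT INTO models (name, deleted)
    VALUES ('model-a', 0), ('model-b', 1), ('model-c', 0);
    """
)


def exportable_models(conn: sqlite3.Connection) -> list:
    """Return only the models that have not been soft-deleted."""
    return conn.execute(
        "SELECT model_id, name FROM models WHERE deleted = 0 ORDER BY model_id"
    ).fetchall()


print(exportable_models(conn))  # [(1, 'model-a'), (3, 'model-c')]
```

The same `WHERE` condition would apply to the classifications query, so that the export matches what users see in the project.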

koonchaya commented 1 day ago

@grindarius

RatreeOchn commented 3 hours ago

Released on v1.4.2