njcuk9999 / apero-drs

A PipelinE to Reduce Observations - The DRS for SPIRou (CFHT)
MIT License
12 stars 0 forks source link

MySQL error after removing 9Tb of DEBUG* files in red/ (0.7.288) #734

Open larnoldgithub opened 9 months ago

larnoldgithub commented 9 months ago

I had only 8Tb of 37 remaining on SPIRou nas after the computing of FTTEMP2, and in order to make room on the nas before computing the CCF and building the products (which is about 6Tb, maybe more with the LBL?), I removed all DEBUGfits from the nights in red/ with rm -r red//DEBUG*fits This removed about 200,000 files and freed up about 9Tb on the nas.

As expected, at the start of the CCF etc. runs, APERO wants to update the reduced files db, but it fails and I get an error. mysql.connector.errors.OperationalError'>: 2055: Lost connection to MySQL server at 'localhost:3306', system error: The read operation timed out followed by thousands of red lines like

[...]
23:23:27.892-@@|PROC|W[10-002-00008]:       File no longer on disk  removing from file index database: /data/spirou4/apero-data/offline/red/20BQ04-Aug30/DEBUG_18F23078BF_pp_e2dsll_B.fits
23:23:27.921-@@|PROC|W[10-002-00008]:       File no longer on disk  removing from file index database: /data/spirou4/apero-data/offline/red/20BQ04-Aug30/DEBUG_18F23078BF_pp_e2dsll_A.fits
23:23:27.948-@@|PROC|W[10-002-00008]:       File no longer on disk  removing from file index database: /data/spirou4/apero-data/offline/red/20BQ04-Aug30/DEBUG_18F23078BF_pp_e2dsll_AB.fits
23:23:27.995-@@|PROC|W[10-002-00008]:       File no longer on disk  removing from file index database: /data/spirou4/apero-data/offline/red/20BQ04-Aug30/DEBUG_18F23078BF_pp_e2dsll_C.fits
00:23:44.305-!!|PROC|00:23:28.408-**|DatabaseError|E[00-002-00032] <class 'mysql.connector.errors.OperationalError'>: 2055: Lost connection to MySQL server at 'localhost:3306', system error: The read operation timed out 
00:23:44.365-!!|PROC|    Command: DELETE FROM findex_offline_db WHERE BLOCK_KIND="red" AND (OBS_DIR="21BQ17-Dec09" AND FILENAME="DEBUG_2691761o_pp_e2dsll_B.fits") OR (OBS_DIR="21BQ17-Dec09" AND FILENAME="DEBUG_2691761o_pp_e2dsll_A.fits") OR (OBS_DIR="21BQ17-Dec09" AND FILENAME="DEBUG_2691761o_pp_e2dsll_AB.fits") OR (OBS_DIR="21BQ17-Dec09" AND FILENAME="DEBUG_2691761o_pp_e2dsll_C.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663927o_pp_e2dsll_B.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663927o_pp_e2dsll_A.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663927o_pp_e2dsll_AB.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663927o_pp_e2dsll_C.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663926o_pp_e2dsll_B.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663926o_pp_e2dsll_A.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663926o_pp_e2dsll_AB.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663926o_pp_e2dsll_C.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663915o_pp_e2dsll_B.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663915o_pp_e2dsll_A.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663915o_pp_e2dsll_AB.fits") OR (OBS_DIR="21BQ11-Oct19" AND FILENAME="DEBUG_2663915o_pp_e2dsll_C.fits") OR (OBS_DIR="21BQ11-Oct13" AND FILENAME="DEBUG_2662646o_pp_e2dsll_B.fits") OR (OBS_DIR="21BQ11-Oct13" AND FILENAME="DEBUG_2662646o_pp_e2dsll_A.fits") OR (OBS_DIR="21BQ11-Oct13" AND FILENAME="DEBUG_2662646o_pp_e2dsll_AB.fits") OR [...]

I tried to reduce the number of nights, but it didn't help, APERO wants to update its db not only for the night Im asking, but for all the nights it has in its db.

MYSQP parameter 'max connection' is currently 1024 and I'm not using the parallel option during the PROC stap at the beginning.

It seems also that, while I get a lot of messages like this one: 23:23:27.667-@@|PROC|W[10-002-00008]: File no longer on disk removing from file index database: /data/spirou4/apero-data/offline/red/20BQ04-Aug30/DEBUG_2510286a_pp_e2dsll_B.fits the file isn't actually removed if an timeout error occurs later in the process. I restarted many times APERO and got the same warning message for the same files. Not sure when exactly APERO erases the files from its DB.

maybe the only solution would be to apero_reset the db ??? and hope APERO will be able to rebuild everything and complete at last the CCF and the rest of the processing ? @clairem789 @CharlesCadieux @andrescarmona

larnoldgithub commented 9 months ago

@cusher

larnoldgithub commented 9 months ago

I think I'll reset the db... and cross my fingers for a smooth rebuilt by APERO

larnoldgithub commented 9 months ago

MySQL has a pb with APERO at CFHT. even an apero_reset.py of the db fails.

Screenshot 2023-11-06 at 16 39 07
CharlesCadieux commented 9 months ago

Hello Luc, we rarely have (but it happens sometime) this MySQL timeout error in Montreal. Usually, re-launching the same APERO recipe works. I agree that this is a MySQL problem and I am not very sure what is the next thing to do. Let us know if you were able to complete the apero_reset of the database by increasing the 'max connections' parameter.Maybe it has to do with parallelization, maybe try with 1 core if you still have a timeout error?

larnoldgithub commented 9 months ago

I re-ran the reset and it worked. max-connexion has been already increased up to 768 and even 1024 recently, but I still had issues with the time-out error. The pb I have now is when apero comes to updating the red/ directory database, since I removed the DEBUG*fits, the update never ends successfully. I'm completely stuck: whatever I do, apero requires the db to be updated first and it fails, so I can't even redo the extraction to restore those DEBUG for example.

cusher commented 9 months ago

A little bit of follow-up on this: the MySQL query to delete these rows was hanging (or at least had been running for >4 hours). Checking the running queries, I was able to see that the query itself was 18MB large (i.e. 18 million characters long). I'm not sure if MySQL was just choking on parsing such a long query, or it just very slow to go through ~250000 OR conditions, but either way it definitely was not happy attempting to run that. I restarted the MySQL server and ran a manual query using LIKE 'DEBUG_% in place of all the ORs and it completed after a bit under 2 minutes.

larnoldgithub commented 7 months ago

just to update this ticket: The restart and manual query done by Chris fixed the issue.

clairem789 commented 7 months ago

I got the same issue after cleaning DEBUG files as well. Chris' fix did work for me, ie: port unload mariab-server and load (equivalent to restart) from the DB server, I then deleted DEBUG entries from the findex table. APERO seems happy with this. Thanks!!

njcuk9999 commented 6 months ago

Just to mention it is probably better to turn off DEBUG files in the first place (user_constants.ini).

DEBUG_BACKGROUND_FILE = False
DEBUG_E2DSLL_FILE = False
DEBUG_SHAPE_FILES = False
DEBUG_UNCORR_EXT_FILES = False

However currently there is a bug with this (See #740) so don't do this until this is fixed!