njcuk9999 / apero-drs

A PipelinE to Reduce Observations - The DRS for SPIRou (CFHT)
MIT License

v290 reprocessing killed #783

Open clairem789 opened 1 week ago

clairem789 commented 1 week ago

After about 6-12h of processing, I found the terminal showing "Killed: 9" and everything terminated. According to Google, this means the application received a signal... I'm not sure what to do or when this will show up again. Any clue? Maybe a memory leak? I'll start it again with an eye on the activity panel to check memory. Thanks.

clairem789 commented 1 week ago

Possibly related to multiprocessing options once more, but I do not find those options anymore. Have there been changes on this topic from v289? thanks

njcuk9999 commented 1 week ago

Yeah, these Killed: 9 messages can come from anything: a command like killall -9 python or kill -9 {pid}, a memory issue, or many other causes. It's something "outside" Python, hence there is no Python error and Python is just killed.
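A minimal sketch of why no Python traceback appears: signal 9 (SIGKILL) cannot be caught, so the interpreter never gets a chance to raise an exception. The snippet below is POSIX only, and the sleeping child command and the function name are purely illustrative. It starts a child Python process, kills it with signal 9, and shows that the parent only sees a negative return code and an empty stderr.

```python
import signal
import subprocess
import sys
import time


def kill_child_with_sigkill():
    """Start a sleeping Python child, send it SIGKILL, return (returncode, stderr)."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        stderr=subprocess.PIPE,
    )
    time.sleep(0.5)  # give the child a moment to start
    child.send_signal(signal.SIGKILL)  # same effect as `kill -9 {pid}`
    _, stderr = child.communicate()
    return child.returncode, stderr


returncode, stderr = kill_child_with_sigkill()
print("return code:", returncode)  # negative value = terminated by that signal
print("stderr:", stderr)           # nothing was written by the child
```

On Linux and macOS the kernel's out-of-memory handling can deliver the same signal, which is why a memory problem and a manual kill -9 look identical from the terminal.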

As for the multiprocessing the options are still there (they were never there by default I believe)

# Define whether to use multiprocess "pool" or "process" or use "linear" mode
#   when parallelising recipes
#   dtype=string default=process
#   options = pool, process
REPROCESS_MP_TYPE = process

# Define whether to use multiprocess "pool" or "process" or use "linear" mode
#   when validating recipes
#   dtype=string default=process
#   options = linear, pool, process, pathos
REPROCESS_MP_TYPE_VAL = process

You'll have to check your old setup but I think the REPROCESS_MP_TYPE_VAL on your machine had to be set to linear?

clairem789 commented 1 week ago

My past notes are not precise enough unfortunately. I thought that it worked with default options in the last version but I may be wrong. I restarted with pool option (val). If it fails again I'll try the linear.

larnoldgithub commented 6 days ago

I'm also moving to 290 and was wondering if REPROCESS_MP_TYPE_VAL should be set to 'linear', as it's 'process' by default. 'process' seems to work for the mini data set, but will it work for the full set of data? I'll set the keyword to linear.

larnoldgithub commented 6 days ago

I had set it to linear since the beginning with the 288, as it failed with 'process'. What does 'pool' do?

clairem789 commented 6 days ago

Status here: I tried all three options and had a memory leak with all of them, even with REPROCESS_MP_TYPE_VAL.value = 'linear' (in apero-drs/apero/core/instruments/spirou/default_constants.py). Unfortunately I don't know what to try next.

larnoldgithub commented 6 days ago

I launched apero_precheck after a fresh installation of the 290 and I can see the memory usage linearly increasing while APERO is updating the index db.

[Screenshot 2024-10-03 at 19 41 05]

clairem789 commented 6 days ago

@Luc, this is with any of the options for REPROCESS_MP_TYPE_VAL.value?

larnoldgithub commented 6 days ago

linear. But the memory increase above happens before any processing: it's during the index db update with apero_precheck. Is it an issue with MySQL? Claire, when did your crash occur: during preprocessing?
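One generic way to narrow down whether growth like this comes from Python objects (rather than from something outside the interpreter, such as a database server) is the standard-library tracemalloc module, which only tracks memory allocated by Python. The sketch below is not APERO code: leaky_step is a hypothetical stand-in for any step that keeps references alive, and measure_growth is an illustrative helper.

```python
import tracemalloc


def leaky_step(store, n=10_000):
    """Hypothetical stand-in for a step that keeps references alive."""
    store.extend(str(i) * 10 for i in range(n))


def measure_growth(step, repeats=3):
    """Return Python-level memory growth (bytes) and the top growing lines."""
    store = []
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(repeats):
        step(store)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Differences are grouped by source line, largest changes first
    stats = after.compare_to(before, "lineno")
    return sum(stat.size_diff for stat in stats), stats[:3]


growth, top = measure_growth(leaky_step)
print(f"Python-level growth: {growth / 1e6:.1f} MB")
for stat in top:
    print(stat)
```

If the system monitor shows memory climbing while a comparison like this reports little growth, the increase is probably happening outside Python's own allocations.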

clairem789 commented 6 days ago

My crashes come at the validation process like you. It was the case before (v284) with the "process" option but I thought it had been fixed at the 288 version.

njcuk9999 commented 6 days ago

I had set it to linear since the beginning with the 288, as it failed with 'process'. What does 'pool' do ?

process and pool are just two different ways of multiprocessing: https://stackoverflow.com/questions/18176178/python-multiprocessing-process-or-pool-for-what-i-am-doing
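As a rough illustration of the three modes named in the config comments, here is a minimal standard-library sketch. It is not how APERO implements them: the function names are hypothetical, and the "fork" start method is an assumption that restricts the example to POSIX systems. "linear" runs jobs one after another in a single process, "pool" hands jobs to a fixed set of worker processes, and "process" starts one process per job.

```python
import multiprocessing as mp


def square(x):
    """Toy job standing in for one recipe run."""
    return x * x


def _worker(index, x, queue):
    # Each process reports its result together with its position
    queue.put((index, square(x)))


def run_linear(items):
    """One job after another, in the current process."""
    return [square(x) for x in items]


def run_pool(items, cores=2):
    """A fixed number of worker processes share all jobs."""
    ctx = mp.get_context("fork")  # assumption: POSIX system
    with ctx.Pool(processes=cores) as pool:
        return pool.map(square, items)


def run_process(items):
    """One dedicated process per job."""
    ctx = mp.get_context("fork")  # assumption: POSIX system
    queue = ctx.Queue()
    procs = [
        ctx.Process(target=_worker, args=(i, x, queue))
        for i, x in enumerate(items)
    ]
    for proc in procs:
        proc.start()
    # Drain the queue before joining so no child blocks on a full pipe
    results = dict(queue.get() for _ in procs)
    for proc in procs:
        proc.join()
    return [results[i] for i in range(len(items))]


if __name__ == "__main__":
    items = [1, 2, 3, 4]
    print(run_linear(items), run_pool(items), run_process(items))
```

All three return the same results; they differ in how many processes exist at once, which is why the linear mode is slower.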

Possibly related to multiprocessing options once more, but I do not find those options anymore. Have there been changes on this topic from v289?

There have been no changes related to this, but it's a complex web (one which I'm definitely simplifying for v0.8).

My crashes come at the validation process like you. It was the case before (v284) with the "process" option but I thought it had been fixed at the 288 version.

This was never "fixed" as I still never got to the bottom of what was causing it - the "linear" option used to fix it for you so that seemed "enough" to have that (slower) option.

@larnoldgithub and @clairem789, can you both try with v0.7.289 and v0.7.288 and verify that the problem comes from v0.7.290? I'll have to go through all changes line-by-line to see what changed that could possibly affect it.

clairem789 commented 6 days ago

So I guess that the issue is not seen on UdM machines...? I could try this test with 288 by doing git checkout v0.7.288-stable-test, replacing process with linear in the default_constants for VAL, processing the complete run, and checking the memory usage during the first hours. Will keep you informed!

njcuk9999 commented 6 days ago

You never should be changing the default_constants!!

Please use the user_config.ini and user_constants.ini files....

You can read the default_constants and default_config to look for constants to change, but all changes should be added to the user_config.ini or user_constants.ini file (in your setup directory). The values in user_xxx.ini will always overwrite default_xxx.py, and you won't be able to change branch if you modify the Python files, so please don't do that!

i.e. add the following to user_constants.ini

# Define whether to use multiprocess "pool" or "process" or use "linear" mode
#   when parallelising recipes
#   dtype=string default=process
#   options = pool, process
REPROCESS_MP_TYPE = process

# Define whether to use multiprocess "pool" or "process" or use "linear" mode
#   when validating recipes
#   dtype=string default=process
#   options = linear, pool, process, pathos
REPROCESS_MP_TYPE_VAL = linear

njcuk9999 commented 6 days ago

So I guess that the issue is not seen on UdM machines...?

I haven't seen any such issues with NIRPS or SPIRou - though both machines have 300+GB of RAM.

NIRPS is only doing daily processing and I haven't done a large run for either NIRPS or SPIRou.

clairem789 commented 6 days ago

OK, sorry for my mistake in changing the wrong file. I'm doing this 2-3 times a year, not enough to remember all small details. It would be nice if it could explain it all, actually! The NW machine also has >300Gb but that's not enough. So at UdeM you haven't run the 290 version on the whole Spirou data set then?

njcuk9999 commented 6 days ago

Not a full reduction - the last runs have been done with v0.7.290 (though this is not as recommended as doing a full re-run) but again processing a single run may not show this issue as badly as redoing everything.

larnoldgithub commented 6 days ago

OK, sorry for my mistake in changing the wrong file. I'm doing this 2-3 times a year, not enough to remember all small details. It would be nice if it could explain it all, actually! The NW machine also has >300Gb but that's not enough. So at UdeM you haven't run the 290 version on the whole Spirou data set then?

I made the same mistake in the past... You should have somewhere a folder like .../config/myprofile/ where myprofile is the name of your 'installation' of apero, like offline290. In this folder you have a bunch of files:

database.yaml
install.sh
install.yaml
offline290.bash.setup
offline290.sh.setup
offline290.zsh.setup
user_config.ini
user_config.ini.org
user_constants.ini
user_constants.ini.org

The *.org files are copies of the originals that I made with cp, just in case.

The way apero works is that it first reads the default and then updates the values with the user values, then starts the processing.
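The "defaults first, then user values on top" order described above can be sketched in a few lines. This is not APERO's actual loader: parse_ini_text and merge_constants are hypothetical names, and only the two constant names and their values come from this thread.

```python
def parse_ini_text(text):
    """Parse simple 'KEY = value' lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def merge_constants(defaults, user_text):
    """Start from the defaults, then let user values overwrite them."""
    merged = dict(defaults)
    merged.update(parse_ini_text(user_text))
    return merged


# Stand-in for the values shipped in the default files
defaults = {
    "REPROCESS_MP_TYPE": "process",
    "REPROCESS_MP_TYPE_VAL": "process",
}

# Stand-in for the contents of a user_constants.ini file
user_text = """
#   when validating recipes
REPROCESS_MP_TYPE_VAL = linear
"""

constants = merge_constants(defaults, user_text)
print(constants)
```

Because the user file is applied last, the shipped defaults never need to be edited, and the repository stays clean for switching branches.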

larnoldgithub commented 6 days ago

I did run the 288 with 'process' months ago and it crashed. I set it to 'linear' and it has been very stable since, regarding the PROC at least. I didn't see any memory leak.

For my apero_precheck last night with the 290, the memory usage increased linearly during the db update, then came back to the 'background level' of the machine. @njcuk9999 do you think this is expected behavior? apero_precheck ended with no error.

[Screenshot 2024-10-04 at 07 53 20]

I'll not be able to make a test with the 289 before next week.

clairem789 commented 2 days ago

So to close this issue, it was due to my mistake of modifying the REPROCESS_MP_TYPE_VAL option value in the wrong file (default instead of user's files). Incidentally I confirm that the linear option for this parameter is the one working for NewWorlds machine. Sorry for this!

njcuk9999 commented 2 days ago

Thanks for clearing this up, I'd rather this than trying to figure out what caused it to break in newer versions and not older ones!