`ModelSelector()` fails silently and does not write to `Models/`

ketozhang commented 2 days ago

In some microlensing light curves, I am getting RTModel silently failing despite all stdout printing success.

Here an example in PSPL:

rtm = RTModel("myevent")
rtm.Reader()
rtm.InitCond()
rtm.launch_fits("PS")
rtm.ModelSelector("PS")

The Models/ directory is missing

Data/
ini/
InitCond/
PreModels/
FilterToData.txt
LCToFit.txt
spline.txt

The logs reports everything is success

00:07:37.110  *********************
00:07:37.110  ****   RTModel   ****
00:07:37.110  *********************
00:07:37.110  Event name: /redacted/path/to/myevent/
00:07:37.110  Number of processors: 4
00:07:37.110  - Launching: Reader
00:07:37.110    Pre-processing data...
00:07:37.110    OK
00:07:37.110  - Launching: InitCond
00:07:37.110    Setting initial conditions...
00:07:37.110  Peaks:  2459631.0000  2459628.0000  
00:07:37.110    OK
00:07:37.110  - Single-lens-Single-source fits
00:07:37.110  
00:07:37.110  Fits completed:   0%|[32m          [0m| 0/120 [00:00<?, ?it/s]
00:07:37.110  Fits completed:   2%|[32m▎         [0m| 3/120 [00:00<00:16,  7.18it/s]
00:07:37.110  Fits completed:   5%|[32m▌         [0m| 6/120 [00:00<00:14,  7.75it/s]
00:07:37.110  Fits completed:   8%|[32m▊         [0m| 10/120 [00:01<00:12,  8.86it/s]
00:07:37.110  Fits completed:  12%|[32m█▏        [0m| 14/120 [00:01<00:12,  8.83it/s]
00:07:37.110  Fits completed:  15%|[32m█▌        [0m| 18/120 [00:02<00:14,  7.05it/s]
00:07:37.110  Fits completed:  18%|[32m█▊        [0m| 22/120 [00:03<00:14,  6.99it/s]
00:07:37.110  Fits completed:  22%|[32m██▎       [0m| 27/120 [00:03<00:12,  7.63it/s]
00:07:37.110  Fits completed:  28%|[32m██▊       [0m| 33/120 [00:04<00:11,  7.64it/s]
00:07:37.110  Fits completed:  33%|[32m███▎      [0m| 40/120 [00:05<00:11,  7.23it/s]
00:07:37.110  Fits completed:  39%|[32m███▉      [0m| 47/120 [00:06<00:09,  7.66it/s]
00:07:37.110  Fits completed:  46%|[32m████▌     [0m| 55/120 [00:07<00:08,  7.86it/s]
00:07:37.110  Fits completed:  52%|[32m█████▎    [0m| 63/120 [00:08<00:07,  7.84it/s]
00:07:37.110  Fits completed:  59%|[32m█████▉    [0m| 71/120 [00:08<00:06,  7.99it/s]
00:07:37.110  Fits completed:  66%|[32m██████▌   [0m| 79/120 [00:10<00:05,  7.35it/s]
00:07:37.110  Fits completed:  72%|[32m███████▎  [0m| 87/120 [00:11<00:04,  7.37it/s]
00:07:37.110  Fits completed:  79%|[32m███████▉  [0m| 95/120 [00:13<00:03,  7.14it/s]
00:07:37.110  Fits completed:  86%|[32m████████▌ [0m| 103/120 [00:14<00:02,  6.88it/s]
00:07:37.110  Fits completed:  92%|[32m█████████▎[0m| 111/120 [00:16<00:01,  6.91it/s]
00:07:37.110  Fits completed:  99%|[32m█████████▉[0m| 119/120 [00:17<00:00,  6.68it/s]
00:07:37.110  Fits completed: 100%|[32m██████████[0m| 120/120 [00:18<00:00,  6.55it/s]
00:07:37.110  - Selecting models for Single-lens-Single-source fits
00:07:37.110    OK

ketozhang commented 2 days ago

Not addressing the issue itself, but this has been quite difficult to debug due to lacking error handling in RTModel. Here I have to understand what ModelSelector() is writing to a new directory Models/, so an actual success is only indicated by the existence of files in this directory.

A lot of the code in RTModel Python looks like this

https://github.com/valboz/RTModel/blob/08a8f8c06a680cee6f9a293937f5e824cca645f2/RTModel/RTModel.py#L266-L269

Notable issues are:

When C++ executable returns non-zero, an error is printed out. If instead, we raise an Exception here, we get the benfit of both notifying and giving control of error handling to the user.
stdout=subprocess.DEVNULL further hides any useful information from the C++ executable. Changing this to capture_output=True and passing stdout and stderr to python logging module would allow users more control over its verbosity.

Example of useful logs in ModelSelector not printed out by default for users

``` ********************* **** RTModel **** ********************* Event name: /tmp/pytest-of-root/pytest-6/test_levmar_solution_shape_PS_0 Number of processors: 4 >>> rtm.ModelSelector("PS") - Selecting models for Single-lens-Single-source fits ****************************************** ********** Model selector ********* ****************************************** This program finds the best independent models of a given class - Event: /tmp/pytest-of-root/pytest-6/test_levmar_solution_shape_PS_0 - Reading options in ModelSelector.ini Number of sigmas used to declare overlap between two models: 3.000000 Number of sigmas in chi square distribution for model acceptance: 1.000000 Maximum number of models reported: 10 - Model code: PS Reading data Reading models ! No models for this class OK ```

I'm happy to contribute error handling code.

valboz commented 2 days ago

Hi @ketozhang, you are right I did not spend too much time to make the output from the C++ code available to the user, but it is indeed very useful to track simple errors. You are welcome to change the error message to a regular exception, since I am currently busy with the improvement of the efficiency.

In your specific case, it seems that no fits were actually executed. Are your data in the correct format?

ketozhang commented 1 day ago

Hi @valboz,

From what I can tell in the logs and output directory, I believe the fit did ran. There are subfolders under PreModels/.

I turned on the stdout of ModelSelector and this is what it returns:

******************************************
**********     Model selector    *********
******************************************

This program finds the best independent models of a given class

- Event: /tmp/pytest-of-root/pytest-6/test_levmar_solution_shape_PS_0

- Reading options in ModelSelector.ini
Number of sigmas used to declare overlap between two models: 3.000000
Number of sigmas in chi square distribution for model acceptance: 1.000000
Maximum number of models reported: 10
- Model code: PS

Reading data

Reading models

! No models for this class  OK

valboz commented 1 day ago

Subfolders with files inside? Can you send me one of these files, e.g. "PreModels/PS0000/0.txt"?

ketozhang commented 1 day ago

Here is one of those files 0.txt

...and I attached the entire RTModel results directory results.tar.gz

valboz commented 1 day ago

Great! I see that your LCToFit.txt has unreasonably small errors. It seems that for some unknown reason the re-normalization of the error bars failed in Reader. However, I have run RTModel on your original data and I obtain a perfectly fine LCToFit.txt. Can you remove all files and try again and tell me if you obtain exactly the same LCToFit.txt?

On a minor point, you should use HJD-2450000 to allow for more digits in the time stamps to be recorded in double precision.

ketozhang commented 1 day ago

Good catch. I also notice that the first row is correct then followed by bad rows:

0 2.4595805000e+06 1.7481053421e-09 1.4973600043e-13 0
0 2.4595805901e+06 1.7485240090e-09 7.7283465933e-94 0
...

I have experimented with what you suggested. I attempted with multiple copies of the results, remove all files but the Data directory, then run Reader. This consistently return a good LCToFit.txt file for the ~100 attempts I did.

If I go back to my script and run this in the suite of data I have. I get non-deterministic failures. Every failure has bad LCToFit.txt similar to the example above.

HJD-2450000 Noted. Thank you!

valboz commented 12 hours ago

If I go back to my script and run this in the suite of data I have. I get non-deterministic failures. Every failure has bad LCToFit.txt similar to the example above.

That's very strange and I am currently unable to reproduce this bug. However, since this seems to depend on error-bar renormalization, you may try to switch off renormalization by rtm.config_Reader(renormalize = 0)

valboz / RTModel

`ModelSelector()` fails silently and does not write to `Models/` #14