nespinoza / juliet

A versatile modelling tool for transiting and non-transiting (single and multiple) exoplanetary systems
MIT License

Excessive loading times for short-cadence curves #120

Open LucaNap opened 4 months ago

LucaNap commented 4 months ago

Hi everyone. When trying to fit 20-second TESS data, the "loading times" (before the fit actually starts) are very long for me (30-90 minutes?). I found this to be the case with multiple targets, even when there are only two sectors. Is this happening only to me? I am currently using juliet version 2.2.4 on Windows.

LucaNap commented 4 months ago

Ok, I admit the solution might be as simple as remembering to remove all data points with flux = 0; juliet starts much quicker now that I have done so. In any case, there should be a warning somewhere when something in the light curve (such as null fluxes) makes the loading time this long.
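For anyone hitting the same thing, a minimal sketch of the kind of pre-filtering that helped (the file name and loading step are placeholders, not part of juliet's API):

import numpy as np

# Placeholder input: times, fluxes and flux errors for one instrument.
t, f, ferr = np.loadtxt('tess_20s_lightcurve.dat', unpack=True)

# Drop null and non-finite fluxes before handing the arrays to juliet:
mask = np.isfinite(f) & np.isfinite(ferr) & (f != 0.0)
t, f, ferr = t[mask], f[mask], ferr[mask]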

Jayshil commented 1 day ago

Hi @LucaNap (cc @nespinoza),

I think this may be a real issue with juliet, and it likely happens because of the large number of data points. One TESS sector contains roughly 50,000 data points (or more) at 20-second cadence, and with multiple sectors this number grows rapidly.

juliet internally builds big arrays (times, fluxes, errors, and so on) that gather the data points from all instruments into single arrays. It currently uses two for loops to build these arrays, and that seems to be what takes so long. In particular, I think the second for loop at Line 314 is the problem: it loops over every data point and appends it, one at a time, to one big array. When the total number of data points is large, this loop takes a ridiculously long time.
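Schematically, the per-point pattern looks like this (a simplified sketch, not juliet's exact code; variable names follow the snippet below):

all_times = np.array([])
for instrument in instruments:
    for i in range(len(t[instrument])):
        # Each np.append allocates and copies a new array, so the total
        # cost grows quadratically with the number of data points.
        all_times = np.append(all_times, t[instrument][i])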

@nespinoza: this should be fairly easy to fix. The way I see it, we can simply remove this second for loop at Line 314 and replace it with an np.hstack call, like the following (the code below starts from Line 313):

# One hstack per instrument instead of one append per data point:
for instrument in instruments:
    all_times = np.hstack((all_times, t[instrument]))
    all_y = np.hstack((all_y, y[instrument]))
    all_yerr = np.hstack((all_yerr, yerr[instrument]))
    # Instrument label repeated once per data point of that instrument:
    all_instruments = np.hstack((all_instruments, np.repeat(instrument, len(t[instrument]))))

This is much faster than the per-point for loop. The difference is negligible when the total number of data points is small, but it becomes substantial as that number grows. I am not sure whether there were particular reasons for using the for loops, but I'd suggest using the code above. Let me know what you think.
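For what it's worth, an equivalent version with a single concatenation per array (same variable names, also just a sketch) avoids re-copying the accumulated arrays on every iteration:

# Build each big array in one np.concatenate call over all instruments:
all_times = np.concatenate([t[ins] for ins in instruments])
all_y = np.concatenate([y[ins] for ins in instruments])
all_yerr = np.concatenate([yerr[ins] for ins in instruments])
all_instruments = np.concatenate([np.repeat(ins, len(t[ins])) for ins in instruments])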

Cheers, Jayshil