**twVolc** opened this issue 1 month ago
Removing duplicates with

```python
self.calibration_series = self.calibration_series[~self.calibration_series.index.duplicated(keep='first')]
```

as the final line of `PyplisWorker.load_cal_series()`
has worked, but I won't add the fix and close this issue immediately as it may be worthwhile briefly looking into why these duplicates are appearing in the first place - are they happening regularly in the latest dev?
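A toy sketch of what that de-duplication line does (values and timestamps made up for illustration):

```python
import pandas as pd

# Made-up series with a duplicated timestamp, mimicking the calibration index
s = pd.Series(
    [1.0, 2.0, 2.0, 3.0],
    index=pd.DatetimeIndex([
        "2023-04-20 16:00:00",
        "2023-04-20 16:00:20",
        "2023-04-20 16:00:20",  # duplicate time with an identical value
        "2023-04-20 16:00:40",
    ]),
)
deduped = s[~s.index.duplicated(keep='first')]  # keep the first of each duplicate
print(deduped.index.is_unique)  # True
```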
Was this with post processing or RTP?
Post processing
Okay, I'll see if I can replicate it and get to the bottom of it. I suspect it is related to some other issues too.
Brilliant, thanks! I have a draft PR #238 that fixes it by checking for duplicates, following the earlier comment. But as I say, it might be nice to prevent this happening in the first place if possible. This PR might be useful as a final fail-safe though as it makes sure things can be loaded.
Like I say, I've no idea whether this is a niche issue or is regularly occurring in current dev - I didn't test extensively. I'm not sure what aspect of the dataset (or what I did wrong during processing) would cause this.
I've not been able to replicate it with a quick run of the Lascar data. I'll try a longer run and see if it crops up. In the meantime could you see if I have access to the X: drive so I can take a look at the data if needs be?
Also could you post the config used when generating the full calibration? Just so I can make sure that I've not got something switched on/off that may be causing this?
For some reason GitHub doesn't let you attach .yml files... so here it is as .txt: process_config.txt
Two things jump out of the config:

1. `plot_iter` is switched on, and I tend to leave it off. Not sure how that would affect things but I'll see if it makes a difference.
2. `fix_fov` set to false, which, if I remember correctly, means that it does the search to find the most likely point in the image that the spectrometer may be pointing at. I think I've just left that as true for all my testing, so there may be issues there that I haven't spotted before. Not sure exactly how, but it is more directly relevant than 1.

No joy with any of the above so I had a look at the data. Current working theory: it looks like it's caused by missing spectra time points.
The data I use is complete so I wouldn't have come across this before, but it's easy to simulate and is a good thing to test in future.
Ah good spot! So if there isn't a spectrum for a specific image-pair the interpolation method is duplicating the results somehow?
No, it's because nothing is added to the calibration for the missing time point, but the iterative saving will still add the most recent row from the calibration array: the times and values from the previous time point.
Just need to add something to skip the iterative saving for calibration when a spectrum time point is missing.
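If it helps to picture the mechanism, a minimal hypothetical reproduction (not the actual saving code):

```python
import pandas as pd

# When the spectrum for a time point is missing, nothing new is appended to
# the calibration buffer, but the iterative save still writes out the latest
# row, so the previous timestamp ends up in the file twice.
cal = pd.DataFrame({"coeff_0": [1.1]},
                   index=pd.DatetimeIndex(["2023-04-20 16:14:00"]))
# spectrum for 16:14:20 missing -> no new row, yet the save re-emits the last row
cal = pd.concat([cal, cal.iloc[[-1]]])
print(cal.index.is_unique)  # False: get_indexer(method='nearest') will now raise
```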
Ah yes that makes sense. I guess the calibration value for that time point should still be saved - i.e. iterative saving should still save something, even if it's a duplicate of the previous value, just with the new time associated with it. This lets the user know what value was used to calibrate that image pair, so that a preloaded calibration would have a calibration value for each image pair. Does that make sense?
Okay, so you'd expect something that looks like this:
Yes that's perfect. As long as I'm right in thinking that is how those data points are calibrated - with the coefficients of the previous data point? If so, yes I think this is the best way to deal with things.
I did a cursory check on Wed to see if this was the case and I think it is, but I'll double check today. I assume calibrating with the last known calibration coefficients is the desired behaviour, rather than dropping the time point entirely?
Yep that's what we'd want. Calibration is based on so many prior points anyway, to generate the regression, so it wouldn't make sense to drop a data point just because we don't have a specific spectrum for that time.
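A minimal sketch of that agreed behaviour, assuming a pandas DataFrame and hypothetical column names:

```python
import pandas as pd
import numpy as np

# Re-use the previous coefficients for an image pair with no spectrum, but
# stamp the row with the new acquisition time and blank the spectrum-derived
# column (column names here are made up for illustration).
cal = pd.DataFrame(
    {"coeff_0": [1.1], "coeff_1": [0.2], "column_density": [350.0]},
    index=pd.DatetimeIndex(["2023-04-20 16:14:00"]),
)
missing_time = pd.Timestamp("2023-04-20 16:14:20")  # image pair, no spectrum
new_row = cal.iloc[-1].copy()
new_row["column_density"] = np.nan  # no spectrum -> no column density
cal.loc[missing_time] = new_row
print(cal)
```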
Ignore this, just seen in the code that if you have `cal_type_int` set to Preloaded and the calibration series loaded, then it doesn't need the DOAS results.
~Tangentially related to this: Should loading a calibration file also load the DOAS results? Currently it doesn't, but the data is contained within the calibration file; I'm not sure if it's even required if you have the calibration data available.~
Yep you're right. No need for DOAS results if you have the calibration file.
One more thing: when using Preloaded calibration with a missing timepoint, currently it will use the closest value to the missing one, but in the case of ties it will go for the larger index. I.e. if there is an image pair for 16:00:40 without calibration coefficients and there are existing values at 16:00:35 and 16:00:45, currently the code will choose the values at 16:00:45. I'm guessing this is an unintended effect, as when it does the calibration without the preloaded file it will use the last available set of coefficients.
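For reference, the tie-breaking can be reproduced directly in pandas (times hypothetical, matching the example above); `method='pad'` would always take the previous point instead:

```python
import pandas as pd

cal_index = pd.DatetimeIndex(["2023-04-20 16:00:35", "2023-04-20 16:00:45"])
target = pd.DatetimeIndex(["2023-04-20 16:00:40"])  # equidistant from both

print(cal_index.get_indexer(target, method="nearest"))  # [1]: later point wins the tie
print(cal_index.get_indexer(target, method="pad"))      # [0]: always the previous point
```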
Yeh I guess taking the early index would be more logical, but I'm not sure there could really ever be a case where this happens unless someone is doing something very odd with the data? To generate the calibration file you need to process the data, so that should contain all of the data points. To achieve the above you'd have to process the data with some files missing, then insert those files and process again? Not something to worry about really I don't think, but if the fix is incredibly obvious this could be updated.
> To generate the calibration file you need to process the data, so that should contain all of the data points
Prior to the iterative saving changes I think the calibration file could contain missing data points (if there is no associated spectrum for that time point), so this could affect calibration files generated by older versions of the software.
Also, in the current code, if there is a spectrum time point missing before all the data is ready to save then it will just skip that timepoint entirely, and when you load that saved calibration it will use the following timepoint rather than the previous one to calibrate the image.
Next question: what should it do if there is a missing spectrum value before the calibration has been run? Or does it not matter, as these timepoints are discarded when the full config is loaded anyway?
> Next question: what should it do if there is a missing spectrum value before the calibration has been run? Or does it not matter, as these timepoints are discarded when the full config is loaded anyway?
I think I'm not 100% clear on what this means sorry. Maybe we can make a note to discuss in the meeting? I'm guessing you mean something different to the below?
> Okay, so you'd expect something that looks like this:
I suppose the overarching questions are: Should there be a time point in the calibration file for every image pair processed? Is this regardless of whether the spectrum for that time point is available?
Yes, if there's an image pair then we should be able to process it, regardless of whether there's a spectrum, as we calibrate with a span of data, not just with a spectrum from a specific time. So every image pair should be processed and have the coeffs used to calibrate that image saved, along with the MSE and r-squared for the set of calibration data that generated those coefficients. The optical depth and column density columns will be blank if there's no spectrum, just like the screenshot you linked.
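For concreteness, the agreed layout might look something like this (column names here are illustrative, not the actual file headers; 16:14:20 is an image pair with no spectrum):

| time | coeff_0 | coeff_1 | MSE | r_squared | optical_depth | column_density |
|------|---------|---------|-----|-----------|---------------|----------------|
| 16:14:00 | 1.10 | 0.20 | 0.003 | 0.97 | 0.45 | 350.2 |
| 16:14:20 | 1.10 | 0.20 | 0.003 | 0.97 | | |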
Okay, that makes sense. Currently in the full calibration we include the time points prior to the initial calibration (e.g. the first 10 mins of data) for each image, with blank coeffs, MSE, and R-squared. But this will only be included if the spectrum for that time point is available; if the spectrum is missing, should I keep it as is or should there be a blank time point there? e.g. [screenshot with the row skipped] vs [screenshot with a blank row]

Or does it not matter?
> The optical depth and column density columns will be blank
Another question that just occurred to me: should optical depth be blank? Is it calibrated, and so relies on the column density and therefore must be blank? Or is it the raw value, and so may be useful to record in the data?
> Another question that just occurred to me: should optical depth be blank? Is it calibrated, and so relies on the column density and therefore must be blank? Or is it the raw value, and so may be useful to record in the data?
Optical depth could always be present, but it's not critical as it doesn't give a huge amount of information without an associated spectrometer column density. Leaving it blank might be clearer.
> But this will only be included if the spectrum for that time point is available; if the spectrum is missing, should I keep it as is or should there be a blank time point there? e.g.
Adding the empty data point might make more sense: it then makes it clear the range of times (and so emission rates) that are available in that sequence.

I can't remember if we discussed this before or not, but it might even be useful to go back through all of the time points that aren't calibrated and attach the calibration they end up with. I.e. once enough data has been gathered to calibrate, the software loops back through the buffered data and calibrates it, so that we get an emission rate from each time point, so saving these values in the coeff etc. columns might make sense? I have a feeling we've discussed this before and maybe came to the opposite conclusion, so feel free to disagree.

This might be a bit more complicated with the iterative saving, as those rows would initially be saved without the calibration columns; I wonder if that's why we decided not to save the calibration values? It's not something that is worth sinking time into, as worst case scenario I can manually copy up the earliest calibration values to the rest of the data, or even without this I think the software would take the nearest calibration point for a preloaded calibration run, so would automatically find that data point anyway, even if there's no calibration data in the early rows.
Okay, got it. I'll work on that today.
> once enough data has been gathered to calibrate, the software loops back through the buffered data and calibrates it, so that we get an emission rate from each time point, so saving these values in the coeff etc. columns
I think we did discuss it but didn't come to a conclusion, and so just left it blank for the sake of ease.
> This might be a bit more complicated with the iterative saving, as those rows would initially be saved without the calibration columns
It should wait until the initial calibration is done before it starts saving iteratively, so it should be possible to add the first calibration values to the calibration time series in memory before the file is first saved.
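A minimal sketch of that back-filling, assuming the in-memory series is a pandas DataFrame (column names hypothetical):

```python
import pandas as pd

# Rows buffered before the initial calibration have blank coefficients. Once
# the first coefficients exist, back-fill them onto the earlier rows in
# memory, then perform the first iterative save.
cal = pd.DataFrame(
    {"coeff_0": [float("nan"), float("nan"), 1.1],
     "coeff_1": [float("nan"), float("nan"), 0.2]},
    index=pd.DatetimeIndex(
        ["2023-04-20 16:00:00", "2023-04-20 16:00:20", "2023-04-20 16:10:00"]
    ),
)
cal = cal.bfill()  # first calibration values propagate back to buffered rows
print(cal)
```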
> It's not something that is worth sinking time into, as worst case scenario I can manually copy up the earliest calibration values to the rest of the data, or even without this I think the software would take the nearest calibration point for a preloaded calibration run, so would automatically find that data point anyway, even if there's no calibration data in the early rows.
I'll need to restructure things a little anyway, so I'll see if it fits easily, but if not I'll leave it.
Does this file look okay to you? There are 3 missing spectra points: 16:14:20, 16:17:20, and 16:39:40
Yep that looks good to me!
I just want to make sure it's reasonably consistent for times when the FOV is not fixed, and for RTP, then I'll put in a PR for it.
One thing I forgot to mention in our chat earlier: if you have `remove_doas_mins` set to a smaller value than `doas_fov_recal_mins` (e.g. `remove_doas_mins = 10`, `doas_fov_recal_mins = 30`) and do a FOV search, then only the 10 mins of data before the FOV search will be used for the calibration (the first 20 mins are effectively discarded).

Does that seem right to you? Not sure if you'd ever use that situation in practice.
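For reference, the relevant part of the process config in that scenario would look like the excerpt below (key names from the discussion above; the comments are my reading of them, not checked against the schema):

```yaml
remove_doas_mins: 10       # only the most recent 10 min of DOAS data kept for calibration
doas_fov_recal_mins: 30    # FOV search/recalibration uses a 30 min window
```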
It's probably a situation we wouldn't ever come across, like you say, although it sounds like the way it handles this is actually the way you might expect, so that seems good to me. I.e. you've requested a 30 minute time window to find out where your FOV is, but then only want a 10 minute time window to actually calibrate the optical depths. Again, very unlikely to ever do that, but I think it's correctly handled here.
I'm not quite sure how this has happened, whether it's unique to the data I'm working with or whether this is a new bug caused by some code update. My `full_calibration.csv` file for the latest bit of processing has lots of duplicate times in it, which leads to an error when trying to load the calibration. It hits the error at line 2431 (in dev) of `so2_camera_processor.py`:

```python
closest_index = self.calibration_series.index.get_indexer([self.img_A.meta['start_acq']], method='nearest')
```

This line throws an error, which means the file can't be loaded at all.
Diving into this method I can see that it requires indexes to be unique (it checks `is_unique`), which isn't the case in this latest calibration.

The data I'm processing can be found on the X drive: Reventador/2023/2023-04-20/Seq_2. As far as I can see there's nothing strange about the data itself.

Attached: full_calibration.csv
From what I can see, the duplicate times only start appearing after the DOAS FOV calibration has taken place (after 30 minutes in this case). I wonder if it's something to do with the new iterative way of saving data if that occurs here? All duplicate times appear to have identical values for all columns.
Perhaps the easiest solution is to put a check in that removes duplicate indices in the `calibration_series` when loading, although it would be nice to know why this is happening.