kaylai / VESIcal

A generalized python library for calculating and plotting various things related to mixed volatile (H2O-CO2) solubility in silicate melts.
MIT License
26 stars 9 forks source link

Handle some samples being outside of calibration range for batch processing #46

Open kaylai opened 4 years ago

kaylai commented 4 years ago

If some samples in an ExcelFile object are outside of the calibration range for a particular model passed, how do we deal with this? One option is to return NaN and print something saying "Hey, by the way, we didn't calculate this sample because it's outside of the calibration range. If you really want to calculate it, call it as a single-sample".

Penny, I'm "assigning" this to you simply for visibility. We have been assigning each other to issues when we feel the other's input would be useful! --Kayla

PennyWieser commented 4 years ago

I'm not sure how I feel about returning a NaN and making the user continue to plug them in manually - if say they are working in a system which it outside all compositional ranges, but still want to start comparing the different models (e.g., my officemate uses decompress for exoplanets, i think with that logic it would always return Nans) - Also, as simon said in an email, I wouldn't want to annoy the original authors if they feel we've been a bit stingy with the ranges. For each model output, could we have a separate column with text "warnings" - see image below Outofrange

kaylai commented 4 years ago

I really like the idea of putting warnings into the returned data.

We do need to handle this for some cases where we simply don't return a result. For example, if the code errors on a sample for some reason, it will just break and won't calculate anything. I'd rather it just skip that sample and move on.

Simon, in particular I think that in Dixon, root_saturation_pressure was calling calculate_dissolved_volatiles and was passing a negative pressure. I got an exception:

if pressure < 0: raise InputError("pressure must be a positive value.")

Perhaps this is a unique case, and we can simply alter this exception to have different behavior? I saw this error when passing samples well outside of the Dixon calibration range. The values would have been completely meaningless (e.g., negative pressures).

simonwmatthews commented 4 years ago

I really like the idea of putting warnings into the returned data.

We do need to handle this for some cases where we simply don't return a result. For example, if the code errors on a sample for some reason, it will just break and won't calculate anything. I'd rather it just skip that sample and move on.

Simon, in particular I think that in Dixon, root_saturation_pressure was calling calculate_dissolved_volatiles and was passing a negative pressure. I got an exception:

if pressure < 0: raise InputError("pressure must be a positive value.")

Perhaps this is a unique case, and we can simply alter this exception to have different behavior? I saw this error when passing samples well outside of the Dixon calibration range. The values would have been completely meaningless (e.g., negative pressures).

There's an error in my implementation of Dixon Water which is leading to this particular error, and so that will stop being a problem as soon as I've fixed it (hopefully by tomorrow!). I think perhaps we can use "try:" and "except:" in the batch calculations to prevent a single error stopping the whole calculation. I think (though I would need to look this up to be sure) there is a way of accessing the exception that leads into the "except:" block; if so we can then retrieve this problem and report it back in the spreadsheet. I'll look into this (again hopefully tomorrow!).

PennyWieser commented 4 years ago

Definetly agree that if the model gets stuck it should print a NaN This is also really useful to keep all the data the same size (although this might be more of an issue with the way I handle data in matlab than python) - for my Hawaii stuff my input matrix might be like this where all melt inclusions have a "measured" major element composition I want a PSat for, but only some have a PEC-corrected composition (e.g., olivines not plag), its quicker to run it all through the solubiltiy code, and just get Nans for the PEC data, than have to resize the PEC corrected output so it matches the other and re-merge it with the other matrix. missing values

Simon - I solved a similar problem a while ago in the 1st version Kayla sent but it involved changing from append to .loc (although I wasn't sure if this was really necessary - but that was the rabbit hole I went down googling it). I have commented out the old lines and the new lines should be there - this probably doesnt help much, it was a hack in my 1st week of pythoning

`

satP.append(press) #Same as above changes

#flmass.append(fluid_mass) #Same as above changes
compo_input.loc[index, "SaturationPressure_MPa"] = press
compo_input.loc[index, "FluidMassAtSaturation_grams"] = fluid_mass

flcomp = melts.get_composition_of_phase(xmlout, phase_name='Fluid') # Also put this around it

# flH2O.append(flcomp["H2O"]) #same as above changes
# flCO2.append(flcomp["CO2"]) #Same as above changes
# This new bit basically breaks it if it can't converge, and stops it killing it all
try:
    compo_input.loc[index, "H2Ofluid_wtper"] = flcomp["H2O"]
except KeyError:
    pass
try:
    compo_input.loc[index, "CO2fluid_wtper"] = flcomp["CO2"]
except KeyError:
    pass`
kaylai commented 4 years ago

I think this should be fixed now, since none of the models should error out?

PennyWieser commented 4 years ago

I have this problem again (which might be related to #63 - if I have a dataset as follows, even though the 1st and 3rd row are okay, because the middle row has a NaN/0 value for CO2, and a low H2O, it returns the errors below and doesn't proceed to calculate the rest of the spreadsheet, so effectively returns no output. Is there a way to get it to move on from bad calculations and just spit out NaNs/0s?

EguchiCarbon - InputError: Concentration of CO2 must be greater than 0 wt%. Dixon - InputError: Pressure must be positive DixonCarbon - InputError: Dissolved CO2 concentration must be greater than 0 wt%. AllisonCarbon - InputError: Dissolved CO2 concentration must be greater than 0 wt%. IaconoMarziano - InputError: Pressure must be positive. Shishkina: InputError: X_fluid must sum to 1.0 ShishkinaCarbon: InputError: CO2 concentration must be greater than 0 wt%.

NaNTest.xlsx

kaylai commented 4 years ago

@PennyWieser @simonwmatthews I've fixed this. I implemented a try-except block for every ExcelFile calculation. Right now, I set it so that if the calculation fails for any sample, it returns a value (or values) of 0 and continues on to the next sample. At low volatile concentrations, I get 0's for lots of calculations for non-MagmaSat models. This must be related to #63.

Is returning 0's what we want? We could have it return a string saying "calculation failed" or something?

I also realized that the non-MagmaSat models were not returning values for the temperature, pressure, X_fluid that was set by the user, so I fixed that.

When we decide what to return (0's, some error string), I'll implement it and push the changes.

PennyWieser commented 4 years ago

i'd prefer it to return a Nan then a zero, because it means you can go right ahead and use the output xlsx. in other programs (e.g., matlab, excel, even python) to plot regressions, calculate means, sums etc. without having to remove all the 0s manually to avoid spurious statistical results.

kaylai commented 4 years ago

Agreed, returning 0 is confusing. Is returning NaN explanatory enough?

Update. I have changed all ExcelFile non-MagmaSat calculations to return NaN when they error on a sample, and then move onto the next sample.

However, MagmaSat calculations still return zeroes. It is not as straightforward to implement the return of NaNs with MagmaSat. For non-MagmaSat models, which we wrote, we can have them return an error if they fail. MagmaSat does not do this, due to some quirks in MELTS that have carried over. Instead, the ENKI thermoengine call returns zeroes. I don't want to blanket convert returned zeroes to NaN, since zero might be a valid result. The only calculation I could see implementing that on is calculate_saturation_pressure, which should really never be zero. I've gone ahead and implemented the return of all NaN values for that calculation if the saturation P is less than or equal to 0.

I hate that this is inconsistent, but I'm not sure I see a better solution. Thoughts?

PennyWieser commented 4 years ago

I agree for PSat replace 0 with Nan is great. I don't really get how the Magmasat calculation works, Is there a way to get it to tell you the row/sample that the error happened on? Or can you print an error in a new column, and then could that be used to convert outputs from any row with an error to NaN by replacing afterwards?

kaylai commented 4 years ago

Perhaps there is some way to query whether the MagmaSat calculation converged. I'll look into it.

kaylai commented 4 years ago

I looked into this a bit, but I don't think there is a straightforward way to test whether the MagmaSat calculations are failing, except for with the saturation pressure calculation, which I've already implemented a solution for.

We still need to test what happens when some samples in a batch calculation are outside of the calibration range, but we can't do that until we fully implement the calib range error handling. I will leave this issue open until that is solved.

kaylai commented 4 years ago

Circling back to this. @PennyWieser @simonwmatthews I think we decided that we want to allow the users to perform these calculations even when they are outside of the calibrated range, right? I just tried calculating dissolved volatiles for an ExcelFile object with Shishkina with a temp slightly out of range, and it returned NaN for all values. The warning also showed up, which is great! But, I think we want it to return the values, non NaN's, correct? If so I will look into fixing this.

PennyWieser commented 4 years ago

I think we agreed return values, always, rather than NaNs if the calculation does converge

PennyWieser commented 4 years ago

@simonwmatthews I have also possibly just encountered this problem trying to run the Eguchi calibratoin dataset. only 15 out of the 101 compositions in the calibration dataset returns a PSat. Is this that it hasn't found a solution, or is this that its deciding its out of range, and printing a NaN? These are the errors, but there aren't actually enough of them to explain the number of gaps, unless some errors apply to multiple rows image

simonwmatthews commented 4 years ago

The warnings are totally separate from the returned results- if a NaN is being returned it is because it isn't finding a solution. Can you send me the compositions that are failing please? (Otherwise I can't recreate the problem)

PennyWieser commented 4 years ago

Attached - I'm not sure what to do with H2O - in the original table it just said "n.d." for the ones that I have now replaced with 0s. Eguchi_Test.xlsx

simonwmatthews commented 4 years ago

Which rows are/aren't working? Just one non-working example will do.

PennyWieser commented 4 years ago

Sorry, I attached the whole sheet as almost all of them don't work. 1:3 - gives output, 4 no output, 5 output, 6:13 no output etc. Also even the ones that do output show a terrible agreement with the experimental pressure image (ignore the mac bit in title)

simonwmatthews commented 4 years ago

Ok, I have recreated it now. Starting a new issue.

In case it is helpful for reporting/finding issues, I thought I would explain how I find the origin of problems. The way I troubleshoot things is to call the function for a single sample directly. This means any error messages will be reported. When running a batch calculation, the errors are ignored and hidden, so it's difficult to know what's going on. (Perhaps a useful feature to add would be a way to record/report these errors if asked).

I opened an ipython session in terminal, and ran the following lines of code:

import VESIcal as v sample = {'SiO2':44.1,'TiO2':2.69,'Al2O3':12.8,'Fe2O3':2.69,'FeO':6.62,'MnO':0.19,'MgO':9.14,'CaO':14.34,'Na2O':3.2,'K2O':3.45,'P2O5':0.77,'CO2':2.85} v.calculate_saturation_pressure(model='EguchiCarbon',temperature=1300,sample=sample).result

This then returned the errors: /Users/simonmatthews/repos/VESIcal/VESIcal.py:2602: RuntimeWarning: overflow encountered in exp return Pnp.exp(lnfc) /Users/simonmatthews/repos/VESIcal/VESIcal.py:4157: RuntimeWarning: invalid value encountered in double_scalars CO2_CO2 = ((44.01XCO2)/(44.01XCO2+(1-(XCO2+XCO3))FW_one))100 /Users/simonmatthews/repos/VESIcal/VESIcal.py:4158: RuntimeWarning: invalid value encountered in double_scalars CO2_CO3 = ((44.01XCO3)/(44.01XCO3+(1-(XCO2+XCO3))FW_one))*100 /Users/simonmatthews/repos/VESIcal/VESIcal.py:1603: RuntimeWarning: The pressure is outside the calibration range of the Eguchi & Dasgupta (2018) carbon model, as nan bar is not between 500.0 and 50000.0 bar. The pressure is outside the calibration range of the Kerrick and Jacobs (1981) EOS model, as nan bar is not between 1.0 and 100000.0 bar. warnings.warn(self.calib_check,RuntimeWarning)

Based on the lines numbers, this tells me that something is going wrong before the fugacity function is called. (The final error reports the wrong EOS name, but I've fixed that now)

simonwmatthews commented 4 years ago

Warnings are now silenced when the code is run, but stored in the results DataFrame (and then saved in the excel file). Rather than having individual columns for each check, I've just had it record the full string warning for each calculation. However, having individual columns is definitely a straightforward thing to implement in the future- I will create a new issue for this as a future feature.

However... one slight complication arises... The batch calculations for MagmaSat are not done through the Calculation class interface, and so don't have the calibration checks performed. It would be easy to just add the calibration check code in, but I just wanted to check there's no reason not to just switch to using the Calculate class interfaces for MagmaSat too?

kaylai commented 4 years ago

@simonwmatthews Sorry for just getting to this. Yes, I think changing the ExcelFile class (batch MagmaSat calcs) to conform to the other batch classes makes sense.

I've implemented this on my local code, but it just about doubles the computation time, which is already long. The call to check the calib range, i.e.:

for cr in self.calibration_ranges:
     if cr.check(parameters) == False:
         s += cr.string(parameters,report_nonexistance)
return s

is taking about 4 seconds in a test I did on a single sample. That means in batch calculations, the calibration check is adding an additional 4 seconds per sample. Unfortunately, it's probably due to having to call the MagmaSat routine multiple times, and that's just unfortunately going to be slow due to how thermoengine is written. Profiling the code with cProfile reveals some objectiveC calls that are just slow. So that's not fixable by me probably ever.

Do you think it'd be possible to code up a calibration check routine unique to MagmaSat that checks the computed values after the calculation is complete? Just assert that each value is between lower and upper set bounds? That should be fast, and it could be implemented directly in the ExcelFile() class (then we just tell the code to skip the check_calibration routine if doing batch processing), but I don't know how to call the calibration ranges outside of their class, and I don't want to hard code in the values, since that's not very pythonic or extensible.

LMK what you think. We can chat on zoom/skype/slack if that's easier.

kaylai commented 4 years ago

@simonwmatthews I worked on making this faster, but I have not been able to. It's particularly an issue for MagmaSat calculate_dissolved_volatiles. Let me know if you come up with a solution regarding the calibration checks.

simonwmatthews commented 4 years ago

Sorry, have been meaning to get to this! I will try to work on it today.

simonwmatthews commented 4 years ago

@kaylai So, as far as I can tell the calibration checks are performed after the result is calculated, so I'm really not sure why the call to the calibration check is so slow. Is this still a problem now that I've re-worked the code? I don't think I've changed anything that would make a difference... but you never know!