idaholab / raven

RAVEN is a flexible and multi-purpose probabilistic risk analysis, validation and uncertainty quantification, parameter optimization, model reduction and data knowledge-discovering framework.
https://raven.inl.gov/
Apache License 2.0

Computing Sobol Indices through a ROM #645

Closed AlvaroRDP closed 6 years ago

AlvaroRDP commented 6 years ago

Computing Sobol indices through a ROM and a given set of data

What did you expect to see happen?

I'm currently trying to use the HDMR ROM to compute the Sobol total indices. I've already managed to do so by using a Sobol sampler, generating the corresponding ROM, and printing these values as an outstream. However, I now have a given set of data (several hundred experiments with 5 input values and one output), and my objective is to do exactly the same thing: compute the Sobol index of each input variable from the information in that data set. In short, the process I have in mind consists of importing the data set from a database, generating the HDMR ROM, and finally exporting the indices as an outstream. Is it simply not possible to compute the Sobol indices without a Sobol sampler, or is there a way to do so with an already given data set, without sampling from distributions?

Summary: I expected to be able to generate an HDMR ROM from a given set of data (as is possible with other ROM types) so that I could export the Sobol indices given by the reduced order model as an outstream.

What did you see instead?

The code asks for data extracted from a Sobol sampler; it doesn't let me build the ROM with data that originated from another sampling strategy. The error message is:

RuntimeError: ROM has not yet been initialized! Has the Sampler associated with this ROM been used?

Do you have a suggested fix for the development team?

I've also tried to import the data as a custom sampler, but the result is the same.

Please attach the input file(s) that generate this error. The simpler the input, the faster we can find the issue.

[image: codexml]

The database that I use has this form (the image is from the equivalent .csv file):

[image: database]


PaulTalbot-INL commented 6 years ago

Hi, AlvaroRDP, and welcome!

Because of the tight restrictions on the samples taken by the Sobol sampler in order to be used in the HDMR ROM, in RAVEN we require the ROM be hooked to the sampling strategy. However, this won't be as big of an issue as it sounds like!

When I want to construct an HDMR ROM from existing samples, I run the samples through a "restart" of the Sobol sampler. That is, create a MultiRun Step with the Sobol sampler connected to the HDMR ROM you want, and then set the DataObject with the pre-sampled data as the Restart for that Sobol sampler. This way, you won't have to run your code again, and the samples will be pre-treated with the necessary information for the ROM to handle.

Also, in general this kind of discussion might best be had through the user mailing list (inl-raven-users@googlegroups.com) before creating an issue. That's fine, though, we're happy to help either way.

Let me know if you have any additional concerns!

AlvaroRDP commented 6 years ago

Hi, Paul! Thank you very much for your answer. I think I understood the idea of how to do it, and I'll be sure to post on the mailing list next time. However, for some reason I still get the exact same error message. Here is what I've done according to your suggestions:

- In the Samplers node: added a Restart node with the DataObject that contains the input values and the associated outputs.
- In the Steps node: created a MultiRun node with the following nodes:
  - Sampler: the Sobol sampler defined previously, with the Restart node.
  - Input: the PointSet with the input data.
  - Model: I don't know what should be used in this case, since I already have the desired inputs and outputs as a PointSet from the extracted database.

PaulTalbot-INL commented 6 years ago

Yes, there are some settings that need to be propagated from the Sobol Sampler into the HDMR ROM, so the points must be sampled through a MultiRun step that includes the Sampler being restarted using your existing points. Which Model you pass the points through doesn't really matter, since they'll end up being restart points anyway; if you didn't sample these points from RAVEN using a Model, then you can use the "dummy" pass-through model.

For example, let's assume I have the sampled points in a PointSet called "mySamples". My MultiRun step might look like the following:

<MultiRun name="restart">
  <Input class="DataObjects" type="PointSet">mySamples</Input>
  <Sampler class="Samplers" type="Sobol">mySobol</Sampler>
  <Model class="Models" type="Dummy">myDummy</Model>
  <Output class="DataObjects" type="PointSet">restartedPoints</Output>
</MultiRun>

AlvaroRDP commented 6 years ago

Alright, thank you very much for your help. I think I've managed to pass the data through a MultiRun with the dummy model: the data stays unchanged (the Sobol sampler has restart points), and up to this point there seems to be no problem. However, when training the HDMR ROM (step RomTrainer), I still get an error, though a different one from before. It is a KeyError in this case, so I don't know whether it's related to RAVEN's internal Python code or whether I'm passing the information in the wrong way. I've attached a screen capture of the error message in case it is something trivial that someone already knows how to solve.

Anyway, if you have some spare time to take a look at the small input file that I wrote, I would be very grateful if anyone could tell me where the main error is, since honestly I'm stuck here. I'll keep trying to figure it out in any case. The .xml file that I'm using is also attached to the message. Thank you very much for your time.

fromDataToSobol.zip

[image: errorsobol]

PaulTalbot-INL commented 6 years ago

I ran your case and got the same error, so we should be on the same page.

Looking at the output of my run, however, I'm not seeing any restart points in the step GenerateData; rather, the DataObject reports "No matching restart point found (floats)" for every point, which means the points required to train the ROM are not present in Out.

Where did the data for Out come from originally? Is it a model you ran in a different RAVEN run, or are these samples collected independently?

AlvaroRDP commented 6 years ago

Yes, I see it too; for some reason the code doesn't recognize the restart points. Each of the input variables stored in Out comes from a typical Monte Carlo sampling of five Normal distributions with mean zero and sigma 20, and the associated response Y is generated through an external model that is nothing but a linear combination of the five input variables. This regression model is just a simple one I used to test the code, and its formula is as follows:

Y = 0.1*X1 + 0.5*X2 + 7*X3 - 2*X4 + 3*X5

To summarize, I sample each variable X1 to X5 from a normal distribution, and then for each MC sample I compute the associated output Y. The generated database stores all these values, and it is from them that I want to compute the Sobol indices.
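For reference, a linear test model like this one has Sobol indices that can be written down exactly, which gives a target to check any computed values against. The sketch below assumes the five inputs are independent (as they would be under plain Monte Carlo sampling); for an additive model the first-order and total indices then coincide and equal a_i^2 * sigma_i^2 / Var(Y). The variable names are illustrative only.

```python
# Analytic Sobol indices for Y = sum(a_i * X_i) with independent inputs.
# For an additive model, Var(Y) = sum(a_i^2 * sigma_i^2), so the first-order
# and total indices are the same: S_i = a_i^2 * sigma_i^2 / Var(Y).

coefficients = {"X1": 0.1, "X2": 0.5, "X3": 7.0, "X4": -2.0, "X5": 3.0}
sigma = 20.0  # same standard deviation for every input, as stated in the post

# Contribution of each input to the variance of Y.
partial_variances = {name: (a * sigma) ** 2 for name, a in coefficients.items()}
total_variance = sum(partial_variances.values())

# Normalised contributions are the Sobol indices.
sobol_indices = {name: v / total_variance for name, v in partial_variances.items()}

for name, s in sobol_indices.items():
    print(f"{name}: {s:.6f}")
```

Because every input shares the same sigma, the indices reduce to a_i^2 / sum(a_j^2), so X3 dominates and X1 is almost negligible.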

In case it helps, I've also attached the code where I create the MC Sampling, the outputs of the External Model, and generate the associated database.

SamplingFromModel.zip

AlvaroRDP commented 6 years ago

Alright, regarding the problem with the restart points, I think I found the issue; please correct me if I'm wrong. The Sobol sampler always uses fixed points in the input space (based on Sobol sequences, I imagine), as do other samplers such as the grid-based ones, but unlike pseudo-random samplers such as Monte Carlo. So if we try to impose a certain set of points on the Sobol sampler through a restart, the code automatically detects that those restart points do not belong to the set of points it needs to use. The easiest way to solve this problem is to add a restartTolerance node to the Sobol sampler with a fairly generous tolerance, so that each time the sampler finds that the restart points do not match the ones it expected, it matches each restart point to the closest point in its predefined set; "closest" meaning within a Euclidean distance smaller than the value we set through the restartTolerance node.
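The matching idea described above can be sketched in a few lines. This is only an illustration of the description in this comment, not RAVEN's actual restart code: the helper `match_restart_points`, the point values, and the choice of Euclidean distance are all assumptions taken from the wording of the post.

```python
import math


def match_restart_points(required, available, tolerance):
    """Map each required point to its nearest available point within tolerance.

    Returns a dict {required_point: matched_point or None}.
    Hypothetical helper illustrating the idea; not RAVEN's matching code.
    """
    matches = {}
    for req in required:
        best, best_dist = None, float("inf")
        for cand in available:
            dist = math.dist(req, cand)  # Euclidean distance (Python 3.8+)
            if dist < best_dist:
                best, best_dist = cand, dist
        # Accept the nearest candidate only if it is close enough.
        matches[req] = best if best_dist <= tolerance else None
    return matches


# Points the sampler insists on (made-up values) and points already in hand.
required = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
available = [(0.02, -0.01), (0.9, 0.3), (5.0, 5.0)]

matches = match_restart_points(required, available, tolerance=0.1)
for req, hit in matches.items():
    print(req, "->", hit)
```

With a small tolerance only the first required point finds a match. Raising the tolerance lets more points match, but the response stored for a matched point is then treated as if it had been evaluated at the required location, which is a source of error.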

In short, to be able to use the restart points, we approximate each of them by the closest point in the Sobol sampler's set, within a certain tolerance. However, when I do that, even though there are no more problems with the restart points, I still get an error when trying to train the HDMR ROM:

KeyError: (0.0, 0.0, 0.0, 0.0, 0.0)

If anything I wrote does not seem to be right, feel free to correct me. Thank you very much.

PaulTalbot-INL commented 6 years ago

Ah, yes, our implementation of the HDMR ROM makes use of Smolyak sparse quadrature to construct generalized polynomial chaos representations of the data, so it uses fixed points. We do not currently have a tool to compute Sobol coefficients directly from arbitrarily sampled data.

There is a fairly easy workaround for this, especially if you have a pretty good sampling of the space. You can train a surrogate model (K nearest neighbors is the simplest, but whatever you might want to use) on your data, then use that surrogate as a model to be sampled by the Sobol sampler at quadrature nodes and collect points to train the HDMR ROM, which will give the Sobol statistics.

This is quite similar to the greatly expanded restart tolerance idea you suggested above, but will guarantee that you get the right points sampled. It does introduce whatever error is accrued from training the surrogate on your experimental data points, but you also end up with a surrogate of your experiment as a bonus.
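Outside of RAVEN, the shape of this workaround (fit a surrogate to the existing data, then evaluate the surrogate at points of your own choosing to estimate the indices) can be sketched in plain Python. Two substitutions are made here and should be read as assumptions: a linear least-squares fit stands in for the surrogate (K nearest neighbors or Kriging would slot into the same place), and a Monte Carlo pick-freeze estimator (Jansen's formula for total indices) stands in for the Sobol sampler plus HDMR ROM. The data is regenerated from the linear test model quoted earlier in the thread; all function names are hypothetical.

```python
import random


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_linear_surrogate(inputs, outputs):
    """Least-squares fit of y = w0 + sum(w_k * x_k) via the normal equations."""
    design = [[1.0] + list(x) for x in inputs]
    n = len(design[0])
    gram = [[sum(row[i] * row[j] for row in design) for j in range(n)]
            for i in range(n)]
    moment = [sum(row[i] * y for row, y in zip(design, outputs))
              for i in range(n)]
    weights = solve_linear_system(gram, moment)
    return lambda x: weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def total_sobol_indices(model, dim, n_samples, sigma, rng):
    """Jansen pick-freeze estimate of total indices for N(0, sigma) inputs."""
    a = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_samples)]
    b = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_samples)]
    f_a = [model(x) for x in a]
    f_b = [model(x) for x in b]
    pooled = f_a + f_b
    mean = sum(pooled) / len(pooled)
    variance = sum((v - mean) ** 2 for v in pooled) / (len(pooled) - 1)
    indices = []
    for i in range(dim):
        # Row of A with entry i taken from B: only input i differs in the pair.
        f_ab = [model(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(a, b)]
        num = sum((fa - fab) ** 2 for fa, fab in zip(f_a, f_ab)) / (2 * n_samples)
        indices.append(num / variance)
    return indices


rng = random.Random(0)  # fixed seed so the run is repeatable

# Stand-in for the existing data set: 300 Monte Carlo samples of the test model.
true_coeffs = [0.1, 0.5, 7.0, -2.0, 3.0]
data_x = [[rng.gauss(0.0, 20.0) for _ in range(5)] for _ in range(300)]
data_y = [sum(c * v for c, v in zip(true_coeffs, x)) for x in data_x]

surrogate = fit_linear_surrogate(data_x, data_y)
total_indices = total_sobol_indices(surrogate, dim=5, n_samples=5000,
                                    sigma=20.0, rng=rng)
for k, s in enumerate(total_indices, start=1):
    print(f"X{k}: total index ~ {s:.4f}")
```

The estimates carry Monte Carlo noise on top of whatever error the surrogate introduces, so they should be compared with a reference value only to within a few percent.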

Let me know if this is something you want to pursue and if you get stuck in the details, we can help you sort out the workflow. I've added a note to find and implement a tool that would compute the Sobol coefficients from arbitrary data; however, I have some deadlines approaching and may not be able to get to it for a bit. The workaround should produce some useful results, I think.

PaulTalbot-INL commented 6 years ago

(note for myself to check on this later)

SALib is pretty lightweight (under 1 MB), MIT license, and should do what we want. However, it is in conda-forge, not straight conda.

AlvaroRDP commented 6 years ago

Alright, I've implemented your suggestion and there were no problems. I trained a Kriging surrogate model with the original data, and afterwards sampled through Sobol to train an HDMR model and compute the indices. As you said, there's obviously a small discrepancy between the values of the indices calculated directly from the original model and those obtained by sampling from the Kriging ROM; however, the relative error stays well below 0.1% in this case, so at this point your proposed solution was more than satisfactory.

Thank you very much for all your help. If there's anything you want to add, or if you want to contact me for anything else, do not hesitate to do so.

PaulTalbot-INL commented 6 years ago

Glad it worked out! Thanks for your feedback.