Open · DmitriGoloubentsev opened this issue 2 years ago
Thanks for reaching out, Dmitri!
I think the point was to demonstrate GPU speed up rather than direct comparison to QL. We would very much welcome a contribution for a better CPU benchmark!
Sounds good! I'll come back to you later on CPU benchmark for this.
Also, you do not include graph optimization time in the reporting (the reported timing is the `# Second run (excludes graph optimization time)` one).
I know it does not depend on the number of paths, but it is still part of the total pricing time, and for the QL CPU execution it is zero.
Shouldn't you report this separately?
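For what it's worth, one way to report it separately is to time the first call (which includes tracing and XLA compilation) against a second call. A minimal sketch, with `price` as a stand-in for the actual pricing function rather than the colab's code:

```python
import time
import tensorflow as tf

@tf.function(jit_compile=True)  # XLA-compiled, as in the colab's XLA run
def price(spots):
  # Stand-in for the actual pricing graph; any non-trivial TF computation works.
  return tf.reduce_mean(tf.exp(spots))

x = tf.random.normal([1_000_000], dtype=tf.float64)

t0 = time.time(); _ = price(x).numpy(); t_first = time.time() - t0   # tracing + XLA compile + run
t0 = time.time(); _ = price(x).numpy(); t_second = time.time() - t0  # compiled graph only
print(f"first call: {t_first:.3f}s, second call: {t_second:.3f}s, "
      f"approx. compilation overhead: {t_first - t_second:.3f}s")
```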
On second thought, if you simulate 100 time steps and only apply one exp() at the end, you don't really do much calculation per path.
So your problem is basically reduced to a competition between RNG algorithms.
You should somehow increase the complexity of your SDE. Perhaps use a Heston local vol model to make this benchmark more relevant to the real world. With flat vols, flat rates and a simple normal process, I don't know how relevant this benchmark is for practitioners.
What random generator is used if `PSEUDO_ANTITHETIC` is set? For QL you don't use antithetic. I suspect antithetic sampling halves the number of required random numbers... Am I correct about this?
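For context, a plain NumPy sketch of the antithetic idea (independent of how the library actually implements `PSEUDO_ANTITHETIC`):

```python
import numpy as np

rng = np.random.default_rng(0)
num_paths = 100_000

# Antithetic sampling: generate only num_paths / 2 normals and reuse their
# negatives, so half as many random numbers are drawn for the same path count.
z_half = rng.standard_normal(num_paths // 2)
z = np.concatenate([z_half, -z_half])

# One-step lognormal terminal values driven by the antithetic draws.
vol, t = 0.2, 1.0
spot_t = 100.0 * np.exp(-0.5 * vol**2 * t + vol * np.sqrt(t) * z)
print(spot_t.mean())  # typically lower-variance than the same estimate with i.i.d. draws
```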
There is an additional question about memory consumption, especially when running with XLA optimization. I get the following warning when running the example with just `num_timesteps=5000`, without XLA:

`2022-11-17 14:50:46.371456: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 4000000000 exceeds 10% of free system memory.`

(Memory available: 16G plus 22G swap.) And there is a kernel crash when running with XLA with these parameters (`ResourceExhaustedError: Out of memory while trying to allocate 72003200088 bytes. [Op:__inference_price_eu_options_1037]`). Are there settings which control the limit for memory allocation?
To answer @DmitriGoloubentsev's question: yes, antithetic sampling does use fewer random samples. I think we could measure the time it takes to simulate the random numbers and then subtract that from the runtime. At the time of writing that colab I was mainly motivated by the GPU speed-up rather than by comparing CPU performance. The colab can be extended to sample from the Heston model as well (you just need to update the GenericItoProcess drift and volatility definitions), for example along the lines of the sketch below.
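A rough, hypothetical sketch of what that extension could look like; the parameter values are made up and the exact `sample_paths` signature may differ across tf_quant_finance versions:

```python
import numpy as np
import tensorflow as tf
import tf_quant_finance as tff

dtype = tf.float64
# Hypothetical (uncalibrated) Heston parameters, purely for illustration.
kappa, theta, sigma, rho, mu = 1.5, 0.04, 0.3, -0.7, 0.0

def drift_fn(t, x):
  # State x = [log_spot, variance], shape [num_samples, 2].
  v = tf.maximum(x[..., 1], 0.0)  # crude truncation to keep the variance non-negative
  return tf.stack([mu - 0.5 * v, kappa * (theta - v)], axis=-1)

def vol_fn(t, x):
  v = tf.maximum(x[..., 1], 0.0)
  sqrt_v = tf.sqrt(v)
  zeros = tf.zeros_like(sqrt_v)
  # Lower-triangular (Cholesky) factor of the instantaneous covariance,
  # which encodes the spot/variance correlation rho.
  row1 = tf.stack([sqrt_v, zeros], axis=-1)
  row2 = tf.stack([rho * sigma * sqrt_v,
                   np.sqrt(1.0 - rho**2) * sigma * sqrt_v], axis=-1)
  return tf.stack([row1, row2], axis=-2)  # shape [num_samples, 2, 2]

process = tff.models.GenericItoProcess(
    dim=2, drift_fn=drift_fn, volatility_fn=vol_fn, dtype=dtype)

paths = process.sample_paths(
    times=np.linspace(0.01, 1.0, 100),
    num_samples=10_000,
    initial_state=np.array([np.log(100.0), 0.04]),
    time_step=0.01,
    random_type=tff.math.random.RandomType.PSEUDO_ANTITHETIC,
    seed=42)
print(paths.shape)  # [10000, 100, 2]: log-spot and variance paths
```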
As for graph compilation time, normally you would deploy a TensorFlow graph to avoid any compilation time overhead.
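One possible reading of "deploying a graph", shown here as a generic TensorFlow sketch rather than code from the colab: trace the pricing function once with a fixed input signature and save it as a SavedModel, so later runs reuse the stored graph instead of re-tracing (XLA compilation, if enabled, may still happen at runtime):

```python
import tensorflow as tf

class Pricer(tf.Module):
  # Fixed input signature -> the function is traced once and is not re-traced
  # for every new batch of market data with the same shape and dtype.
  @tf.function(input_signature=[tf.TensorSpec([None], tf.float64)])
  def price(self, spots):
    # Stand-in for the real pricing computation.
    return tf.reduce_mean(tf.exp(spots))

tf.saved_model.save(Pricer(), "/tmp/pricer")    # serialize the traced graph once
restored = tf.saved_model.load("/tmp/pricer")   # later processes load and reuse it
print(restored.price(tf.constant([0.1, 0.2], dtype=tf.float64)))
```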
@SergK13GH, the samples are precomputed for vectorization purposes. You could switch to `tff.math.random.RandomType.PSEUDO` and set `precompute_normal_draws=False` in the sampler. We try to vectorize computations where possible to ensure good GPU performance. One could, of course, rewrite the whole thing using while loops, but then you'd lose the benefits of vectorization. As for memory-controlling measures, I think you'd need to handle that on your side.
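A minimal self-contained sketch of that suggestion; the process here is a trivial 1-d lognormal stand-in rather than the colab's model, and keyword names may vary across tf_quant_finance versions:

```python
import numpy as np
import tensorflow as tf
import tf_quant_finance as tff

dtype = tf.float64
# Trivial 1-d lognormal process as a stand-in for the colab's model.
process = tff.models.GenericItoProcess(
    dim=1,
    drift_fn=lambda t, x: -0.5 * 0.2**2 * tf.ones_like(x),
    volatility_fn=lambda t, x: 0.2 * tf.expand_dims(tf.ones_like(x), -1),
    dtype=dtype)

paths = process.sample_paths(
    times=np.linspace(0.001, 1.0, 1000),  # many time steps, as in the report above
    num_samples=10_000,
    initial_state=np.array([0.0]),
    time_step=0.001,
    random_type=tff.math.random.RandomType.PSEUDO,  # stateful RNG, drawn step by step
    precompute_normal_draws=False,  # avoids materializing all normal draws up front
    seed=42)
print(paths.shape)  # [10000, 1000, 1]
```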
> As for graph compilation time, normally you would deploy a TensorFlow graph to avoid any compilation time overhead.
Sorry, can you please elaborate on what "deploy a TensorFlow graph" means?
Do you assume you can compile the graph once and use it for all valuations in the future?
I can see how that may work for a simple case (flat model parameters and the same number of time steps).
But am I right that in real problems you need to recompile the graph every day, for all models and all trades? I.e. as trades age, model parameter interpolations change, trade cash flows are paid and simulation time points move (they are usually defined with respect to the current time), you need to redefine the valuation graph and hence recompile it.
I think you can only reuse a valuation graph within the same trading day, and it is still a good idea to report how much time and memory this step needs.
Simulating a normal process with an Euler scheme over 1000 time steps is a very basic problem. What happens when you have 1000 IR swaps to price for xVA? Your graph is going to be huge and the compilation time significant, regardless of whether you use a GPU or a CPU.
Hi guys,
In the "Monte Carlo via Euler Scheme" example you compare TF with QuantLib pricing and conclude that TF finance is x100 times faster(or more).
I want to note that in QL you evolve 100 time steps of Log Normal process, but in TF you work in log space and only apply exp() at the end. I agree QL may not be very fast, but in this example you compare 100 exponents per path in QL to just 1 exponent in TF...
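A small NumPy sketch to make the exp() count concrete (illustrative numbers only, not the colab's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
num_paths, num_steps, dt, vol = 100_000, 100, 0.01, 0.2
dw = vol * np.sqrt(dt) * rng.standard_normal((num_paths, num_steps))

# "QL-style": evolve the lognormal spot itself -> one exp() per step and per path.
s = np.full(num_paths, 100.0)
for i in range(num_steps):
    s = s * np.exp(-0.5 * vol**2 * dt + dw[:, i])

# Log-space: accumulate the increments and exponentiate once at the very end.
s_log = 100.0 * np.exp(np.sum(-0.5 * vol**2 * dt + dw, axis=1))

print(np.max(np.abs(s - s_log)))  # same paths (up to rounding), ~100x fewer exp() calls
```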
Thank you!