ojschumann opened this issue 1 month ago
That's a peculiar problem. I have seen a few instances where it can become slow like this, but I have not figured out why that happened. I would not be surprised if it has to do with the use of the 2D any_views that are used to expose the data on the Python side.
Can I ask you to try out the develop branch? Once you have compiled the Python bindings, you will also need to take the tools..so file from the _deps/tools-build subdirectory and put it next to the ENDFtk..so file for the ENDFtk bindings to work in this new version.
This new version removes the range-v3 dependency and replaces it with another implementation, and I'd like to know whether this issue persists with this new version.
Sure you can ask me to test the dev branch ;-)
I had some issues with the spdlog library, which was found in several of the build directories in my source folder (like in some njoy and ACEtk directories). Seems to be some weird cmake thing.
But I finally managed to compile the development branch. The figures are:
1.15 s for the yield data and 245 µs for the fission cross section. So the speedup is about 3.3 for the fission yield data and 6 for the 1D cross section (which is a quite small vector with 710 entries). My guess is that a lot of object creation/destruction happens on access to the 2D data.
spdlog has a tendency to install itself in the .cmake/packages folder (as does Eigen, it would seem). The new version of ENDFtk has updated CMakeLists.txt files that ensure that no longer happens. So hopefully, you'll never have to worry about that again.
I'm happy to see these speedups. Seems like we did something right by removing range-v3 :-)
Still, this 2D data seems to be a problem. I might just add fission_yield_values and fission_yield_uncertainties properties to extract the data as single columns - if that is something you would be happy with.
ok, already learned something about cmake ;-)
I don't know how "fixed" the interface of ENDFtk is. So changing from 2D data to two (fast) 1D arrays would be totally OK for me from the user's perspective. But I could also just wait the ~2 minutes my script is running.
From the programmer's perspective, I could not be happy with that. I need a nice, fast solution, not a workaround ;-) A few more numbers: the FissionYieldData::Y() method is only called 77 times during the Python access, so there are no excessive calls on that side. And if I write a C++ version of my Python code, I can access all the data in ~16 ms, so there is no flaw on that side either. Thus I suspect the performance issue lies entirely in the Python access to DoubleRange2D.
Is this DoubleRange2D a fixed thing? I have had very good experience with exposing Fortran and C++ structures to Python via the xtensor library. The data shows up on the Python side as numpy arrays and can also be made read-only. If you do it right, there is no data copying at all and you can do the calls in ~µs. If you are interested, I could try to get a proof of concept working. Then you can decide if you would like to go with that.
BTW: Is my C++ version the intended syntax? Or is there an easier way to deal with the std::variant?
#include <ENDFtk.hpp>
#include <iostream>
#include <chrono>

int main() {

  auto tree = njoy::ENDFtk::tree::fromFile( "/opt/test/nuclear_data/GEFY61data/gefy61_nfy/Pu239" );
  auto mat = tree.materials().front().parse();
  auto file = std::get<njoy::ENDFtk::file::Type<8>>(mat.file(8));
  auto section = std::get<njoy::ENDFtk::section::Type<8, 454>>(file.section(454));

  std::chrono::time_point<std::chrono::system_clock> start =
      std::chrono::system_clock::now();

  int n = 0;  // counts the incident energy points
  for (const auto& y : section.yields()) {

    double yield_sum { 0.0 };
    for (const auto& yd : y.Y()) {

      yield_sum += yd[0];
    }
    std::cout << "sum of yields: " << ++n << " " << yield_sum - 2.0 << "\n";
  }

  std::chrono::time_point<std::chrono::system_clock> stop =
      std::chrono::system_clock::now();
  auto dur = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
  std::cout << "time to process yields: " << dur << " µs\n";

  return 0;
}
Before I get into implementation details, let me address your question about the variants. This is indeed one of the ways of using variants (the other one being the use of std::visit), although you can get away with a single std::get in this case instead of two. You can actually write the following:
auto tree = njoy::ENDFtk::tree::fromFile( "/opt/test/nuclear_data/GEFY61data/gefy61_nfy/Pu239" );
auto mat = tree.materials().front();
auto section = std::get<njoy::ENDFtk::section::Type<8, 454>>(mat.section(8,454).parse());
or even:
auto tree = njoy::ENDFtk::tree::fromFile( "/opt/test/nuclear_data/GEFY61data/gefy61_nfy/Pu239" );
auto section = std::get<njoy::ENDFtk::section::Type<8, 454>>(tree.materials().front().section(8,454).parse());
Basically: you can get to the ENDF material/file/section that you want to interact with and then parse it so there is no need to parse the MF8 data in its totality. Most of the time, this is what I do: go to the section I need in the tree and then parse it.
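For completeness, a sketch of the std::visit alternative could look like the following (illustrative only, reusing the includes from your program above plus <variant> and <type_traits>, and only the calls already shown in this thread):
auto tree = njoy::ENDFtk::tree::fromFile( "/opt/test/nuclear_data/GEFY61data/gefy61_nfy/Pu239" );
std::visit(
  [] ( const auto& parsed ) {
    // only act on the MF8 MT454 alternative of the variant returned by parse()
    if constexpr ( std::is_same_v< std::decay_t< decltype( parsed ) >,
                                   njoy::ENDFtk::section::Type< 8, 454 > > ) {
      int points = 0;
      for ( const auto& energyYields : parsed.yields() ) { ++points; }   // same loop as above
      std::cout << "number of incident energies: " << points << "\n";
    }
  },
  tree.materials().front().section( 8, 454 ).parse() );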
Now for the topic at hand. The fission yield data (and data similar to it) consists basically of 4 columns of information: the fission product identifier ZAFP, the isomeric state FPS, the yield value Y and its uncertainty DY.
The ENDF LIST record stores this data as a flat array (which is how we read it). In the latest release of ENDFtk, we use range-v3 to interpret the data in this flat array using lightweight iterator-based views. range-v3 is a library that was used as a prototype for the C++20 and C++23 ranges standard, although it offers a lot more functionality than the standard provides. The replacement for range-v3 in the develop branch is a C++17 implementation of parts of the C++20 and C++23 ranges standard (we cannot move to C++20 or even C++23 just yet due to constraints at LANL).
The views in range-v3 and the views in the C++ standard allow us to rearrange the flat array in a more convenient way without having to rearrange memory. To access the 4 columns, we can write code like this:
values | std23::stride(4) // for column 1
values | std20::drop(1) | std23::stride(4) // for column 2
values | std20::drop(2) | std23::stride(4) // for column 3
values | std20::drop(3) | std23::stride(4) // for column 4
or when we look at the yields and uncertainties as a pair:
// first: drop 2 values
// second: transform the resulting sequence of values into a sequence of pairs
// third: transform the sequence of pairs into a sequence of pairs at index 0, 2, 4, ...
values | std20::drop(2) | std23::chunk(2) | std23::stride(2)
These "views" generate composite types (templated types) that are different depending on the order of sequencing in the operations and the operations themselves. In the first example given above we use two different type of views. The second example introduces a third type.
We have to provide Python bindings for each one separately. However, they always look like "arrays": they have a begin() and end(), an operator[], a size(), etc. (it does depend on the underlying interators on what interface is available). To simplify the bindings, we decided to use type erased arrays. In range-v3, this is provided by any_view and in the develop branch by njoy::tools::AnyView. Both are implementations of a type erased view. Type erasure does require a lot more pointer indirections.
Before returning a C++ view, we transform it into an appropriate AnyView, e.g. AnyView< random_access, double > for a view of double values that can be accessed as a random access array (akin to a vector). As a result, we only have to provide bindings for a few different types such as DoubleRange = AnyView< random_access, double >, IntRange, etc. And of course their 2D and 3D equivalents.
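To make this a bit more concrete, here is a small self-contained sketch of the pattern (illustrative only, not the actual ENDFtk code), written with the released range-v3 names; the develop branch uses the std20/std23 equivalents shown above instead:
#include <range/v3/view/any_view.hpp>
#include <range/v3/view/drop.hpp>
#include <range/v3/view/stride.hpp>
#include <vector>
#include <iostream>

// a type erased view of doubles with random access, similar in spirit to DoubleRange
using DoubleRange = ranges::any_view< double, ranges::category::random_access >;

int main() {

  // flat LIST data: ZAFP, FPS, Y, DY repeated for every fission product
  std::vector< double > values = { 1001., 0., 0.01, 0.001,
                                   1002., 0., 0.02, 0.002 };

  // column 3 (the yield values), type erased so that a single binding type suffices
  DoubleRange yields = values | ranges::views::drop( 2 ) | ranges::views::stride( 4 );

  for ( double yield : yields ) { std::cout << yield << "\n"; }   // prints 0.01 and 0.02
}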
All this to say this: the DoubleRange, DoubleRange2D, etc. are basically implementation details that we used to bind the various views coming out of ENDFtk. And as your numbers show, it is not a bad solution for the 1D versions. For the 2D version, something internally (most likely in the any_view/AnyView's type erasure) is causing a big issue.
Yes, I already saw the ranges, but I'm not very used to that construct (yet).
I further pinned down the problem. It is basically related to how array() loads the values. It iterates over the first dimension and sees an iterator as the "value". So for each element of the first dimension, it calls the len method (maybe to check that this is not a ragged nested sequence) and then iterates along the second dimension. To load all 209132 values of my yield data file, a total of 104643 calls to len and the same number of calls to iter have been made. And if I understand the output correctly, array() and ENDFtk both live in C++ space, but the loading is conveyed through Python, so there is a C++ - Python - C++ context switch for every call, 208k in total! That would be 2.6 µs per switch, which is not that bad. So the underlying problem is the "list of lists" approach, which does not scale well with 2D data and numpy arrays. And I guess things get worse with 3D. I doubt that you can ever get good performance with it. Unfortunately I have to work tomorrow and can't play the whole day with this problem. But I may implement a quick solution with xtensor in the evening, just to show my idea.
As for my question, I found a faster way: as I only need the fission yields and not the uncertainties, I can save a lot of work by only iterating along the first axis and then accessing the first value of each row directly, thus avoiding a lot of context switches.
Y[n] = array(yd.fission_yields) # my initial version, needs 1.14 s on develop branch for all 77 yield data
Y[n,:,0] = [x[0] for x in yd.fission_yields] # needs only 0.17 s for all 77 yield data
For ENDFtk 1.1.0 the numbers are 3.95 s and 1.51 s respectively, so the development code is way faster.
I did a quick proof-of-concept patch for one double and one int property: fc0a400383cd2d5acbb5d0ca4329daf5b7f48c8c. It uses pybind11::array to generate a numpy array object that points to the data. For doubles, this is essentially zero-copy. As ints are stored as doubles, a copy always has to take place for those!
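For readers who want to see what that approach looks like, here is a rough, self-contained sketch of the idea (not the actual patch; the YieldData struct, the stored values and the property name are stand-ins):
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <vector>

namespace py = pybind11;

// stand-in for the real ENDFtk component: a flat array of ZAFP, FPS, Y, DY quadruplets
struct YieldData {

  std::vector< double > values;
  std::size_t NFP() const { return this->values.size() / 4; }
};

PYBIND11_MODULE( yields_poc, module ) {

  py::class_< YieldData >( module, "YieldData" )
    .def( py::init( [] () {

      return YieldData{ { 1001., 0., 0.01, 0.001,
                          1002., 0., 0.02, 0.002 } };
    } ) )
    .def_property_readonly(

      "Y",
      [] ( py::object self ) {

        const auto& data = self.cast< const YieldData& >();

        // an NFP x 2 numpy array of (Y, DY) pairs that points straight into the
        // C++ data: the row stride skips the ZAFP and FPS values, and passing
        // self as the base handle keeps the C++ object alive, so no copy is made
        return py::array_t< double >(
                 { data.NFP(), std::size_t( 2 ) },                 // shape: NFP x (Y, DY)
                 { 4 * sizeof( double ), sizeof( double ) },       // strides in bytes
                 data.values.data() + 2,                           // first yield value
                 self );
      } );
}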
Benchmarks for the Fission Yields:
for n,yd in enumerate(y.yields): # iterate over all energy dependent yields
Y[n] = yd.Ypy # 835 µs ± 1.59 µs
Y[n,:,0] = [x[0] for x in yd.fission_yields] # 169 ms ± 307 µs
Y[n] = array(yd.fission_yields) # 1.16 s ± 1.71 ms
Access to a single element
%timeit yd.Ypy[500,1] # 531 ns ± 3.93 ns
%timeit yd.fission_yields[500][1] # 3.31 µs ± 11 ns
Benchmarks for the Fission ZAIDs. For the integer array, the numbers are closer together.
Loading of the complete array
%timeit zaid = yd.ZAFP.to_list() # 25.7 µs ± 45.1 ns
%timeit zaid= array(yd.ZAFP) # 430 µs ± 1 µs
%timeit zaid = yd.ZAFPpy # 1.89 µs ± 5.43 ns
Interesting: .to_list() is much faster than converting via numpy.array(), though still much slower than the new array property.
Access to only a single element:
%timeit z = yd.ZAFP[1305] # 1.6 µs ± 1.47 ns
%timeit z = array(yd.ZAFP)[1305] # 436 µs ± 421 ns
%timeit Z = yd.ZAFPpy[1305] # 1.92 µs ± 8.4 ns
Again, converting to a numpy array first is much slower than indexing the iterator-based view directly. And this time, the "zero-copy" solution is slightly worse than the plain view.
Last remark: I just realized that I did not include a return value policy.
Very interesting. I would have thought the type erasure to be the biggest culprit.
When we wrote the bindings for these ranges, we did not know pybind11 as well as we could, so it might be a good idea to revisit these bindings.
That being said, I'm working on a new library that would completely remove the concept of "formats" (https://github.com/njoy/dryad, latest branch is fix/review). It basically provides a format-agnostic data layer that I can populate using ENDF files and ACE files (or any other format that can be added later). I can interact with both ENDF and ACE files using the same interface. Due to my user experience with ranges (they are useful on the C++ side but exposing them to the Python side is tricky), I am not using them as return types in this new library. I'm basically returning const vector& all the time instead of some type of view or range (I can do this because I can lay out the data in the library the way I want). As a result, we shouldn't have these issues in that library - at least I hope. It cannot read the entirety of ENDF and ACE just yet (it does electroatomic and photoatomic data so far), but in the current state of that library, I might try to do some timing tests like you did.
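A minimal illustration of that return style (a hypothetical class, not actual dryad code):
#include <utility>
#include <vector>

class TabulatedCrossSection {

  std::vector< double > energies_;   // the data is stored the way it is exposed
  std::vector< double > values_;

public:

  TabulatedCrossSection( std::vector< double > energies, std::vector< double > values ) :
    energies_( std::move( energies ) ), values_( std::move( values ) ) {}

  // plain references to contiguous storage: trivial to bind to Python (pybind11 can
  // copy or wrap them) and no type erased ranges are involved
  const std::vector< double >& energies() const { return this->energies_; }
  const std::vector< double >& values() const { return this->values_; }
};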
I listened to your talk about dryad at the MCNP Symposium and found it very interesting.
If you are still in the design phase and can change things: I think xtensor is very nice. It is like Eigen, but with "unlimited" dimensions, and you can pass these very easily to Python with zero-copy, where they appear as numpy arrays (very similar to the pybind11 solution). If you use these as a replacement for std::vector<> you get very easy 2D and 3D containers.
But did I understand you correctly: Dryad will eventually replace ENDFtk?
From my side, this issue is basically settled: to increase the performance, you could change the design for the 2D and 3D double views, but it's entirely up to you whether you really want to go that way. It maybe also depends on your intended main use of ENDFtk: more a C++ library with optional Python bindings, or more a Python lib...
But thank you for all the nice comments and explanations!
Dryad is intended as the backbone for a modern NJOY, to help users who are not familiar with the different formats, and to add more capabilities like linearization, etc. As such, it will "replace" ENDFtk. Dryad uses ENDFtk under the hood to populate the data layer, so we'll never get rid of ENDFtk - although the Python interface might become less important to us.
I'll have a look at xtensor to see if it is something we can use. Thank you for pointing to that.
Following this discussion, I'll still add additional interface functions to ENDFtk to access the yield values and uncertainties as 1D arrays. The current interface will stay the same, so you won't be affected.
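Based on the view pipelines shown earlier in this thread, such accessors could conceivably look something like this on the C++ side (purely a sketch; the names and the values() accessor are illustrative):
auto fissionYieldValues() const {

  // every fission product contributes ZAFP, FPS, Y, DY to the flat LIST array,
  // so the yield values sit at flat index 2, 6, 10, ...
  return this->values() | std20::drop( 2 ) | std23::stride( 4 );
}

auto fissionYieldUncertainties() const {

  // and the uncertainties sit at flat index 3, 7, 11, ...
  return this->values() | std20::drop( 3 ) | std23::stride( 4 );
}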
Dear all,
while using the ENDFtk lib, I stumbled across a performance issue when accessing fission yield data. That access is about a factor of 9 slower than e.g. accessing cross section data. Is there a better access pattern than what I did, or is this an issue with ENDFtk?
What I did
I used the TENDL2017 and GEFY 6.1 fission yield data from the Fispact website. I accessed these ENDF files with ENDFtk through Python. My IPython session looks like the following:
This is "only" a 415 kB/s transfer rate. On the other hand, when I load a cross section:
I get about 3.8 MB/s.
Any idea why the yield data access is that slow? From my experience, it is possible to transfer large amounts of data from the C++ side to Python in µs when one can avoid a copy. And even with a copy, the nearly 4 seconds for some yield data is quite long...