scipy / xsf

Special function implementations
BSD 3-Clause "New" or "Revised" License
3 stars 1 forks source link

RFC: Plan for establishment of xsf library for special function scalar kernels #1

Open steppi opened 2 weeks ago

steppi commented 2 weeks ago

Some background and history

In August of 2023 Irwin Zaid (@izaid) made the following comment on the pull request https://github.com/scipy/scipy/pull/19023,

Something tangential: I know this issue is about the array API, but it might be good to explore whether the C versions of the special functions can be made very easily "copy-and-pasteable" to include in other array libraries. This way, although the C implementation isn't a defined standard itself, it would help out a lot of other projects if SciPy has something other libraries can easily steal. Example: how much of CuPy's special implementation is just taken directly from SciPy's, with __device__ in front of everything? Could SciPy just have a lot of header files for special functions that other projects can easily include?

For context: the array API, short for the Python array API standard, was proposed to reduce fragmentation in the Python array computing world by offering a common API for an important subset of array computing functionality. The goal is that code written to the array API standard should work with any conforming array backend (e.g. NumPy, PyTorch, CuPy, Jax, etc.). Although the array API standard is restricted to a core set of important features, there can be optionalextensions, standardizing coherent sets of functionality. There are currently extensions for Linear Algebra and Fourier transform functions. It seems natural that there should be a special function extension[^1] codifying consistent naming and argument conventions for a set of the most widely used special functions, but due to inconsistent support for special functions across array libraries, an extension restricted only to functions which are currently uniformly available among the most widely used array libraries would be missing many useful functions. It is much easier for these array libraries to add a new special function when there is an existing implementation that can be re-used with at most small modifications.

At the time, SciPy could not provide header files for many of its special functions that other projects can easily include because SciPy special was a polyglot submodule with special function scalar kernels written in each of C, C++, Cython, and Fortran 77. Scalar kernels written in Cython and Fortran 77 would require translation to run on the C/C++ based CUDA platform for use in GPU aware array libraries, and we observed that it was precisely the functions with such scalar kernels that were most likely to be missing from other array libraries.

Replacing the Fortran 77 codebase in scipy.special had already been under active discussion because the Fortran scalar kernels, particularly those from Zhang and Jin's[^2] specfun codebase, were a source of a sizable number of bug reports, which were challenging to address due to the need to work with old, unstructured, and often low quality Fortran 77 code. My first major contributions to scipy.special had begun in June of 2021, when I began rewriting the Fortran scalar kernel for hyp2f1 into Cython[^2], which was the prefered language at the time for new scalar kernels in scipy.special. Through this work, I'd already begun to think that C++ might be a more suitable choice due to features like overloading on input type, template metaprogramming, better cross-platform support for complex numbers, its extensive standard library, because it can lead to significantly smaller binaries than Cython, and because of the challenges of calling scalar kernels written in Cython from compiled code not written in Cython.

In an October 2023 meeting between Irwin, myself, and fellow SciPy maintainer Matt Haberland (@mdhaber) to discuss the draft proposal for a special function array API extension Matt was working on, Irwin explained his vision and I suggested we start by translating special function scalar kernels that are written in Cython into C++. I posted an RFC, https://github.com/scipy/scipy/issues/19404, asking for thoughts on rewriting these Cython scalar kernels in C++, which was well received. I followed up by translating lambertw from Cython into C++, https://github.com/scipy/scipy/pull/19435, and then began working on translating other scalar kernels. Irwin in turn followed up with https://github.com/scipy/scipy/pull/19601, adding CUDA support for this lambertw implementation. In January 2024 I submitted https://github.com/cupy/cupy/pull/8140 to add lambertw to CuPy. Starting in November 2023, I was able to work on this project 18 hours per week with support from the 2020 NASA ROSES, Reinforcing the Foundations of Scientific Python [^4] through my employer Quansight Labs, aiming to bring GPU support to as much of scipy.special as feasible. Funding for travel, hardware, cloud, and other miscellaneous costs in service of this project has been provided by the 2023 NumFOCUS small development grant Streamlined Special Function Development in SciPy.

Irwin has devoted a great amount time to seeing this project through on a volunteer basis when his work allows.

We are now in a position where it is within reach to add GPU support for all special functions in SciPy except those where the need to manage nontrivial amounts of state makes this unworkable. It was a great help that SciPy maintainer Ilhan Polat (@ilayn) translated all of the Fortran code in scipy.special into C as part of his general effort to free SciPy from outdated and difficult to maintain Fortran 77 code ^5. Boost Math has recently begun to offer GPU support for special functions and internal tools used within special function code ^6, which means we can build upon their work.

Over the past year, @izaid and I have frequently made comments on GitHub stating our goal to split the C++ special function code we've been working on into a separate library which could be a shared dependency (as a submodule) of SciPy, CuPy, and other array libraries, and there is support for this from the CuPy developers ^7. This RFC is the first step towards really making this happen. The name xsf was chosen by @izaid, based on the name sf for the C++ standard library extension for mathematical special functions.

Concept

We propose splitting SciPy special C++ codebase @izaid and I have been working on into a header only C++ library of inline special function scalar kernels which could be included and used in either C++ or CUDA projects (either with NVCC or the NVRTC jit. The identification macro __CUDACC__ can be used to detect if a CUDA source file is being compiled, and __CUDACC_RTC__ can be used to detect if the NVRTC jit is being used. Conditional compilation can then be used to adapt the headers for the platform on which they are running.

The simplest adaption is to define a macro XSF_HOST_DEVICE which expands to __host__ __device__ when a CUDA source file is being compiled, and otherwise an empty string.

#ifdef __CUDACC__
#define XSF_HOST_DEVICE __host__ __device__
#else
#define XSF_HOST_DEVICE
#endif

Host refers to the CPU side of a system and device to the CUDA capable GPU side. Defining a function with both the __host__ and __device__ qualifiers tells the compiler the compile the function for both the CPU and GPU (having both is needed for some array libraries). Function signatures in the xsf library then look like:

XSF_HOST_DEVICE inline std::complex<double> lambertw(std::complex<double> z, long k, double tol)

Note that the inline keyword is used here not because we expect the compiler to inline the long and somewhat complex implementation of lambertw, but because this is what's needed in a header only C++ library to allow the same function to be used in multiple translation units under the One Definition Rule.

We rely heavily on NVIDIA's libcudacxx [^8], an implementation of much of the C++ standard library that is part of the CCCL (CUDA core compute libraries) suite. Since we only implement scalar kernels, no explicit management of parallelism or other GPU specific programming techniques are needed; it's sufficient to write scalar kernels which use the subset of the standard library that is supported on NVRTC [^9], and which do not make nontrivial memory allocations, use recursion, or use other features which are not compatible with the highly parallel but resource constrained nature of GPU computation [^10].

Currently a header config.h is used whose essential purpose is to include the headers xsf uses from the standard library when compiling for CPU, but for CUDA, to instead inject the standard library functions, templates, and types xsf uses from the libcudacxx's cuda::std namespace directly into the std namespace. This was a convenient because it let us write our code as if it was just regular C++ code that uses the standard library, but it is generally not considered good practice to modify the std namespace, and doing so leads to undefined behavior according the C++ standard ^11. Since these headers will no longer just be part of SciPy's (and CuPy's through copy and paste) internal codebase, we should adopt a more principled approach to aliasing to the correct standard library utilities for a given platform.

Scope

We hope to include scalar kernels for all scipy.special ufuncs and gufuncs in xsf, and to enable GPU support for as much of them as feasible. Since @mborland and @jzmaddock have also begun adding GPU support to mathematical functions in Boost Math, a question that could arise is why there needs to be a separate project doing this. I think it's important that we coordinate with them to avoid duplication of effort. I envision Boost Math being a dependency of xsf, and for us to wrap Boost Math when it provides a quality GPU capable implementation of a function. Currently scipy.special provides many functions which are not available in Boost Math and provides complex-valued versions of others for which Boost Math only provides real valued ones. There may be other functions for which the Boost implementation cannot be easily made GPU capable, but for which we would like to maintain a GPU capable implementation. Over time, I'd hope to see work done in xsf make it's way into Boost in some form when that makes sense, but

  1. @izaid and I are more familiar with the xsf (née SciPy special) codebase
  2. I don't expect it to be straightforward to graft most code from xsf into Boost due to differences in architecture and style
  3. We wouldn't want to overwhelm the Boost Math developers with a flood of work to review.

So for the time being, I think it makes more sense for the work on xsf to proceed independendly of Boost. We will be able to move more quickly this way. In the future I anticipate a natural delineation will form over what makes more sense to be implemented in Boost vs xsf based on differences in project goals and standards.

Migration Plan

The initial steps would be:

  1. Copy the folder scipy/special/xsf in the SciPy repo to include/xsf in the xsf repo, taking care to preserve the commit history of scipy/special/xsf, perhaps using steps from an answer to this relevant StackOverflow question.
  2. Add a test suite to xsf. The simplest thing would be to xsf's ufunc creation tools to create Python wrappers and test these through pytest. To get started, we could just copy over the relevant scipy.special tests, making the minor changes to them that would be needed to get them to work. In the future we would build out a more thorough and consistent test suite.
  3. Set up CI on GitHub over a range of platforms. We could start only with CPU CI, since the goal is just to replicate what we currently have in SciPy, but in the not too far future the goal would be to set up GPU CI as well, following the approach used in scikit-learn ^12.
  4. Go through the process of setting up xsf as a submodule of SciPy and CuPy.

There are headers in scipy/special/xsf which are not GPU capable, or which are in principle, but use things from the standard library which are not yet available in CuPy (config.h as it currently exists on SciPy's main branch is one of them). Compilation with NVRTC will fail at runtime if one tries to call a CuPy ufunc with scalar kernel including one of these headers. This is something which will need to be solved before xsf can be included as a submodule of CuPy. Currently what we're doing is taking stuff like this out of the config.h that we copy into CuPy when submitting PRs to add functions there. Instead we should either avoid things that are not available in CuPy (or perhaps spend some time figuring out how to get them to work in CuPy).

Priorities for the rest of the year.

Very roughly, the current status of the scipy special ufunc codebase is:

  1. All of cephes has been translated into C++ ^13, and most of it should work on GPU in principle (except things like distribution functions for the kolmogorov distribution which depend on double double arithmetic, and the real valued implementation of hyp2f1, which uses recursion), but we haven't tested all of it.
  2. All of amos and specfun have been translated into C++, but only a few things in specfun and none of amos, have been made GPU capable.
  3. An improved version of root finding infrastructure needed to get cdflib working on GPU has been finished, and gdtrib has been made GPU capable ^14. I've translated much of cdflib's internal code (for functions that get inverted with the root finding tools), but have found that cdflib's implementations tend to be worse than the implementations in Boost and cephes, so none of this work translating cdflib has found it's way into SciPy.
  4. Scalar kernels for lambertw, wright_bessel, digamma, loggamma, hyp2f1, sici, sinpi, and cospi have been translated from Cython to C++, but a handful remain, the most complicated being orthogonal_eval.pxd for orthongal polynomial evaluation.
  5. Faddeeva (complex valued statistical distribution functions) is already implemented in C++, but these have not been moved into headers in xsf and made GPU capable.
  6. We have not yet attempted to use the scalar kernels SciPy takes from Boost on GPU.
  7. Elliptic integral implementations in ellint_carlson are written in C++, but we have not moved them into headers in xsf or made them GPU capable.

Through the end of the year I have funding for 287.89 hours of work from the 2020 NASA ROSES grant. There are 14 weeks remaining in the year, which means I am funded to spend 20.5 hours per week for the rest of the year. All of the work above should

For the remainder of the year, I'd like to prioritize the following things [^15]:

  1. Writing quality documentation. It's important to write quality documentation for how to use the xsf library, but moreso, I want to emphasize documentation for how to contribute to xsf, how to use it's internal tools, the development workflow, advice for special function implementors. Over the years, SciPy has had many people show hope who wanted to help fix things in special, but weren't able to due to the burden of working with, and maintainers needing to review updates to the messy, at the time mostly Fortran, special codebase. I think there's a lot of potential to find people who'd like to help out, as long as it's straightforward enough for them to get started.
  2. Development of a comprehensive test suite, with comparisons against arbitrary precision reference implementations [^16], over test cases which are automatically sampled based on the domains of the functions and configuration information such as whether we want to sample in linear or logspace. Additional test cases for critical regions where a function is known to be difficult to compute could be added manually. Such a test suite will make it much easier to review code for xsf, which could help a prevent a review bottleneck from slowing our development pace and discouraging contribution.
  3. Not of lower priority than the previous two items, and something which can be done within a relatively short timeframe: figuring out how to get Boost working in CuPy, integrating Boost into xsf, and documenting how to use Boost in xsf.
  4. Working on low hanging fruit: Things which would be easy to get working on GPU, but which we just haven't gotten to yet. Examples include, remaining Cython scalar kernels and most of cdflib, parts of amos and specfun which don't rely on recursion or internal state. The goal here is essentially to maximize the number of functions made GPU capable by the end of the year.

Request for Comment

Does anyone have any objections to splitting the scipy.special scalar kernel codebase into a separate repo, xsf, which is managed under the SciPy organization? Does anyone anticipate any technical or non-technical difficulties that I may be unaware of? One thing I think is really important to get right is testing and CI, so I plan to make a separate RFC this week just to discuss these issues. The main thing is that requiring a full SciPy build for xsf CI for each platform would be very time consuming, but some things like cython_special can only be tested in SciPy. Another thing that needs discussion is what the xsf release cycle would look like, and the process for updating xsf in SciPy.

@izaid has agreed to be a maintainer of xsf, and is willing to spend time to review and merge my pull requests. This should allow development of xsf to move faster than special function development in SciPy, where lack of bandwidth from qualified reviewers has lead to a significant bottleneck. Any current SciPy maintainers who are interested in getting involved and want commit rights or triage rights should let me know.

[^1]: An RFC for such an extension was proposed by SciPy maintainer Matt Haberland (@mdhaber) in December 2023, https://github.com/data-apis/array-api/issues/725, written with input from Irwin and I.

[^2]: Shanjie Zhang, Jianming Jin, Computation of Special Functions, Wiley, 1996, ISBN: 0-471-11963-6, LC: QA351.C45.

[^4]: #80NSSC22K0405 - Reinforcing the Foundations of Scientific Python. Awarded jointly to NumPy, pandas, SciPy, and scikit-learn. https://numfocus.medium.com/numfocus-projects-receive-nasa-grants-deee374e7a57

[^8]: Just days ago, AMD published its own libhipcxx implementation of the C++ standard library, which should allow us to also support AMD GPUs with minimal effort on our end.

[^9]: NVRTC is more restrictive and supports a smaller subset of the standard library.

[^10]: There are some special functions for which managing dynamic state is needed for more accurate or efficient computation in some or all parameter regions (e.g. Spheroidal Wave Functions). In these cases, conditional compilation may be needed to swap out implementations based on the platform for which the code is compiled. We may also decide and document that certain functions or parameter regions won't be supported for all platforms.

[^15]: It will be difficult, but developing a GPU capabable implementation of Miller's recurrence algorithm to perform argument reduction with a recurrence in cases where the recurrence is only stable in one direction (and not the one we want), is needed for us to implement accurate Bessel and hypergeometric functions which work on the GPU. This is something I've been hoping to work on for a while now, but I don't think it should be priortized over the four items below.

[^16]: mdhaber began work on the mparray library, which could be used as a home for arbitrary precision reference implementations.

ilayn commented 2 weeks ago

I think this is the right thing to do and I am very happy that you folks are doing it since you also have all the domain knowledge.

I think it is high time that we consider this detail seriously to remove extra bloat. I noticed this while I was wrapping bunch of C extensions that Cython indeed introduces quite some ballast.

Unfortunately, I can't offer much in terms of special functions stuff but happy to offer a hand when we are reintegrating xsf to SciPy with Cython stuff or maybe we don't even need the Cython glue anymore as I started to create Python extensions directly from C, similarly Cython folks can extern import xsf too so we can keep the header declarations only.

[^1]: We would still test wrapping functionality but not the correctness of the functions

lucascolley commented 2 weeks ago

Big +1!! Maybe one day SciPy will really be a Python(-only) library :)

Copy the folder scipy/special/xsf in the SciPy repo to include/xsf in the xsf repo, taking care to preserve the commit history of scipy/special/xsf, perhaps using steps from an answer to this relevant StackOverflow question.

@asmeurer may be able to advise after converting numpy.array_api to array-api-strict.

Any current SciPy maintainers who are interested in getting involved and want commit rights or triage rights should let me know.

I'd be happy to help out here and there with triage rights.

steppi commented 2 weeks ago
  • Not sure if you would appreciate the joy of C++ code testing, to keep xsf precision or correctness, but in my opinion, we don't need to test xsf 1 in SciPy which would be major win for SciPy and whoever adopts xsf along the way.

Yes, completely agree. Testing function correctness in xsf and having light tests for wrappers in SciPy definitely seems like the way to go.

  • Similarly, xsf does not need to build SciPy for the sake of cython_special, I think along the way, we figured out the magical ingredients in that corner of SciPy and probably maintainers can handle that much complexity. This would simplify things substantially for the library itself through separation of concerns.

Agreed, it would be much better to avoid testing cython_special in xsf. I think you're right that we won't need to do this.

Unfortunately, I can't offer much in terms of special functions stuff but happy to offer a hand when we are reintegrating xsf to SciPy with Cython stuff or maybe we don't even need the Cython glue anymore as I started to create Python extensions directly from C, similarly Cython folks can extern import xsf too so we can keep the header declarations only.

Thanks @ilayn, that would still be really helpful. I think right now the benefit cython_special offers is that it offers fused type wrappers for functions which support multiple argument types. Once we have everything in headers, we could just make it a .pxd file instead of an extension module though, which should save a lot of bloat.

I'd be happy to help out here and there with triage rights.

Thanks @lucascolley

mborland commented 2 weeks ago

Even though I'm not a Scipy maintainer I'm more than willing to assist with this if there's things you need help with. I agree it may be hard to completely adapt your code to Boost, and I don't believe your intent is to support arbitrary precision types either. Please correct me if I'm wrong.

On the CI front it's easy to get running with CMake. Something to be aware of is the runtime required for NVRTC testing. It takes about 3-5x longer to run the same tests on NVRTC vs NVCC. I think we're paying 7 cents a minute for CUDA runners so this quickly becomes a non-trivial cost (That linked run would be about $6.58).

One point of advice would be since this is a new project push your language standard as far forward as possible. Working around broken compiler support for C++11 is a huge pain. C++17 would help get rid of lots of the template spaghetti that exists in Boost.Math (We are currently C++14). That would also help NVRTC out a lot because I think it's really hard to parse all the templates given our policy based design. C++20 you can basically make anything happen at compile time but I don't think that matters a ton here. Concepts are pretty nice though. Developers on Reddit routinely complain about Boosts support for old language standards and I'm sure that's driven people away from wanting to contribute.

lucascolley commented 2 weeks ago

Something to be aware of is the runtime required for NVRTC testing. It takes about 5x longer to run the same tests on NVRTC vs NVCC. I think we're paying 7 cents a minute for CUDA runners so this quickly becomes a non-trivial cost (That linked run would be about $6.58).

Not to derail the conversation but cc @aterrel @nabobalis @mriduls in case there is anything to say from a NumFOCUS infra perspective.

One point of advice would be since this is a new project push your language standard as far forward as possible.

SciPy is currently at C++17.

steppi commented 2 weeks ago

Thanks @mborland for the offer to help. If I remember correctly, @izaid and I did briefly discuss supporting arbitrary precision at some point (but in the context of, Boost does this, maybe we should too), but it's not a priority. Thanks for sharing your CUDA workflow; I think I might look into setting up caching so that a CUDA kernel is only recompiled with NVRTC in xsf's CI if one of the headers the kernel depends on (possibly transitively) has been modified.

Agreed on the language standard. Like @lucascolley said, SciPy is currently at C++17 and so is CuPy, so that's what we're going with.

asmeurer commented 2 weeks ago

@asmeurer may be able to advise after converting numpy.array_api to array-api-strict.

I used git-filter-repo. Here's the command I used. https://data-apis.org/array-api-strict/#relationship-to-numpy-array-api. It took several tries to get it all right so I recommend doing extensive testing.

mborland commented 2 weeks ago

Thanks @mborland for the offer to help. If I remember correctly, @izaid and I did briefly discuss supporting arbitrary precision at some point (but in the context of, Boost does this, maybe we should too), but it's not a priority.

Sounds good. If you don't need arbitrary precision support I would skip it, because everything will be significantly simpler. For example in the Boost.Math Bessel i0 implementation all the calculations for builtin types are pre-computed arrays. This approach is quite clean and compiles quickly even with NVRTC. Since you have C++17 you can simply this even further with a chain of if constexpr (std::is_same_v<float, T>) ... instead of using tag dispatching on std::integral_constant like we do. I've had to workaround a few places where NVRTC does not want to work with tag dispatching.

Agreed on the language standard. Like @lucascolley said, SciPy is currently at C++17 and so is CuPy, so that's what we're going with.

C++17 is a good place to be.

fancidev commented 2 weeks ago

This is really great! I like every bit of the plan (maybe no need for arbitrary precision for now). I think it is highly valuable that the routines are easily usable by other projects. Looking forward to it!

izaid commented 1 week ago

I, of course, am a big fan of this plan. There is something I've been wondering about though.

The machinery for creating NumPy ufuncs and gufuncs lives in xsf , as I think it should. I also think xsf should test through pytest using the ufunc and gufunc machinery. So essentially xsf needs to create the ufuncs and gufuncs.

Does this mean we are going to ship a single Python C extension with all the NumPy ufuncs and gufuncs, something equivalent to what is now https://github.com/scipy/scipy/blob/main/scipy/special/_special_ufuncs.cpp? And special then is only responsible for what is now cython_special? Of course, if special wants to add on behaviour or make different ufuncs, it's free to do that.

Likewise for CUDA testing. To do that, we need to rely on CuPy, hence we'll need to make the CuPy ufuncs and gufuncs as well.

The alternative option to this is C++ testing. Which is fine, Google Test is actually extremely good. But I'd rather keep things simple with tests remaining in Python.

fancidev commented 1 week ago

I am hoping to be able to call xsf functions both from Python and from C++, and I have a use case for exactly this.

https://github.com/scipy/scipy/pull/20786 wants to add a js_div special function to be used by scipy.spatial.distance. The distance functions (unfortunately) have two independent implementations: one in Python and one in C++. The Python implementation would call the special function with NumPy array through the ufunc interface (with all the broadcasting and stuff). The C++ implementation would call the kernel directly without going through any wrapper.

How to achieve this seems not obvious in the current set-up. With the spin-off of xsf into a header-only library, plus the generation of Python bindings, I guess it will become rather straightforward once xsf lands.

steppi commented 1 week ago

@izaid, in the short term I think we can just have duplicate ufuncs in xsf and SciPy. Then we defer deciding on whether to move to C++ only testing or whether we should split off SciPy’s ufuncs into a separate Python package. For the time being, testing in Python is most convenient since we can copy over existing tests, and it should be easier to find contributors to work on test code.

ilayn commented 1 week ago

I think you will get hurt if you copy the existing test suite. As we discussed previously, the scipy tests are very unergonomic and very hard to decipher due to extensive dependency injection. I do think by paying some up front investment you would benefit a lot in the future to design your own tests. But that's what I meant above that C++ testing is "acquired taste", I don't know about google test but it is quite hard to compete with pytest ergonomics outside python. So you might want to keep the python bindings but that's up to you. I have no comments either way.

For the ufuncs, we can keep them over the scipy side if need be or get them from xsf. The seeming complexity is because now the translation units and ufunc code are intertwined but if you rip xsf out I can help to reorganize those parts on the scipy side.

steppi commented 1 week ago

I think you will get hurt if you copy the existing test suite. As we discussed previously, the scipy tests are very unergonomic and very hard to decipher due to extensive dependency injection. I do think by paying some up front investment you would benefit a lot in the future to design your own tests.

Yeah, totally agreed that the existing test suite has a lot of problems. My current plan is to just copy over the existing tests this week so we can get something working as soon as possible, and then to make developing a proper test suite a priority over the next month (while getting rid of the original tests). That way we can get the separation out of the way quickly without having to pause special function development while the new test suite is being built.

fancidev commented 1 week ago

except things like distribution functions for the kolmogorov distribution which depend on double double arithmetic

Maybe if you could open a separate issue later to elaborate a bit the problem encountered with double_double? I have used it in _iv_ratio_c, and have come across the need for compensated addition in pow1p.

I guess dd_real from cephes might have caused some trouble on the GPU as it uses the volatile keyword in compensated addition to force rounding. If this is the culprit, then I think it could be worked around by replacing volatile with some tricks, and I’d like to give a try at that.

steppi commented 1 week ago

I've submitted a pull request here, https://github.com/scipy/xsf/pull/2, to migrate the header files to xsf/include/xsf while preserving history. Thanks @asmeurer for the suggestion to use git-filter-repo. @izaid, you need to be added to the SciPy org before I can add you to the XSF Team, but until then, if you approve I'll self merge.

Maybe if you could open a separate issue later to elaborate a bit the problem encountered with double_double? I have used it in _iv_ratio_c, and have come across the need for compensated addition in pow1p.

@fancidev, sorry, I haven't looked into why double_double arithmetic is failing on CUDA, have no insight into what the cause may be, and have no elaboration to give. As I continue getting more functions working on CUDA, I want to start with low-hanging fruit, basically things which I'm confident I could do immediately without having to spend time problem-solving, so that's why I've excluded double_double from immediate consideration.