ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License
66 stars 29 forks source link

Use C++17 support (esp std::filesystem) when available #173

Closed league closed 2 years ago

league commented 2 years ago

Thought I'd convert this branch to a PR, so I can see and gather the changes better.

league commented 2 years ago

Interesting. The reason that the nix build is failing on MacOS is because configure picks up clang as the compiler, and in the regular macos-latest build it uses g++. The configure script reports that clang supports 17:

checking whether clang++ supports C++17 features with -std=c++17... yes

but then the linker fails to find the definitions for the filesystem stuff:

Undefined symbols for architecture x86_64:
  "std::__1::__fs::filesystem::directory_iterator::__dereference() const", referenced from:
      remove_file_glob(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in fileutils.o
  "std::__1::__fs::filesystem::path::__parent_path() const", referenced from:
      remove_file_glob(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in fileutils.o
      get_dirname(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in fileutils.o
  "std::__1::__fs::filesystem::__remove_all(std::__1::__fs::filesystem::path const&, std::__1::error_code*)", referenced from:
      remove_files_recursively(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in fileutils.o
[...]

Of course I can ask it to bring in gcc as a dependency instead, and then it should behave more like the Linux build. (I'm pretty sure gcc is a dependency anyway, because ctypesgen calls it, and substituting clang in its test suite wasn't working.)

league commented 2 years ago

Everything passes now except a codecov (who is rarely happy).

It seems like every CI environment is using -std=c++17 now, so I'm concerned that the pre-17 support (the #else branches) could bit-rot. I'm not sure what (if anything) to do about that. Maybe configure flag for which C++ standard to target? If unspecified, auto-detect as usual. Or could reach back to older images or compilers (rather than ubuntu-latest) but I'd hate to expand the build matrix further, especially with all the duplication we're seeing. (Looking for some ways to avoid duplication on push-to-feature-branch and sync-PR, such as https://github.com/fkirc/skip-duplicate-actions/ – maybe after self-hosted is merged, could look to reorganize or streamline some of the workflows.)

But I guess the principle is that if a configuration is supported, it would be best to continue testing it.

jaycedowell commented 2 years ago

Some cleanup of the testing workflow would be nice once the self-hosted runner is in place. I'm trying to only have one runner running at a time so anything that would remove duplication would help.

With regards to the testing pre-17 support I generally agree that we should try to cover as much as we can with tests but I'm not sure that this particular change is worth a configure flag for testing. Maybe this ultimately becomes more of a roadmap question and we need to establish how long we will support C++XX and PythonY.

league commented 2 years ago

I'm implementing a little C++ test suite that can be called from the python unittest. I wanted to gain some confidence about the <filesystem> (and regex) stuff independently of its usage in map and proclog. I think I did find a small issue under certain contexts where remove_file_glob crashes using the <filesystem> implementation but works with the older ::system one. Seems to have to do with dots in the dir name? like remove_file_glob("/tmp/test.1q25k/*.ptx").

terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
  what():  filesystem error: directory iterator cannot open directory: No such file or directory [/tmp/test\.1q25k]

(Note the \.) Not sure if it's triggered by the dot in $HOME/.bifrost. Anyway, I'd like to have the failing test exposed before attempting a fix. Maybe a simpler interface would just be remove_files_with_suffix(dir, ".ptx") then can check the suffix right in the loop without resorting to regex.

jaycedowell commented 2 years ago

Hrm, that's unfortunate. The simpler interface is probably better since we don't really make use of a full glob. It's more like a "remove all files in this directory but not the directory itself".

league commented 2 years ago

"remove all files in this directory but not the directory itself".

At one place in map.cpp (validate_cache), it's this:

                remove_file_glob(_cachedir + "*.inf");
                remove_file_glob(_cachedir + "*.ptx");
                remove_file(_cachedir + BF_MAP_KERNEL_DISK_CACHE_VERSION_FILE);

and at the other site (clear_cache), just these:

                remove_file_glob(_cachedir + "*.inf");
                remove_file_glob(_cachedir + "*.ptx");

I suppose the version file could be removed in clear_cache too, as it would just get recreated when needed? So then these could both become remove_files_in(_cachedir)?

league commented 2 years ago

One other oddity I just ran into: if compiling with CUDA10 and a gcc that supports c++17, I get:

> Building CUDA source file transpose.cu
> nvcc fatal   : Value 'c++17' is not defined for option 'std'
> make[1]: *** [autodep.mk:52: transpose.o] Error 1

It's not a problem with CUDA11.

jaycedowell commented 2 years ago

CUDA 11 supports C++17, CUDA 9 and 10 support C++14, and CUDA 8 supports C++11.

Update: afe65be now checks which standard CUDA supports and updates the NVCC flags accordingly. configure will also fail if CUDA doesn't support at least C++11.

jaycedowell commented 2 years ago

Here's a new one: with g++ 7.5.0 configure says that it supports C++17 but fileutils.cpp fails to compile because it cannot find #include <filesystem>. The man page for g++ says that it supports only up to C++1z. We could use the __cpp_lib_filesystem feature test macro but that only works after g++ 5.

codecov-commenter commented 2 years ago

Codecov Report

Merging #173 (3e2320f) into master (6074d0e) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #173   +/-   ##
=======================================
  Coverage   58.37%   58.37%           
=======================================
  Files          67       67           
  Lines        5744     5744           
=======================================
  Hits         3353     3353           
  Misses       2391     2391           

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 6074d0e...3e2320f. Read the comment docs.

jaycedowell commented 2 years ago

Things seem ok on a couple of qblocks tests I ran.

league commented 2 years ago

Things seem ok on a couple of qblocks tests I ran.

I feel like c++17/filesystem is ready to merge. If there are odd compiler combos that become problematic, we can address it as needed.

jaycedowell commented 2 years ago

I've taken care of the compiler strangeness that I've run into so I'm okay with merging. The only other thing we need is a blurb for the CHANGELOG.

jaycedowell commented 2 years ago

Digging into this more it looks like the underlying problem is that AX_CXX_COMPILE_STDCXX will accept -std=c++1z as being an acceptable substitute for -std=c++17. I think that's how we end up in a situation where configure will say good to go and then we have a build failure. My new approach is a new config/filesystem.m4 test to explicitly check if we can compile/link with std::filesystem.

jaycedowell commented 2 years ago

Now it should work as intended.

jaycedowell commented 2 years ago

I think this is now everything I want to put into the PR. The iffy area is C++20 support since we don't have good workflow coverage on that (it looks to only be hit on Build and Test/macos-latest).

league commented 2 years ago

Yeah, I'm looking into further testing with C++20. After some false starts, I'm finding a gcc11 that passes the configure checks:

checking whether g++ supports C++20 features with -std=c++20... yes
checking for C++ std::filesystem support... yes
checking for C++ std::string::ends_with support... yes

Will report in a few.

league commented 2 years ago

I have a successful compilation and test suite with C++20 using g++ 11.2.0 which uses <filesystem> and ends_with. This configuration is not part of CI, but it could be fairly easily if I just slightly rearrange the packages exported by my flake. Could be merged though.

league commented 2 years ago

Leaving some debugging notes here on tracing the issue with get_mtu in Socket.hpp (only used on Mac). We've seen it return zero in various configurations. Here's a small test program:

#include <iostream>
#include "Socket.hpp"
int main()
{
  sockaddr_storage sa = Socket::address("127.0.0.1", 8123, AF_INET);
  int mtu = Socket::discover_mtu(sa);
  std::cout << mtu << std::endl;
  return 0;
}

For the most part it seems like with clang it outputs 16384, and with gcc zero.

jaycedowell commented 2 years ago

The 16384 is what I get on my laptop for 127.0.0.1. For 8.8.8.8 I get 1500. Both of those values match what I see in ifconfig. 0 is just weird.

jaycedowell commented 2 years ago

That was with clang 13.1. I'm trying to compile this test with gcc11 now.

league commented 2 years ago

Isolating the behavior differences further, the compilers are producing different results for the ::get_family at the very top of get_mtu. Here's my latest test program, where I'm extracting pieces out of Socket.hpp to find the difference...

#include <iostream>
#include "Socket.hpp"
int main()
{
  sockaddr_storage sa = Socket::address("127.0.0.1", 8123, AF_INET);
  // int mtu = Socket::discover_mtu(sa);
  int fd = ::socket(AF_INET, SOCK_DGRAM, 0);
  std::cout << "Socket on FD " << fd << std::endl;
  sockaddr addr;
  socklen_t len;
  int ret = ::connect(fd, (sockaddr*)&sa, sizeof(sockaddr));
  std::cout << "connect returned " << ret << std::endl;
  ret = ::getsockname(fd, &addr, &len);
  std::cout << "getsockname returned " << ret << std::endl;
  sockaddr_in* ina = reinterpret_cast<sockaddr_in*> (&addr);
  std::cout << "sockaddr_in* is " << ina << std::endl;
  std::cout << "FAMILY " << (short)ina->sin_family << std::endl;

  // std::cout << mtu << std::endl;
  return 0;
}

g++ seems to always say FAMILY 0 and clang is getting different values, such as FAMILY 175, 63, 207. I wonder if the reinterpret_cast is dodgy. I'm not completely familiar with these sockaddr structures and what's supposed to work. Right now, with this test, I can't distinguish between gcc 10 and gcc 11 though!

jaycedowell commented 2 years ago

What about changing ina->sin_family to addr.sa_family?

league commented 2 years ago

Just tried addr.sa_family and it always matches ina->sin_family.

league commented 2 years ago

Wheeee, maybe I found the issue.... patch coming.

jaycedowell commented 2 years ago

Ha ha, I hope so. I've been sitting here trying to figure out why I get two very different things between clang and gcc.

league commented 2 years ago

The nice thing about my discovery is that it is definitely something that could vary depending on the vagaries of compiler and version. You'll laugh, you'll cry. :) Just trying it out locally on the real python test (not just my local hacked up C++ test).

league commented 2 years ago

And it doesn't have to do with C++ references as in socket-reference… AFAICT.

jaycedowell commented 2 years ago

What's so annoying about this is that I did do that in get_mtu but not get_family.

league commented 2 years ago

Oh right! That somehow didn't jump out at me either… I didn't find it by comparing those two sections. It's just that the sin_family or sa_family was coming back different, though the ret was always zero. And when sa_family was wrong, it could be different values on different runs. Once I isolated it to that function, I decided maybe I should actually read the manual for getsockname! :shrug: Along the way, I actually verified (by writing a little hex dump function) that all 128 bytes of sockaddr_storage were getting passed around (by value) without loss or corruption. Never did get gdb to work properly with my test program on mac.