Closed league closed 2 years ago
Interesting. The reason that the nix build is failing on MacOS is because configure picks up clang
as the compiler, and in the regular macos-latest build it uses g++
. The configure script reports that clang supports 17:
checking whether clang++ supports C++17 features with -std=c++17... yes
but then the linker fails to find the definitions for the filesystem stuff:
Undefined symbols for architecture x86_64:
"std::__1::__fs::filesystem::directory_iterator::__dereference() const", referenced from:
remove_file_glob(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in fileutils.o
"std::__1::__fs::filesystem::path::__parent_path() const", referenced from:
remove_file_glob(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in fileutils.o
get_dirname(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in fileutils.o
"std::__1::__fs::filesystem::__remove_all(std::__1::__fs::filesystem::path const&, std::__1::error_code*)", referenced from:
remove_files_recursively(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) in fileutils.o
[...]
Of course I can ask it to bring in gcc as a dependency instead, and then it should behave more like the Linux build. (I'm pretty sure gcc is a dependency anyway, because ctypesgen calls it, and substituting clang in its test suite wasn't working.)
Everything passes now except a codecov (who is rarely happy).
It seems like every CI environment is using -std=c++17
now, so I'm concerned that the pre-17 support (the #else
branches) could bit-rot. I'm not sure what (if anything) to do about that. Maybe configure flag for which C++ standard to target? If unspecified, auto-detect as usual. Or could reach back to older images or compilers (rather than ubuntu-latest) but I'd hate to expand the build matrix further, especially with all the duplication we're seeing. (Looking for some ways to avoid duplication on push-to-feature-branch and sync-PR, such as https://github.com/fkirc/skip-duplicate-actions/ – maybe after self-hosted is merged, could look to reorganize or streamline some of the workflows.)
But I guess the principle is that if a configuration is supported, it would be best to continue testing it.
Some cleanup of the testing workflow would be nice once the self-hosted runner is in place. I'm trying to only have one runner running at a time so anything that would remove duplication would help.
With regards to the testing pre-17 support I generally agree that we should try to cover as much as we can with tests but I'm not sure that this particular change is worth a configure
flag for testing. Maybe this ultimately becomes more of a roadmap question and we need to establish how long we will support C++XX and PythonY.
I'm implementing a little C++ test suite that can be called from the python unittest. I wanted to gain some confidence about the <filesystem>
(and regex) stuff independently of its usage in map and proclog. I think I did find a small issue under certain contexts where remove_file_glob
crashes using the <filesystem>
implementation but works with the older ::system
one. Seems to have to do with dots in the dir name? like remove_file_glob("/tmp/test.1q25k/*.ptx")
.
terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
what(): filesystem error: directory iterator cannot open directory: No such file or directory [/tmp/test\.1q25k]
(Note the \.
) Not sure if it's triggered by the dot in $HOME/.bifrost
. Anyway, I'd like to have the failing test exposed before attempting a fix. Maybe a simpler interface would just be remove_files_with_suffix(dir, ".ptx")
then can check the suffix right in the loop without resorting to regex.
Hrm, that's unfortunate. The simpler interface is probably better since we don't really make use of a full glob. It's more like a "remove all files in this directory but not the directory itself".
"remove all files in this directory but not the directory itself".
At one place in map.cpp (validate_cache
), it's this:
remove_file_glob(_cachedir + "*.inf");
remove_file_glob(_cachedir + "*.ptx");
remove_file(_cachedir + BF_MAP_KERNEL_DISK_CACHE_VERSION_FILE);
and at the other site (clear_cache
), just these:
remove_file_glob(_cachedir + "*.inf");
remove_file_glob(_cachedir + "*.ptx");
I suppose the version file could be removed in clear_cache
too, as it would just get recreated when needed? So then these could both become remove_files_in(_cachedir)
?
One other oddity I just ran into: if compiling with CUDA10 and a gcc that supports c++17, I get:
> Building CUDA source file transpose.cu
> nvcc fatal : Value 'c++17' is not defined for option 'std'
> make[1]: *** [autodep.mk:52: transpose.o] Error 1
It's not a problem with CUDA11.
CUDA 11 supports C++17, CUDA 9 and 10 support C++14, and CUDA 8 supports C++11.
Update: afe65be now checks which standard CUDA supports and updates the NVCC flags accordingly. configure
will also fail if CUDA doesn't support at least C++11.
Here's a new one: with g++
7.5.0 configure
says that it supports C++17 but fileutils.cpp
fails to compile because it cannot find #include <filesystem>
. The man page for g++
says that it supports only up to C++1z. We could use the __cpp_lib_filesystem
feature test macro but that only works after g++ 5.
Merging #173 (3e2320f) into master (6074d0e) will not change coverage. The diff coverage is
n/a
.
@@ Coverage Diff @@
## master #173 +/- ##
=======================================
Coverage 58.37% 58.37%
=======================================
Files 67 67
Lines 5744 5744
=======================================
Hits 3353 3353
Misses 2391 2391
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 6074d0e...3e2320f. Read the comment docs.
Things seem ok on a couple of qblocks tests I ran.
Things seem ok on a couple of qblocks tests I ran.
I feel like c++17/filesystem is ready to merge. If there are odd compiler combos that become problematic, we can address it as needed.
I've taken care of the compiler strangeness that I've run into so I'm okay with merging. The only other thing we need is a blurb for the CHANGELOG.
Digging into this more it looks like the underlying problem is that AX_CXX_COMPILE_STDCXX
will accept -std=c++1z
as being an acceptable substitute for -std=c++17
. I think that's how we end up in a situation where configure
will say good to go and then we have a build failure. My new approach is a new config/filesystem.m4
test to explicitly check if we can compile/link with std::filesystem
.
Now it should work as intended.
I think this is now everything I want to put into the PR. The iffy area is C++20 support since we don't have good workflow coverage on that (it looks to only be hit on Build and Test/macos-latest).
Yeah, I'm looking into further testing with C++20. After some false starts, I'm finding a gcc11 that passes the configure checks:
checking whether g++ supports C++20 features with -std=c++20... yes
checking for C++ std::filesystem support... yes
checking for C++ std::string::ends_with support... yes
Will report in a few.
I have a successful compilation and test suite with C++20 using g++ 11.2.0 which uses <filesystem>
and ends_with
. This configuration is not part of CI, but it could be fairly easily if I just slightly rearrange the packages exported by my flake. Could be merged though.
Leaving some debugging notes here on tracing the issue with get_mtu
in Socket.hpp
(only used on Mac). We've seen it return zero in various configurations. Here's a small test program:
#include <iostream>
#include "Socket.hpp"
int main()
{
sockaddr_storage sa = Socket::address("127.0.0.1", 8123, AF_INET);
int mtu = Socket::discover_mtu(sa);
std::cout << mtu << std::endl;
return 0;
}
For the most part it seems like with clang it outputs 16384, and with gcc zero.
The 16384 is what I get on my laptop for 127.0.0.1. For 8.8.8.8 I get 1500. Both of those values match what I see in ifconfig
. 0 is just weird.
That was with clang 13.1. I'm trying to compile this test with gcc11 now.
Isolating the behavior differences further, the compilers are producing different results for the ::get_family
at the very top of get_mtu
. Here's my latest test program, where I'm extracting pieces out of Socket.hpp to find the difference...
#include <iostream>
#include "Socket.hpp"
int main()
{
sockaddr_storage sa = Socket::address("127.0.0.1", 8123, AF_INET);
// int mtu = Socket::discover_mtu(sa);
int fd = ::socket(AF_INET, SOCK_DGRAM, 0);
std::cout << "Socket on FD " << fd << std::endl;
sockaddr addr;
socklen_t len;
int ret = ::connect(fd, (sockaddr*)&sa, sizeof(sockaddr));
std::cout << "connect returned " << ret << std::endl;
ret = ::getsockname(fd, &addr, &len);
std::cout << "getsockname returned " << ret << std::endl;
sockaddr_in* ina = reinterpret_cast<sockaddr_in*> (&addr);
std::cout << "sockaddr_in* is " << ina << std::endl;
std::cout << "FAMILY " << (short)ina->sin_family << std::endl;
// std::cout << mtu << std::endl;
return 0;
}
g++ seems to always say FAMILY 0
and clang is getting different values, such as FAMILY 175
, 63
, 207
. I wonder if the reinterpret_cast
is dodgy. I'm not completely familiar with these sockaddr
structures and what's supposed to work. Right now, with this test, I can't distinguish between gcc 10 and gcc 11 though!
What about changing ina->sin_family
to addr.sa_family
?
Just tried addr.sa_family
and it always matches ina->sin_family
.
Wheeee, maybe I found the issue.... patch coming.
Ha ha, I hope so. I've been sitting here trying to figure out why I get two very different things between clang and gcc.
The nice thing about my discovery is that it is definitely something that could vary depending on the vagaries of compiler and version. You'll laugh, you'll cry. :) Just trying it out locally on the real python test (not just my local hacked up C++ test).
And it doesn't have to do with C++ references as in socket-reference
… AFAICT.
What's so annoying about this is that I did do that in get_mtu
but not get_family
.
Oh right! That somehow didn't jump out at me either… I didn't find it by comparing those two sections. It's just that the sin_family
or sa_family
was coming back different, though the ret
was always zero. And when sa_family
was wrong, it could be different values on different runs. Once I isolated it to that function, I decided maybe I should actually read the manual for getsockname
! :shrug: Along the way, I actually verified (by writing a little hex dump function) that all 128 bytes of sockaddr_storage
were getting passed around (by value) without loss or corruption. Never did get gdb
to work properly with my test program on mac.
Thought I'd convert this branch to a PR, so I can see and gather the changes better.