Does not build on CURRENT

arrowd commented 4 years ago

I ran into some problems when trying to build nvshim on FreeBSD CURRENT.

First of all, there is no sys/sysinfo.h file anymore, so I had to remove that include directive from src/libc/sys/sysinfo.c. Second, I changed clang60 to clang80. Now I get

error: multiple symbol versions defined for shim__sys_errlist
error: multiple symbol versions defined for shim__sys_errlist
error: multiple symbol versions defined for shim__sys_errlist
error: multiple symbol versions defined for shim__sys_nerr
error: multiple symbol versions defined for shim__sys_nerr
error: multiple symbol versions defined for shim__sys_nerr
error: multiple symbol versions defined for shim__sys_siglist
error: multiple symbol versions defined for shim_clock_getcpuclockid
error: multiple symbol versions defined for shim_clock_getres
error: multiple symbol versions defined for shim_clock_gettime
error: multiple symbol versions defined for shim_clock_nanosleep

Any idea how to fix this?

shkhln commented 4 years ago

The error is due to a (relatively) recently introduced sanity check, see https://reviews.llvm.org/D45845. At the moment I have no idea how to work around that.

arrowd commented 4 years ago

Can you give an overview of how things work? Maybe I can come up with something.

shkhln commented 4 years ago

What am I doing with symver directives in the first place? They are used to export shim_sym symbols as sym@GLIBC_ver while specifically avoiding exporting so-called default symbols (sym@@GLIBC_ver). That way FreeBSD rtld will link Linux libraries to this shim and everything else to the normal libc. I wasn't able to get the same result with version scripts.

Why multiple versions? Glibc might have multiple different implementations of something, but we, obviously, do not. So it kind of makes sense to export some symbols multiple times.
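As a rough illustration of the directive described above, here is a minimal sketch. The names `shim_demo_strerror` and `demo_strerror`, the version node `SHIMDEMO_1.0` and the `SHIM_DEMO_EXPORT_VERSIONS` macro are all made up for this example and are not taken from the shim's sources. The `.symver` line is guarded because it is only meaningful when the file is built into a shared object on an ELF target, with a linker version script that declares the node.

```c
#include <assert.h>

/* Hypothetical shim-side implementation of a libc-like function. */
const char *shim_demo_strerror(int code) {
    assert(code >= 0); /* the sketch only handles non-negative codes */
    return code == 0 ? "Success" : "Unknown error";
}

#ifdef SHIM_DEMO_EXPORT_VERSIONS
/*
 * A single '@' requests a non-default version: the object exports
 * demo_strerror@SHIMDEMO_1.0 but no demo_strerror@@SHIMDEMO_1.0 default.
 * Assumes the version script passed to the linker declares SHIMDEMO_1.0.
 */
__asm__(".symver shim_demo_strerror, demo_strerror@SHIMDEMO_1.0");
#endif
```

Writing a second `.symver` line for the same `shim_demo_strerror` symbol is the pattern that the newer assembler check rejects with "multiple symbol versions defined".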

shkhln commented 4 years ago

multiple versions

For example:

% readelf -s /compat/linux/lib/libc-2.17.so | grep environ | grep GLIBC
   308: 00000000001c8de0     4 OBJECT  WEAK   DEFAULT   35 _environ@@GLIBC_2.0 (2)
  1039: 00000000001c8de0     4 OBJECT  WEAK   DEFAULT   35 environ@@GLIBC_2.0 (2)
  1395: 00000000001c8de0     4 OBJECT  GLOBAL DEFAULT   35 __environ@@GLIBC_2.0 (2)

arrowd commented 4 years ago

So, this shim is used to fulfill the dependency on Linux libc? Or does it intercept calls to Linux libc functions and re-route them to FreeBSD ones?

shkhln commented 4 years ago

So, this shim is used to fulfill the dependency on Linux libc?

Yeah, the "nv" part in the name is a bit of a misnomer. It's rather a reasonably generic, albeit very incomplete, glibc shim with Nvidia-specific run & setup scripts.

Or does it intercept calls to Linux libc functions and re-route them to FreeBSD ones?

There is no interception per se. If you export symbols as described above (that is, no defaults), rtld will happily route everything for you. It's really simple. Almost unbelievably so.

arrowd commented 4 years ago

I see, thanks. But is that safe? I bet many FreeBSD libc functions operate differently than the Linux ones. This might cause problems, no?

shkhln commented 4 years ago

Other than the bugs in the implementation itself (there are plenty of those, of course), this should be reasonably safe as long as the loaded Linux libraries:

do not use direct syscalls;
do not pass libc data structures and/or constant values to FreeBSD libraries through their own API, in which case they would skip our conversions.

So, proceed with caution.
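To make the second condition more concrete, here is a minimal sketch of the kind of constant conversion involved, using a handful of open(2) flags. The helper `demo_linux_to_fbsd_open_flags` and the `LINUX_O_*` / `FBSD_O_*` names are hypothetical and this is not the shim's actual code; the numeric values are the x86-64 Linux and FreeBSD ones, and flags missing from the table are simply dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Linux (x86-64) open(2) flag values. */
enum {
    LINUX_O_CREAT    = 0x0040,
    LINUX_O_EXCL     = 0x0080,
    LINUX_O_TRUNC    = 0x0200,
    LINUX_O_APPEND   = 0x0400,
    LINUX_O_NONBLOCK = 0x0800
};

/* FreeBSD open(2) flag values. */
enum {
    FBSD_O_NONBLOCK = 0x0004,
    FBSD_O_APPEND   = 0x0008,
    FBSD_O_CREAT    = 0x0200,
    FBSD_O_TRUNC    = 0x0400,
    FBSD_O_EXCL     = 0x0800
};

/* Hypothetical helper: translate a Linux flag word into FreeBSD values. */
int demo_linux_to_fbsd_open_flags(int linux_flags) {
    static const struct { int from; int to; } map[] = {
        { LINUX_O_CREAT,    FBSD_O_CREAT    },
        { LINUX_O_EXCL,     FBSD_O_EXCL     },
        { LINUX_O_TRUNC,    FBSD_O_TRUNC    },
        { LINUX_O_APPEND,   FBSD_O_APPEND   },
        { LINUX_O_NONBLOCK, FBSD_O_NONBLOCK }
    };
    assert(linux_flags >= 0);

    /* O_RDONLY, O_WRONLY and O_RDWR are 0, 1 and 2 on both systems. */
    int out = linux_flags & 0x3;

    for (size_t i = 0; i < sizeof map / sizeof map[0]; i++) {
        if (linux_flags & map[i].from)
            out |= map[i].to;
    }
    return out;
}
```

Note that the Linux `O_TRUNC` value (0x0200) is numerically the FreeBSD `O_CREAT` value, so a flag word that bypasses the conversion is silently misread rather than rejected.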

arrowd commented 4 years ago

If GAS doesn't support multiple .symver directives, how does the original Linux libc end up with duplicate symbols?

shkhln commented 4 years ago

Hmm… Glibc seems to use __attribute__((alias(...))). Let's see whether that's suitable for us…
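One way the alias attribute can help, sketched under assumptions: each alias is a separate symbol, so each can carry exactly one `.symver` line. The names `shim_demo_impl`, `shim_demo_v1`, `shim_demo_v2`, `demo`, the `SHIMDEMO_*` version nodes and the `SHIM_DEMO_EXPORT_VERSIONS` macro are invented for the example; this is not necessarily what glibc or the shim does in detail. It assumes GCC or Clang on an ELF target.

```c
#include <assert.h>

/* The one real implementation (hypothetical). */
int shim_demo_impl(int x) {
    assert(x >= 0); /* the sketch only handles non-negative input */
    return x * 2;
}

/* Two additional names bound to the same code via the alias attribute. */
int shim_demo_v1(int x) __attribute__((alias("shim_demo_impl")));
int shim_demo_v2(int x) __attribute__((alias("shim_demo_impl")));

#ifdef SHIM_DEMO_EXPORT_VERSIONS
/*
 * Each alias gets exactly one version, so no single symbol has two
 * .symver lines. Only meaningful in a shared object whose version
 * script declares SHIMDEMO_1.0 and SHIMDEMO_2.0.
 */
__asm__(".symver shim_demo_v1, demo@SHIMDEMO_1.0");
__asm__(".symver shim_demo_v2, demo@SHIMDEMO_2.0");
#endif
```

Calling `shim_demo_v1(3)` and `shim_demo_v2(3)` runs the same function body, which is the point: one implementation, several exported versions.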

shkhln commented 4 years ago

Ok, try the latest commit.

while specifically avoiding exporting so-called default symbols (sym@@GLIBC_ver)

I rechecked that part, and it turns out default versions do not matter either way. I just dislike them for some reason I can't remember :/

arrowd commented 4 years ago

Yep, it builds! I had to remove all mentions of sys/sysinfo.h, though.

Now, I'm trying to use sglrun to execute a CUDA binary. Here's my attempt:

env LD_LIBRARY_PATH=/usr/home/arr/cuda101/var/cuda-repo-10-1-local-10.1.243-418.87.00/usr/local/cuda-10.1/lib64 ./sglrun ~/axpy
ld-elf.so.1: Shared object "ld-linux-x86-64.so.2" not found, required by "libcudart.so.10.1"

Added path to ld-linux-x86-64.so.2:

env LD_LIBRARY_PATH=/usr/home/arr/cuda101/var/cuda-repo-10-1-local-10.1.243-418.87.00/usr/local/cuda-10.1/lib64:/compat/linux/usr/lib64/ ./sglrun ~/axpy
ld-elf.so.1: /compat/linux/usr/lib64//librt.so.1: version FBSD_1.0 required by /usr/local/lib/libruby26.so.26 not found

The error message looks strange. Any idea what it means?

shkhln commented 4 years ago

You'll probably want to know that for CUDA there are some blocking issues on the kernel driver side.

The error message looks strange. Any idea what it means?

You are setting LD_LIBRARY_PATH a bit too early and it is getting picked up by a FreeBSD executable trying to run the script itself.
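For illustration only, here is a minimal sketch of how a launcher could keep the variable out of its own environment and hand it only to the final process. The helper `demo_env_with_library_path` is hypothetical and does not describe how sglrun is implemented.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns a newly allocated, NULL-terminated copy of
 * `env` in which any existing LD_LIBRARY_PATH entry is replaced by
 * "LD_LIBRARY_PATH=<path>". Other entries are borrowed, not duplicated.
 * Returns NULL if allocation fails.
 */
char **demo_env_with_library_path(char *const env[], const char *path) {
    static const char key[] = "LD_LIBRARY_PATH=";
    const size_t key_len = sizeof key - 1;
    size_t n = 0, k = 0;

    assert(env != NULL && path != NULL);
    while (env[n] != NULL)
        n++;

    char **out = malloc((n + 2) * sizeof *out);
    if (out == NULL)
        return NULL;

    /* Copy everything except a previous LD_LIBRARY_PATH setting. */
    for (size_t i = 0; i < n; i++) {
        if (strncmp(env[i], key, key_len) != 0)
            out[k++] = env[i];
    }

    size_t len = key_len + strlen(path) + 1;
    char *entry = malloc(len);
    if (entry == NULL) {
        free(out);
        return NULL;
    }
    snprintf(entry, len, "%s%s", key, path);

    out[k++] = entry;
    out[k] = NULL;
    return out;
}
```

The returned array would be passed as the third argument of execve(2) for the target program only, so the interpreter running the launcher script never sees the Linux library directories.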

arrowd commented 4 years ago

You'll probably want to know that for CUDA there are some blocking issues on the kernel driver side.

I didn't even reach that stage yet. Have you checked if things improved so far?

Stupid me. Now the error is

ld-elf.so.1: /usr/home/arr/projects/nvshim/build/lib64/nvshim.so: version GLIBC_PRIVATE required by /compat/linux/usr/lib64//librt.so.1 not found

shkhln commented 4 years ago

Have you checked if things improved so far?

They didn't.

Now the error is

You are not supposed to load /compat/linux/usr/lib64/librt.so.1. Unfortunately, since there is also /usr/lib/librt.so.1 I had to resort to binary patching to avoid conflicts and the corresponding LD_LIBMAP override is differently named. I admit this is a bit confusing.

arrowd commented 4 years ago

They didn't.

If I read it right, the problem is that the os_lock_user_pages Linux syscall is not implemented in the Linuxulator? Maybe a separate PR should be opened to track this? As for bug 224358, what should be done to close it?

You are not supposed to load /compat/linux/usr/lib64/librt.so.1. Unfortunately, since there is also /usr/lib/librt.so.1 I had to resort to binary patching to avoid conflicts and the corresponding LD_LIBMAP override is differently named.

Ouch. This is too many hacks, IMO. I think I'll try something else for my problem.

shkhln commented 4 years ago

If I read it right, the problem is that the os_lock_user_pages Linux syscall is not implemented in the Linuxulator?

Eh, it's a function in nvidia.ko kernel module, see nvidia/nvidia_os.c.

As for bug 224358, what should be done to close it?

Not my bug, so no opinion.

Ouch. This is too many hacks, IMO.

Is it? In any case, the kernel part is both more important and more difficult here. It makes sense to concentrate on that.

arrowd commented 4 years ago

I see, thanks for clearing this up. Let's close this issue, as nvshim now builds on CURRENT.

shkhln commented 4 years ago

You are not supposed to load /compat/linux/usr/lib64/librt.so.1. Unfortunately, since there is also /usr/lib/librt.so.1 I had to resort to binary patching to avoid conflicts and the corresponding LD_LIBMAP override is differently named.

Ouch.

FYI, I committed a workaround for this particular annoyance in c1633b68f97616a050a926a7b6de850a91d25331.

shkhln commented 4 years ago

@arrowd You might be interested in https://github.com/shkhln/revird-aidivn/commit/077197f6311efb0a1f7d88b84bb43845afa44d86. Note that this is meant to be used in combination with either a dummy nvidia-uvm kernel module or an equivalent LD_PRELOAD trick. Seems to pass a simple sanity check so far:

% env LD_PRELOAD=$PWD/dummy-uvm.so ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GTX 1660" with compute capability 7.5

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 542.34 GFlop/s, Time= 0.242 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

arrowd commented 4 years ago

This looks promising! Unfortunately, I don't have time to look at it right now.

Do you plan to upstream this patch into the FreeBSD ports tree?

shkhln commented 4 years ago

Do you plan to upstream this patch into the FreeBSD ports tree?

I don't have the patience. Plus it's a relatively quick and dirty patch job, so it's not necessarily appropriate for submission as is.

shkhln commented 4 years ago

(lldb) bt
* thread #1, name = 'linux_oceanFFT', stop reason = signal SIGBUS
  * frame #0: 0x00000008007c2a18 libcxxrt.so.1`vtable for __cxxabiv1::__si_class_type_info + 16
    frame #1: 0x0000000809cc5026 libstdc++.so.6`__dynamic_cast + 102
    frame #2: 0x0000000809d434c0 libstdc++.so.6`bool std::has_facet<std::ctype<char> >(std::locale const&) + 64
    frame #3: 0x0000000809d36ba4 libstdc++.so.6`std::basic_ios<char, std::char_traits<char> >::_M_cache_locale(std::locale const&) + 20
    frame #4: 0x0000000809d37020 libstdc++.so.6`std::basic_ios<char, std::char_traits<char> >::init(std::basic_streambuf<char, std::char_traits<char> >*) + 32
    frame #5: 0x0000000809cd8ab3 libstdc++.so.6`std::ios_base::Init::Init() + 595
    frame #6: 0x000000080083dba4 libcufft.so.8.0`___lldb_unnamed_symbol390$$libcufft.so.8.0 + 36
    frame #7: 0x0000000800a2b4e6 libcufft.so.8.0`___lldb_unnamed_symbol11991$$libcufft.so.8.0 + 70
    frame #8: 0x0000000800825ae3 libcufft.so.8.0
    frame #9: 0x00000008006a734c ld-elf.so.1
    frame #10: 0x00000008006a61d2 ld-elf.so.1

Something is wrong here.

arrowd commented 4 years ago

Does libcxxrt.so.1 come from base? It might be that libstdc++ should use libsupc++ or something like that.

shkhln commented 4 years ago

Yes, native libGLU.so.1 brings libcxxrt.so.1, which conflicts with libcufft.so.8.0. The program (oceanFFT from the CUDA demo suite) works with Linux libGLU.so.1, but that is not what we are interested in :)

arrowd commented 4 years ago

Compiling libGLU.so.1 with USE_GCC=yes should fix this problem, though that is a workaround rather than a real solution. No idea how to handle this properly.

shkhln commented 4 years ago

As far as I understand, native FreeBSD gcc- and clang-compiled c++ libraries are pretty safe to mix. I don't see why that should be different for a Linux c++ library in principle, considering it's the same ABI.

Compiling libGLU.so.1 with USE_GCC=yes should fix this problem, though that is a workaround rather than a real solution.

That works.

arrowd commented 4 years ago

As far as I understand, native FreeBSD gcc- and clang-compiled c++ libraries are pretty safe to mix.

In my experience, it has never been like that.

That works.

... and here's another proof of that.

shkhln commented 4 years ago

The funny thing is that this working libGLU.so.1 is compiled against libc++.so.1/libcxxrt.so.1 as well. I'll try to test this more thoroughly.

shkhln commented 4 years ago

working libGLU.so.1 is compiled against libc++.so.1/libcxxrt.so.1

Ok, I didn't pay attention and passed a wrong path to ldd. It's libstdc++.so.6, as it should be.

shkhln commented 4 years ago

Aha, the most relevant comment here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221288#c4. Looks like in practice ports are using this solution instead: https://markmail.org/message/dwgafctuoywpuhhr. Not applicable to our case, unfortunately.

arrowd commented 4 years ago

Just to clear things up: libstdc++ is pulled in by the Linux CUDA library and everything else is FreeBSD native code, right?

shkhln commented 4 years ago

Specifically, libcufft.so.8.0 is a Linux binary, as is the oceanFFT executable itself (I occasionally run Linux programs with the shim, it's convenient for testing). Everything else, including libstdc++, is native code.

arrowd commented 4 years ago

If the executable itself is a Linux one, where does the FreeBSD code come from? It should use everything from /compat/linux.

shkhln commented 4 years ago

Something like sglrun /libexec/ld-elf.so.1 /compat/linux/bin/glxgears. Changing the interpreter path (and "GNU" strings placed right after it) with a hex editor also works.

arrowd commented 4 years ago

Then how about making sglrun map native libcxxrt to Linux libsupc++?

shkhln commented 4 years ago

To be honest, I didn't even understand what is written here. Simply mapping libcxxrt to libsupc++ won't work: libstdc++.so requires "version CXXRT_1.0", and if you get rid of libstdc++.so, errors such as Undefined symbol "_ZNSt3__15ctypeIcE2idE" pop up, and so on. The Linux version of the library won't improve anything here either. Rebuilding libstdc++ without libsupc++ looks like the most reasonable idea so far.

shkhln commented 4 years ago

Although… We could map libcxxrt to our own set of hacks, add CXXRT_1.0 {} to the exports, and link all of that against libstdc++. Once the check for the presence of the exported version passes, rtld no longer cares at all which library it finds the symbols in. That is roughly what was done for librt.so.1.

arrowd commented 4 years ago

As I understand it, the problem is that libcufft.so.8.0 was built against libstdc++, which in turn used libsupc++. On FreeBSD we have libcxxrt instead of libsupc++, but it is not binary compatible.

We have no control over libcufft.so.8.0, so we either have to move everything from libc++ to libstdc++ (read: build everything with USE_GCC=yes) or somehow slip in libsupc++ in place of libcxxrt.

shkhln commented 4 years ago

It comes out something like this: 09fa1626a9e34ea7412fc32fe74ac18dfbfde083.

shkhln commented 4 years ago

On closer inspection, a whole bunch of libraries depend on CXXRT_1.0 from libcxxrt, i.e. this is actively used code that can't simply be replaced with a stub. That could probably be hacked around too, but I can't be bothered… (I have no idea whether the alternative hack, building a libstdc++ tied to libcxxrt, would be simpler. Quite possibly it would.)

In any case, the situation with the driver and the libraries is more or less clear. Is it worth doing anything further with this? I don't really see any interest in CUDA from the FreeBSD side, to be honest.

arrowd commented 4 years ago

I don't really see any interest in CUDA from the FreeBSD side, to be honest.

Well, there's nothing to see yet. Once CUDA is there, the interest will follow, and ports that use it will appear.

But stubbing out the C++ runtime is a bad idea, of course.

Is this libcufft.so.8.0 part of the CUDA distribution, or some third-party library that uses CUDA?

shkhln commented 4 years ago

Hmm… I don't mean user interest. Is there any chance of persuading someone to bring the driver patch into proper shape? That is really not my skill at all. (The license is also somewhat involved here: it simultaneously allows and forbids us to copy code from the Linux driver. That is interesting in its own way, although there is no great danger here.)

Is this libcufft.so.8.0 part of the CUDA distribution, or some third-party library that uses CUDA?

Part of the CUDA distribution, of course. The libcuda.so shipped with the driver is (presumably) a plain C library, but the various layers built on top of it are not.

arrowd commented 4 years ago

Which patch? Is this the one you mean? https://reviews.freebsd.org/D22521

Part of the CUDA distribution, of course. The libcuda.so shipped with the driver is (presumably) a plain C library, but the various layers built on top of it are not.

That leaves either running CUDA applications entirely in Linux, or building with USE_GCC=yes.

shkhln commented 4 years ago

No, this patch: https://github.com/shkhln/revird-aidivn/compare/master...afdiuxc.patch. It has already been mentioned here.

That leaves either running CUDA applications entirely in Linux, or building with USE_GCC=yes.

I think someone will build a "proper" libstdc++ if it is really needed. We can ignore this for now.

arrowd commented 4 years ago

No, this one: https://github.com/shkhln/revird-aidivn/compare/master...afdiuxc.patch. It has already been mentioned here.

Oh, I missed it. Well, I don't think we need to worry much about the license, since these are patches. What could the problem be here?

Pestering danfe will be the harder part, but I can take that on.

I think someone will build a "proper" libstdc++ if it is really needed. We can ignore this for now.

I'm not so sure about that. It's libc++ that can use different runtimes; I don't know about libstdc++.

shkhln commented 4 years ago

Well, I don't think we need to worry much about the license, since these are patches. What could the problem be here?

In places, Nvidia has stuck a header into files that forbids any use of the code. The root of the driver distribution contains a more or less normal license. As far as I understand, neither overrides the other. For the most part it's a very theoretical problem.

Pestering danfe will be the harder part, but I can take that on.

Indeed.

I'm not so sure about that. It's libc++ that can use different runtimes; I don't know about libstdc++.

I didn't pull this idea out of thin air; it comes from comments like this one: http://lists.llvm.org/pipermail/cfe-dev/2016-August/050278.html. Quote: "We solved this in FreeBSD by linking both libstdc++ and libc++ against libcxxrt." I don't know what happened to that solution.

There is also a bit about this here: https://wiki.freebsd.org/NewC++Stack.

shkhln / libc6-shim

Does not build on CURRENT #1