ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

Problems encountered while installing and running 'make -j' #154

Closed yushilingco closed 2 years ago

yushilingco commented 2 years ago

$ make -j
udp_capture.cpp:87:32: fatal error: mellanox/vma_extra.h: No such file or directory
compilation terminated.
autodep.mk:32: recipe for target 'udp_capture.o' failed
make[1]: *** [udp_capture.o] Error 1
...
ptxas /tmp/tmpxft_000029f6_00000000-7_fft.compute_35.ptx, line 2379; error : Illegal operand type to instruction 'ld'
ptxas /tmp/tmpxft_000029f6_00000000-7_fft.compute_35.ptx, line 2379; error : Unknown symbol '__unnamed_1_param_0'
ptxas /tmp/tmpxft_000029f6_00000000-7_fft.compute_35.ptx, line 2747; error : Illegal operand type to instruction 'ld'
ptxas /tmp/tmpxft_000029f6_00000000-7_fft.compute_35.ptx, line 2747; error : Unknown symbol '__unnamed_2_param_0'
ptxas fatal : Ptx assembly aborted due to errors
autodep.mk:56: recipe for target 'fft.o' failed
make[1]: *** [fft.o] Error 255
make[1]: Leaving directory '/home/yushiling/software/bifrost/bifrost-master/src'
Makefile:15: recipe for target 'libbifrost' failed
make: *** [libbifrost] Error 2

I encountered the above problems during installation. I tried to install the missing package, but it seems to be incomplete. Could you tell me what is causing this? Thank you very much.

telegraphic commented 2 years ago

Hi @yushilingco, we are looking at a better build system to catch errors like this (some of the packet capture code needs Mellanox libvma, and the build isn't able to find it on your system).

The quickest way to get it to compile (for now) would be to delete or comment out these lines in https://github.com/ledatelescope/bifrost/blob/master/src/Makefile#L16

  udp_socket.o \
  udp_capture.o \
  udp_transmit.o \ 
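If you prefer to script this workaround, the sketch below deletes those three entries with GNU sed. The helper name drop_udp_objects is hypothetical, and it assumes each object sits on its own line as shown above. Deleting is the safer of the two options: in GNU make a '#' comment runs to the end of the logical line and a trailing backslash continues it, so commenting out one entry in the middle of a backslash-continued list can silently drop every entry after it.

```shell
#!/bin/bash
# Hypothetical helper: remove the udp_*.o entries from a makefile object list.
# Assumes one entry per line, optionally followed by a continuation backslash.
drop_udp_objects() {
  local makefile="${1:-src/Makefile}"
  # Delete whole lines that contain only udp_socket.o, udp_capture.o or udp_transmit.o.
  sed -i -E '/^[[:space:]]*udp_(socket|capture|transmit)\.o([[:space:]]*\\)?[[:space:]]*$/d' "$makefile"
}

# Usage, from the top of the Bifrost source tree:
#   drop_udp_objects src/Makefile && make -j
```

Because the remaining lines keep their own trailing backslashes, the object list stays a single valid continuation after the deletion.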
yushilingco commented 2 years ago

Thank you for your reply. But what should I do if I want to use the UDP capture functionality? Do I need to set up anything in my environment before I try again? Thank you very much!

jaycedowell commented 2 years ago

For the missing mellanox/vma_extra.h you could try modifying the user.mk file and commenting out the line that says "VMA = 1 # Enable use of Mellanox..." so that Bifrost doesn't try to use it. Otherwise you should be able to install vma_extra.h from the libvma-dev package that is part of the Mellanox OFED bundle.
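The first option can also be applied non-interactively. The sketch below uses GNU sed and a hypothetical helper name, and it assumes the setting appears at the start of a line as "VMA = 1" (the pattern also accepts ?= or :=).

```shell
#!/bin/bash
# Hypothetical helper: comment out the VMA setting so Bifrost is built without libvma.
disable_vma() {
  local mkfile="${1:-user.mk}"
  # Prefix "#" to an uncommented line that assigns VMA with =, ?= or :=.
  sed -i -E 's/^[[:space:]]*VMA[[:space:]]*[?:]?=.*/#&/' "$mkfile"
}

# Usage, from the top of the Bifrost source tree:
#   disable_vma user.mk && make -j
```

A line that is already commented out does not match the pattern, so running the helper twice is harmless.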

yushilingco commented 2 years ago

Thank you for your reply, I get it. Does that mean I can only use NVIDIA's network cards? Would an Intel network card work? If you have any suggestions, please let me know. Thank you!

yushilingco commented 2 years ago

I have successfully installed libvma, so the first problem is solved. But do you have any suggestions for the second one?

ptxas /tmp/tmpxft_000031be_00000000-7_fft.compute_35.ptx, line 2379; error : Illegal operand type to instruction 'ld'
ptxas /tmp/tmpxft_000031be_00000000-7_fft.compute_35.ptx, line 2379; error : Unknown symbol '__unnamed_1_param_0'
ptxas /tmp/tmpxft_000031be_00000000-7_fft.compute_35.ptx, line 2747; error : Illegal operand type to instruction 'ld'
ptxas /tmp/tmpxft_000031be_00000000-7_fft.compute_35.ptx, line 2747; error : Unknown symbol '__unnamed_2_param_0'
ptxas fatal : Ptx assembly aborted due to errors
autodep.mk:56: recipe for target 'fft.o' failed
make[1]: *** [fft.o] Error 255
make[1]: Leaving directory '/home/yushiling/software/bifrost/bifrost-master/src'
Makefile:15: recipe for target 'libbifrost' failed
make: *** [libbifrost] Error 2

jaycedowell commented 2 years ago

To your first question, no, you don't have to have a Mellanox card to use Bifrost (I've used it with Intel cards before). However, you will get better packet capture performance with a Mellanox card and libvma enabled.

For your second question, it looks like a problem building the CUDA portions of the library. What GPU and version of CUDA are you using?

yushilingco commented 2 years ago

Thank you very much for your reply. That's great. Here is the output of running "nvidia-smi":

NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4

I also suspected that the CUDA version might be causing the problem, so I tried installing CUDA 8.0, 9.0, 10.0 and 11.1, but each ran into different errors and the previous errors still appear. There is also a new message:

nvlink fatal : Input file 'fft_kernels.o' newer than toolkit (111 vs 91) (target: sm_35)

jaycedowell commented 2 years ago

Ok, and what model of GPU are you using?

yushilingco commented 2 years ago

Nvidia RTX 3080

jaycedowell commented 2 years ago

In that case try changing the GPU_ARCHS line in the user.mk file to "GPU_ARCHS ?= 86" to target your RTX 3080.
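The same edit as a scripted one-liner, assuming user.mk has a single uncommented GPU_ARCHS assignment at the start of a line and that GNU sed is available; set_gpu_archs is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: rewrite the GPU_ARCHS line to target the given compute capabilities.
set_gpu_archs() {
  local mkfile="$1"; shift
  # Keep the "GPU_ARCHS ?=" prefix and replace everything after it with the arguments.
  # Variables such as GPU_ARCHS_VALID are not touched because "_" breaks the match.
  sed -i -E "s/^([[:space:]]*GPU_ARCHS[[:space:]]*[?:]?=).*/\1 $*/" "$mkfile"
}

# Usage for an RTX 3080 (compute capability 8.6):
#   set_gpu_archs user.mk 86
```

Several architectures can be passed at once, e.g. set_gpu_archs user.mk 75 86, which writes them space-separated.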

yushilingco commented 2 years ago

Yes, great. There are fewer and fewer errors now, but some .o files still fail to build:

nvcc fatal : Unsupported gpu architecture 'compute_' Makefile:162: recipe for target 'fft_kernels.o' failed

Is it the problem with my CUDA version?

jaycedowell commented 2 years ago

No, CUDA 11.4 should be fine for that architecture/compute capability. You can check with a:

nvcc -h | grep -Po "compute_[0-9]{2}" | sort | uniq

to see if "compute_86" is listed. However, that error would indicate otherwise. What you might try is removing the .SILENT: line from src/Makefile and re-running make to see what the full compiler call is for fft_kernels.cu.
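If you want a yes/no answer instead of reading the list, a small wrapper around the same grep works. The helper name is hypothetical; it reads nvcc's help text on standard input so it can be tried without a GPU, and it uses grep -E with [0-9]+ so a three-digit capability would also be found.

```shell
#!/bin/bash
# Hypothetical helper: succeed if "compute_<N>" appears in the help text on stdin.
# Typical use:  nvcc -h | has_compute_arch 86 && echo "compute_86 supported"
has_compute_arch() {
  grep -Eo 'compute_[0-9]+' | cut -d_ -f2 | sort -u | grep -qx "$1"
}
```

The final grep -qx requires an exact match, so asking for 8 will not be satisfied by 80 or 86.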

yushilingco commented 2 years ago

Yes, "compute_86" is listed.

$ sudo make
make -C src all
make[1]: Entering directory '/home/yushiling/software/bifrost/bifrost-master/src'
nvcc -O3 -Xcompiler "-Wall" -std=c++11 -Xcompiler "-fPIC" -gencode arch=compute,\"code=compute_\" -DBF_GPU_SHAREDMEM=49152 -g -G -Xcompiler "-march=native" -DBF_DEBUG=1 -DBF_TRACE_ENABLED=1 -DBF_NUMA_ENABLED=1 -DBF_HWLOC_ENABLED=1 -DBF_VMA_ENABLED=1 -DBF_ALIGNMENT=4096 -DBF_CUDAENABLED=1 -I. -I. -I/usr/local/cuda/include -Xcompiler "-fmessage-length=80 " -c -o transpose.o transpose.cu
nvcc fatal : Unsupported gpu architecture 'compute'
autodep.mk:56: recipe for target 'transpose.o' failed
make[1]: *** [transpose.o] Error 1
make[1]: Leaving directory '/home/yushiling/software/bifrost/bifrost-master/src'

jaycedowell commented 2 years ago

Ok, yeah, it looks like there is a problem in src/Makefile when it tries to set the values of GPU_ARCHS_SUPPORTED, GPU_ARCHS_VALID, and GPU_ARCH_LATEST. I would try manually running the shell commands that are used to define those variables. It's almost like GPU_ARCHS_VALID and GPU_ARCH_LATEST are empty.

yushilingco commented 2 years ago

Thank you very much for your patience. I tried to install Bifrost once before and it failed, so this is my second installation. Could that be affecting it?

jaycedowell commented 2 years ago

That should be fine.

yushilingco commented 2 years ago

Also, my system is Ubuntu 18.04 with kernel 5.4.0-89-generic and gcc 5.5.0. Could the problem be caused by an incompatibility between these versions?

jaycedowell commented 2 years ago

It shouldn't be. I run on Ubuntu 18.04 albeit with a slightly older 4.15 kernel and a newer version of gcc (7.5).

What did you find from looking at how the values of GPU_ARCHS_VALID and GPU_ARCH_LATEST are set?

jaycedowell commented 2 years ago

You could also try building from the autoconf branch of Bifrost to see if the new build system works for you. Once on that branch you would run configure before you run make.

yushilingco commented 2 years ago

I'm sorry, I can't find GPU_ARCHS_VALID and GPU_ARCH_LATEST. Also, could you provide a configure script for testing? Thanks a lot.

jaycedowell commented 2 years ago

How GPU_ARCHS_VALID and GPU_ARCH_LATEST are defined is in src/Makefile. All you should need to do is convert the make shell calls into normal shell calls. Something like the following bash script:

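#!/bin/bash

# Architecture(s) requested in user.mk
GPU_ARCHS=86
# Compute capabilities known to this nvcc
GPU_ARCHS_SUPPORTED=$(nvcc -h | grep -Po "compute_[0-9]{2}" | cut -d_ -f2 | sort | uniq)
# Requested archs that nvcc also supports (uniq -d keeps entries present in both lists)
GPU_ARCHS_VALID=$(echo "${GPU_ARCHS} ${GPU_ARCHS_SUPPORTED}" | xargs -n1 | sort | uniq -d | xargs)
# Last entry of the valid list
GPU_ARCH_LATEST=$(echo "${GPU_ARCHS_VALID}" | rev | cut -d' ' -f1 | rev)
echo "Supported: ${GPU_ARCHS_SUPPORTED}"
echo "Valid: ${GPU_ARCHS_VALID}"
echo "Latest: ${GPU_ARCH_LATEST}"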

For the configure test you'll need to check out the repository with git and then switch from the master branch to the autoconf branch to access it.

yushilingco commented 2 years ago

The results are as follows:

Supported: 30 32 35 37 50 52 53 60 61 62 70 72 75 80 86
Valid: 86
Latest:
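One way to separate the selection logic from nvcc is to feed the reported Supported list through the same intersection steps offline. The sketch below wraps those steps in hypothetical helpers (valid_archs and latest_arch are not names from Bifrost). With the list above it yields 86 for both values, so an empty "Latest:" line points at how the value was printed; note that the two names differ by one letter (GPU_ARCH_LATEST vs GPU_ARCHS_LATEST). An empty valid list, by contrast, would leave nothing after compute_ in the -gencode flag, which would be consistent with the "Unsupported gpu architecture" error earlier in the thread.

```shell
#!/bin/bash
# Hypothetical helpers mirroring the GPU_ARCHS_VALID / GPU_ARCH_LATEST logic, but taking
# the supported list as an argument so they can be checked without nvcc.

# Requested archs that also appear in the supported list. uniq -d keeps only duplicated
# lines, so this assumes neither list repeats an entry on its own.
valid_archs() {  # $1 = requested, $2 = supported
  echo "$1 $2" | xargs -n1 | sort | uniq -d | xargs
}

# Last entry of a space-separated list (the highest one for two-digit values sorted as text).
latest_arch() {
  echo "$1" | rev | cut -d' ' -f1 | rev
}

supported="30 32 35 37 50 52 53 60 61 62 70 72 75 80 86"
valid=$(valid_archs "86" "$supported")
echo "Valid: ${valid}"
echo "Latest: $(latest_arch "$valid")"
```

Requesting an architecture nvcc does not list, such as 89 with this toolkit, makes valid_archs print an empty line.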

jaycedowell commented 2 years ago

The only other thing I can suggest is that you modify src/Makefile and explicitly set GPU_ARCHS_VALID and GPU_ARCH_LATEST to 86 instead of the shell calls that are in there now.
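A possible alternative to editing src/Makefile, sketched under the assumption that Bifrost's Makefiles are processed by GNU make: variables given on the make command line override ordinary assignments in the makefile, including ones computed with $(shell ...), and they are passed down to recursive make -C calls through MAKEFLAGS. The self-contained demo below uses a throwaway two-level Makefile in a temporary directory, not Bifrost's own files.

```shell
#!/bin/bash
# Demonstrates that a command-line assignment overrides a $(shell ...) result in a
# recursively invoked makefile, using a throwaway layout in a temp directory.
demo=$(mktemp -d)
mkdir -p "$demo/src"

# Top level recurses into src/, similar in spirit to "make -C src all".
printf 'all:\n\t$(MAKE) --no-print-directory -s -C src all\n' > "$demo/Makefile"

# src/Makefile derives the value from a shell call that, here, yields nothing.
printf 'GPU_ARCH_LATEST := $(shell true)\nall:\n\t@echo "latest=[$(GPU_ARCH_LATEST)]"\n' \
  > "$demo/src/Makefile"

default_out=$(make --no-print-directory -s -C "$demo" all)
override_out=$(make --no-print-directory -s -C "$demo" all GPU_ARCH_LATEST=86)
echo "$default_out"   # latest=[]
echo "$override_out"  # latest=[86]
rm -rf "$demo"
```

For Bifrost the equivalent would be running make GPU_ARCHS_VALID=86 GPU_ARCH_LATEST=86 from the top of the source tree, assuming src/Makefile does not mark those variables with the override directive; editing the file directly works either way.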

yushilingco commented 2 years ago

Thank you for your advice. I want to check whether the contents of my user.mk file are correct. Can you tell me what the empty entries mean? Thank you!

CXX ?= g++
NVCC ?= nvcc
LINKER ?= g++
CPPFLAGS ?=
CXXFLAGS ?= -O3 -Wall -pedantic
NVCCFLAGS ?= -O3 -Xcompiler "-Wall" #-Xptxas -v
LDFLAGS ?=
DOXYGEN ?= doxygen
PYBUILDFLAGS ?=

jaycedowell commented 2 years ago

Nothing, i.e., no additional flags are specified.

yushilingco commented 2 years ago

So it doesn't need to be modified and there's no problem, right? Sorry, I didn't make it clear. What I want to ask is: what do I need to fill in to match my machine?

jaycedowell commented 2 years ago

No, these shouldn't need to be modified.

yushilingco commented 2 years ago

I am eager to use this framework for real-time processing, but because compilation has failed many times I haven't been able to install it and can't go any further. May I take the liberty of asking whether you could provide an ISO image with Bifrost already installed? Thank you very much!

jaycedowell commented 2 years ago

You can always try our Docker containers or AWS images. The instructions for them can be found here: https://github.com/ledatelescope/bifrost_tutorial

Did you try building from the autoconf branch?

yushilingco commented 2 years ago

Thank you. Yes, but I don't know exactly what to do. If it's convenient for you, could you tell me how to do it?

jaycedowell commented 2 years ago

Sure, try a:

git clone https://github.com/ledatelescope/bifrost.git bifrost_autoconf
cd bifrost_autoconf
git checkout autoconf
./configure
make -j
sudo make install
yushilingco commented 2 years ago

That's great, but I can't import successfully:

import bifrost
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yushiling/anaconda3/lib/python3.8/site-packages/bifrost/__init__.py", line 37, in <module>
    from bifrost import core, memory, affinity, ring, block, address, udp_socket
  File "/home/yushiling/anaconda3/lib/python3.8/site-packages/bifrost/core.py", line 32, in <module>
    from bifrost.libbifrost import _bf
  File "/home/yushiling/anaconda3/lib/python3.8/site-packages/bifrost/libbifrost.py", line 41, in <module>
    import bifrost.libbifrost_generated as _bf
  File "/home/yushiling/anaconda3/lib/python3.8/site-packages/bifrost/libbifrost_generated.py", line 816, in <module>
    _libs["bifrost"] = load_library("bifrost")
  File "/home/yushiling/anaconda3/lib/python3.8/site-packages/bifrost/libbifrost_generated.py", line 541, in __call__
    raise ImportError("Could not load %s." % libname)
ImportError: Could not load bifrost.

I see. It works if I switch to Python 2.7. Does Bifrost not support Python 3.8?

yushilingco commented 2 years ago

When I run a simple pipeline example, an error is reported:

File "/home/yushiling/anaconda3/envs/learn/lib/python2.7/site-packages/bifrost/blocks/transpose.py", line 73, in on_data
    bf.transpose.transpose(ospan.data, ispan.data, self.axes)
AttributeError: 'module' object has no attribute 'transpose'

jaycedowell commented 2 years ago

It's good to hear that you were able to build and install using the autoconf branch. I think the most recent bf.transpose.transpose error is a bug in the Python high level blocks.

yushilingco commented 2 years ago

Is there any solution?

jaycedowell commented 2 years ago

Try applying the changes in this commit: https://github.com/ledatelescope/bifrost/pull/153/commits/8c035685985a5a419cda851831df8fba0364130d

yushilingco commented 2 years ago

Thank you very much. I've looked for it, but I seem to have missed this information. I'll try it.

yushilingco commented 2 years ago

Can I ask how to receive UDP packets in real time and copy them to the GPU for processing?

jaycedowell commented 2 years ago

At this point you should take a look at the Bifrost tutorials: https://github.com/ledatelescope/bifrost_tutorial

Any additional questions or problems you have that are not related to installation should be submitted as new issues.

yushilingco commented 2 years ago

OK, I see. Thank you again for your patience. I have no problem with the installation now!

jaycedowell commented 2 years ago

Closing with the release of v0.10.0.