leptonai / gpud

Apache License 2.0

gpud scan (dmesg) fills the stderr with kernel logs, and fails with unknown flags #32

Closed eicca closed 4 days ago

eicca commented 3 weeks ago

Hi again :)

I'm running

sudo gpud scan 

And it works well, except that after displaying the results of the scan, it shows the kernel log from dmesg. This makes it hard to read the scan report, so I use just this for now:

sudo ./gpud scan 2>/dev/null
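
A variant of this workaround keeps stderr in a file instead of discarding it, so the error text is still available if the scan fails. This is only a sketch: run_keep_stderr and the log file name are hypothetical and not part of gpud.

```shell
# Hypothetical wrapper (not part of gpud): run a command, show its stdout
# as usual, and write its stderr to a log file instead of /dev/null.
run_keep_stderr() {
  local log_file="$1"   # first argument: where stderr should go
  shift                 # remaining arguments: the command to run
  "$@" 2>"$log_file"
}
```

For example: run_keep_stderr gpud-scan.stderr.log sudo ./gpud scan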

It would be amazing if there were an option to not display dmesg by default.

Thanks in advance :heart:

gyuho commented 3 weeks ago

gpud scan might have failed, thus dumping all the output to your terminal.

Could you share the error message for your gpud scan command?

And regardless, https://github.com/leptonai/gpud/pull/33 will remove that verbose output from the scan command.

Thanks for the report.

eicca commented 2 weeks ago

The gpud scan does all the steps well until the

⌛ scanning dmesg for 5000 lines

And then it errors there with:

gpud: failed to execute command: exit status 1 ([    0.000000] kernel: Linux version blah blah and other kernel details

Unfortunately, it doesn't say why it fails or which command it was running.

On my local GPU machine I don't get this error.

gyuho commented 2 weeks ago

> On my local GPU machine I don't have such error.

@eicca Can you share the output of gpud --version?

gyuho commented 2 weeks ago

We fixed this a few releases ago, and https://github.com/leptonai/gpud/releases/tag/v0.0.1-alpha7 has been released :)

Please give it a try and let us know if things are still breaking :)

eicca commented 2 weeks ago

Hi, I tried the new release and the kernel log doesn't fill the terminal anymore :+1:

However, the dmesg scan error is still there:

⌛ scanning dmesg for 5000 lines
{"level":"warn","ts":"2024-08-29T13:28:24Z","caller":"process/process.go:221","msg":"command exited with non-zero status","error":"exit status 1","cmd":"/usr/bin/bash /tmp/tmpbash3729988194.bash","exitCode":1}
{"level":"warn","ts":"2024-08-29T13:28:24Z","caller":"process/process.go:228","msg":"process exited with error","error":"exit status 1"}
exit status 1

Version:

gpud version v0.0.1-alpha7

gyuho commented 2 weeks ago

@eicca Thanks for the confirmation.

I suspect this is an OS-specific error.

Could you share the output of this command?

sudo dmesg --ctime --nopager --buffer-size 163920 --since '1 hour ago'

(This is what the gpud scan command runs.)

eicca commented 2 weeks ago

I get this error:

> sudo dmesg --ctime --nopager --buffer-size 163920 --since '1 hour ago'
dmesg: unrecognized option '--since'

I guess in this version of dmesg this flag is not yet implemented.

> dmesg --version
dmesg from util-linux 2.34

On my local GPU machine with a newer Ubuntu it works well. The dmesg version there is dmesg from util-linux 2.39.3.

I think one option would be to use journalctl as a fallback, something like this:

sudo journalctl -k --since "$(date --date='1 hour ago' '+%Y-%m-%d %H:%M:%S')" --no-pager
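
The fallback could be wired up roughly as follows. This is a minimal sketch, not gpud's actual code: the function names are hypothetical, and it assumes that a dmesg supporting --since lists the flag in its --help output.

```shell
# Hypothetical helpers (not gpud's implementation). Run as root.

# Succeeds if the installed dmesg mentions --since in its help text.
dmesg_supports_since() {
  dmesg --help 2>&1 | grep -q -- '--since'
}

# Print the last hour of kernel messages, preferring dmesg and
# falling back to journalctl when dmesg lacks --since.
kernel_log_last_hour() {
  if dmesg_supports_since; then
    dmesg --ctime --nopager --since '1 hour ago'
  else
    journalctl -k --no-pager \
      --since "$(date --date='1 hour ago' '+%Y-%m-%d %H:%M:%S')"
  fi
}
```

Probing for the flag rather than comparing version numbers avoids hard-coding which util-linux release introduced --since.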

gyuho commented 2 weeks ago

> I guess in this version of dmesg this flag is not yet implemented.

Oh makes sense. Will look into this, thanks for the confirmation!

gyuho commented 1 week ago

@eicca We've just released https://github.com/leptonai/gpud/releases/tag/v0.0.1-alpha8.

Please let us know if you find any more issues :)

gyuho commented 4 days ago

Closing. Please try https://github.com/leptonai/gpud/releases/tag/v0.0.1-alpha9 and feel free to reopen if there's still an issue.

eicca commented 4 days ago

Thanks @gyuho for the fix and sorry for the lack of reply! Will try a new version at some point!