madMAx43v3r / chia-gigahorse


cuda_plot_k32 v2.0.0-e161e4b crashes on single Ctrl+c #3

Open Jacek-ghub opened 1 year ago

Jacek-ghub commented 1 year ago

Ubuntu 22.10; single E5-2695 v2, 256 GB RAM, 3060 Ti, k32 / C8 plots

It has already happened to me twice with the latest plotter / sink. I did a single Ctrl+C on the plotter to end it gracefully, but it immediately terminated the connection to plot_sink, sat idle for several seconds, and aborted. There were no error messages from the plotter. On the other hand, plot-sink killed the pending xfrs; however, it appears to be waiting for new jobs (so plot-sink is sound).

In case it matters where the plotter was in the process: it had just started phase 3 and spit out the first [P3] line.

Output from the plot-sink:

Started copy to /mnt/d3/mmx/plots/plot-mmx-k32-c8-2023-02-09-20-55-07393b1a764ac050b18c1f20a003aac6bb998128c9bf9a9153165a8ac1e74b9d.plot (71.2455 GiB)
recv() failed with: EOF
recv() failed with: EOF
recv() failed with: EOF
recv() failed with: EOF
Deleted /mnt/d3/mmx/plots/plot-mmx-k32-c8-2023-02-09-20-55-07393b1a764ac050b18c1f20a003aac6bb998128c9bf9a9153165a8ac1e74b9d.plot.tmp
Deleted /mnt/d2/mmx/plots/plot-mmx-k32-c8-2023-02-09-20-52-356e1890c30f9ee81468475623e836e5ba1d018a39b4f9e13d6f9e5837eb52fb.plot.tmp
Deleted /mnt/d1/mmx/plots/plot-mmx-k32-c8-2023-02-09-20-49-55cfa309baab5a50ba846ac192c697f9e8fa93edc801d79d5bfa96eb68f11789.plot.tmp
Deleted /mnt/d4/mmx/plots/plot-mmx-k32-c8-2023-02-09-20-46-6f60aee4fb1bfe4e981c82e5bb5e52bd2285effe79de22b72ac24cde1b2a9659.plot.tmp

In both cases the plotter / sink combo had been running for a few hours. It looks to me like this is a newly introduced problem.

By the way, could you add --version to plot-sink? Right now, instead of the version, it prints the --help text.

madMAx43v3r commented 1 year ago

You sure you did a single ctrl+c? Maybe a defective keyboard?

Jacek-ghub commented 1 year ago

Nah. That keyboard has been working for me for the past 50 years or so :)

It is a Microsoft wireless keyboard (maybe not the best, but not that bad overall). So far I have had no problems with it, and I also didn't have these problems with previous releases. It is not that I am doing it very often. I did it maybe 3-4 times with this release, and 2 out of those ended up stopping the plotter. The other time(s) it was very early in the plotting process (after a few minutes), and those worked fine (if that is any indication).

Actually, in both cases when it happened, the total SATA write speed had been draining away (say after 6-10 hours of straight plotting), so I wanted to break the process and restart both modules.

Although, I did run Ubuntu updates ;)

Maybe adding some output on double ctrl+c would help remove the ambiguity here.

madMAx43v3r commented 1 year ago

There is an output for double ctrl+c.

How did it "end" the plotter? segfault?

Jacek-ghub commented 1 year ago

There was no output on the plotter side at all (that's the reason I didn't provide it in the post above). It just sat quietly for a minute or so (plot-sink killed the xfrs right away), and then I got the prompt back. I didn't check the system logs, though.

madMAx43v3r commented 1 year ago

hmm strange... so it killed the process normally basically

Jacek-ghub commented 1 year ago

Yeah, it kind of looks like ctrl+c was not trapped in the code, so the default hard exit happened.

Jacek-ghub commented 1 year ago

OK, it happened again. Here is the output:

Started copy to @localhost:plot-mmx-k32-c8-2023-02-11-09-40-a0d9646ce30d9612549a99b9f26c107fdf7d70b91d929e0c1fbfde59d0e89ba8.plot
[P1] Table 3 took 12.71 sec, 4294705396 entries, 16791842 max, 66613 tmp, 3.77644 GB/s up, 6.68767 GB/s down
[P1] Table 4 took 14.865 sec, 4294457549 entries, 16786704 max, 66599 tmp, 5.38144 GB/s up, 8.0054 GB/s down
[P1] Table 5 took 14.445 sec, 4293774913 entries, 16784291 max, 66521 tmp, 5.53759 GB/s up, 7.06128 GB/s down
^C
bull@bull:~/mmx$

Looks like ctrl+c was not trapped again. Right after I pressed it, plot-sink terminated all xfrs. However, the plotter lingered for about 30 seconds on that ^C before it was gone. This is the third time I stopped it (after ~6 hours), and the same thing happened in all cases.

Maybe this is just a secondary issue. In all cases I was stopping it because the xfr to the HDs was draining away and the NVMe was filling up completely. I am trying to chase this problem but am having a rather hard time understanding what is going on.

In theory, an xfr to HD should take ~6-8 mins (Seagate 18 TB X18). This means that the plot-sink needs 2.5 drives to keep up with 175 sec plots (what I have). When I test 4 parallel mv commands on 4 different plots, I see SATA saturating at 600 MBps and staying there without a flinch. When I start the plotter and plot-sink, I see the same thing: roughly 2.5 drives being used. I specify 4 of those HDs as targets and keep the 5th as overflow (blocked by that chia-block file). However, the total HD write speed slowly drops over time. This time, it was down to 300 MBps (~60 MBps per drive across 5 drives). It takes about 6 or so hours to get to that point.

The previous time the speed degraded like that, I stopped plot-sink, let it finish, and tried to run 4x mv commands. Those were also really slow, in the 30 MBps range per drive. Once I killed the plotter, speeds immediately went up to ~165 MBps per drive, again saturating SATA.
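The 4x parallel mv test described above can be sketched as below. This is a minimal stand-in, with placeholder directories and tiny dummy files instead of the real /mnt/d1..d4 mounts and full-size plots:

```shell
#!/bin/sh
# Sketch of the 4x parallel mv test (placeholder paths, dummy 4 MiB files).
set -e
mkdir -p nvme d1 d2 d3 d4
for i in 1 2 3 4; do
    dd if=/dev/zero of="nvme/plot$i.plot" bs=1M count=4 2>/dev/null
done
# One mv per destination drive, all in parallel, like 4 separate shells:
for i in 1 2 3 4; do
    mv "nvme/plot$i.plot" "d$i/" &
done
wait
ls -l d1 d2 d3 d4
```

With real drives, watching iostat or bpytop while this runs first alongside the plotter and then after killing it would show the throughput difference being described.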

The box has 2 LSI SATA III controllers, one 2x and one 8x. Normally, I keep those 4-8 drives on the 8x controller. However, I also tried putting 1 drive on the 2x controller (the other port has the boot SSD), and that didn't make a difference. I also added a PCIe SATA III card and put 2 drives there (so only 2 drives per SATA controller), and that didn't make a difference. I used 3 different NVMes (WD Black 750 and Samsung 970 EVO Plus, all purchased before the cost reduction / quality lowering of those drives), and that didn't make a difference.

I boosted the plot-sink priority; that didn't make a difference (maybe it took longer to get to this point, not sure). I also turn off swap when plotting.

Here is the CPU usage on that box: (screenshot attached) My understanding is that the SATA speed degradation / accumulation of finished plots on the NVMe started around 8am (when there is a jump in CPU usage). I saw similar increases in CPU usage previously. If you would like to see different stats, let me know and I will try to dig them out. There is no monitor attached to the box; this time I only used ssh to connect to it. Before, I was also using xRDP. It runs Ubuntu 22.10 with the latest updates. System logs are clean.

I don't think that the SATA side is limiting; rather, the NVMe side is getting slow, or the plotter for whatever reason is not handling NVMe reads properly (those crashes when trying to exit gracefully are maybe telling here). The NVMe writes are OK, as the plot speeds don't degrade (sure, it stops when the NVMe is full, but this is rather irrelevant, as that is a secondary problem and handled properly).

The box is a T7610 with a single CPU and 256 GB RAM, so it has only 2x PCIe 3 x16 slots (per CPU). In one I have the GPU, in the other the PCIe NVMe card (the slots tied to the second CPU are dead, as I depopulated it). I could get a PCIe extender and purchase a 2x/4x NVMe PCIe card and use RAID0 on it. However, at this point I feel that maybe this is not the issue.

Any suggestions?

By the way, do you think you could open a "Discussions" tab, so we could move these types of discussions out of the "Issues" tab?

madMAx43v3r commented 1 year ago

Indeed it wasn't trapped... gonna try something and post an update..

Jacek-ghub commented 1 year ago

Yeah, I just killed it after ~15 plots, and it also crashed. So when it was working for me, it was most likely the previous version.

By the way,

  1. when you tested it on your h420 box, did you use Ubuntu?
  2. did you use RAID0 for -t? (not sure if it would improve anything having RAID0 on a single x16 PCIe slot)
  3. did you try both plotter and plot-sink on that box running it for several hours (to eventually also get to that degradation point)?
  4. was your setup headless, or the full desktop? (mine is desktop + xRDP)
madMAx43v3r commented 1 year ago

The signal trapping works for me on the newest version. However, try the attached version; I'm doing the signal handler differently there to see if it helps: cuda_plot_signal_test.zip

  1. Ubuntu 20.04
  2. single 970 PRO
  3. several days yes, but plot sink on other machine, copy over 10G fiber
  4. headless

It's normal for HDDs to slow down as they get full, especially the last plot will be very slow.

Jacek-ghub commented 1 year ago

It is not about slowing down when the HDs get full. I have had the same thing happening since I started with empty drives. Once I restart the plotter / sink, all works fine for another ~4-6 hours, and then it quickly starts slowing down. It repeats itself every time.

Jacek-ghub commented 1 year ago

Also, question about plotter.

  1. does it take a regex for -d (looks like not)?
  2. does it understand the chia_plot_block file?
madMAx43v3r commented 1 year ago
  1. no
  2. yes

Once I restart the plotter / sink, all works fine for another ~4-6 hours, and then it quickly starts slowing down.

Is the -t getting full over time?

Jacek-ghub commented 1 year ago

This new build you provided also does not trap ctrl+c for me (over ssh; but plot-sink traps it properly, and I think the plotter was trapping it in previous versions of cuda-plotter-k32).

Is the -t getting full over time?

Not really. It looks like it works with 2.5 drives for 6 hours or so (at more or less full SATA speed), and after that it starts rapidly degrading, slowly grabbing all 4 drives. When I add the 5th drive at that point, it doesn't change anything, as the per-drive speeds decrease.

For the non-cuda plotter, I was running it on Rocky Linux 9.2 (I am more familiar with CentOS distros), but for cuda I couldn't get the nvidia drivers to install properly on Fedora, so I gave up and installed Ubuntu. Maybe the problem is with Ubuntu 22.10.

madMAx43v3r commented 1 year ago

Hmm so strange... I have no ideas anymore...

Jacek-ghub commented 1 year ago

That ctrl+c not working is basically just a nuisance. If you would like to instrument the code better (more logs), I can give it a run at any time.

On the other hand, any suggestions about the HD speeds degrading after several hours of plotting? This one is rather nasty, as I have already had to break plotting 4-5 times on these 4 drives. I have about 1 TB left to fill them up right now, so I'll let it run at whatever speed it decides to use and will grab the next 4 drives right after.

By the way, do you think the 970 PRO offers any speed advantages over the 970 EVO Plus? My understanding is that depending on the benchmarks used, the EVO Plus looks a bit faster, but I would assume it is rather a wash. If that is the case, I also tried the WD Black 750, and it was the same thing. That rather says the NVMe is not the culprit. Although, I am still thinking about RAID0 on a single PCIe 3 x16 slot, just to get this part out of the way.

madMAx43v3r commented 1 year ago

970 PRO is 2-bit MLC, which is a lot better for plotting. TLC drives are very inconsistent in performance...

Jacek-ghub commented 1 year ago

I am having mixed results with the plotter recognizing the chia-block file. I just placed that file in a dst folder, and the plotter picked up that folder a minute later. On the other hand, I have also seen it being recognized. I am not sure, but it kind of looks like destination folders are queued before use (one/two levels deep?), and the file only kicks in when that queue is exhausted. At least it looks like that.

This was not an issue with plot-sink.

The block file still sits in that folder, and on the second wrap this folder was again used for the final xfr.

At this point: I blocked all the folders some time ago, and it doesn't look like cuda-plotter respects those blocks. For several cycles, it just picked those blocked folders for the next rounds of writes.

I am not sure after how many cycles, but cuda-plotter finally respected those blocking files.

It would be nice if this could be fixed.

Actually, when the plotter starts and tests the dst folders, my understanding is that it tries to see whether it has write access to those folders. Maybe it could be changed to first check whether the blocking files are there, and if so consider those folders good; if there are no blocking files, abort on non-accessible folders. The reasoning behind this is that the structure of the folder used as a mount point is (in my case) owned by root. However, the blocking file has read permissions for all. Once the drive is mounted, it is accessible for writes, and may or may not have that blocking file.

Actually, there is another problem with respecting those blocking files. When the file is removed from a dst folder, it takes really long to get that recognized, even though there are pending files on the NVMe. I think it may be tied to some plotting event (e.g., when the next plot becomes available for xfr). Maybe it is even worse: the backlog of waiting files doesn't go away; rather, the plotter hands one file to a drive whenever it finishes another plot, so those drives are starving for jobs while plots are sitting on the NVMe.

Jacek-ghub commented 1 year ago

Once I saw those xfrs blocked, I let the plotter fill up the NVMe. Once the NVMe was full, I let it sit for a couple of minutes (to quiet down). At that point, the CPU usage flatlined at 0%.

Then I started 4x shells with mv commands to the same drives used by the plotter. Those mv commands suffered from the same problem. The total transfer speed was ~250 MBps or so (3x SATA, 1x USB), and the NVMe read rate was at about the same level (i.e., no caching involved). The CPU usage went up to 20%. I let it run like that for a minute or so and then killed the plotter.

Once the plotter was killed, the total speed immediately jumped to ~800+ MBps (full SATA + ~150 MBps USB drive), and the CPU usage dropped to 10%. Also, the NVMe reads went up to 2 GBps, as the system was using the RAM cache.

I don't know how to explain it. Clearly, the cuda-plotter is central to this degradation. One thing it may be doing is locking up all the RAM, leaving virtually zero RAM available for xfrs. I don't know how the plotter could otherwise influence file operations (i.e., suppress the IO on the NVMe, excluding, of course, reads/writes with 1-byte buffers), so the RAM gobbled up by the plotter is maybe the only thing at play.

I took some screenshots and will upload them tomorrow.

madMAx43v3r commented 1 year ago

Yeah you're right, it's possible for copies to queue up and wait, and the block file is not checked again afterwards...

madMAx43v3r commented 1 year ago

Now I know why I didn't want to repeat all the work from plot sink in the plotter, it's damn hard to make it work in all the edge cases that can arise...

Jacek-ghub commented 1 year ago

Hey, no worries about it. Edge cases are always there. As someone said a long time ago, "there are no innocent people, only people who have not yet been interrogated." :)

My frustration is because I have never worked on the QA side and I don't know which part works in which module, so I sometimes get mixed up. I don't take notes during runs; I just reflect on it once I see those unexpected things. We just need some eyes on it running these cases, and in a few days it will be smooth sailing for everyone.

As I said before, you have the best products out there, so this is just a small cost of making them even better.

Also, if you want to add more output for debugging, I will be happy to run that code and report back.

madMAx43v3r commented 1 year ago

try the latest linux version, should be all fixed

madMAx43v3r commented 1 year ago

I don't know how to explain it. Clearly, the cuda-plotter is central to this degradation. One thing it may be doing is locking up all the RAM, leaving virtually zero RAM available for xfrs. I don't know how the plotter could otherwise influence file operations (i.e., suppress the IO on the NVMe, excluding, of course, reads/writes with 1-byte buffers), so the RAM gobbled up by the plotter is maybe the only thing at play.

It could be RAM fragmentation yes...

Jacek-ghub commented 1 year ago

Thank you. Will do.

Here is what I get for the ctrl+c:

[P3] Table 5 PDSK took 5.901 sec, 3531093033 entries, 13812604 max, 54870 tmp, 5.5111 GB/s up, 7.92243 GB/s down
^C^C^C^C^C^C
^C^C^C
^C^C^C
^C^C^C
^C^C^C
bull@bull:~/mmx$

It looks like the code was stuck on some blocking call for 30 secs or so, not responding to anything, and then crashed.

madMAx43v3r commented 1 year ago

It looks like the code was stuck on some blocking call for 30 secs or so, not responding to anything, and then crashed.

That blocking call is the kernel cleaning up memory, it takes a while to free 200G of RAM.

madMAx43v3r commented 1 year ago

Still very strange the signal trapping doesn't work for you; maybe the terminal changes the signal to SIGKILL or something?

You could attach gdb via sudo gdb -p PID and catch the signal to see what's what.

I'm catching SIGINT and SIGTERM.
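The suggested gdb session could be sketched like this (PID stays a placeholder for the plotter's process id; the handle settings make gdb report each signal without stopping, while still delivering it to the plotter):

```
# attach to the running plotter
sudo gdb -p PID

# inside gdb: report these signals but keep delivering them to the process
(gdb) handle SIGINT  nostop print pass
(gdb) handle SIGTERM nostop print pass
(gdb) handle SIGPIPE nostop print pass
(gdb) continue
```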

Jacek-ghub commented 1 year ago

plot-sink traps ctrl+c without any problems. I am not sure whether the plotter was trapping it before (I didn't pay attention, and quickly started using plot-sink).

I have a test run that I would like to run for 5-8 hours. So far it looks like the problem is gone. It is 2 hours into the plotting, and I just saw write speeds pushing up to 280 MBps or so. Basically, one and a half drives are needed at those speeds. Before, it was already in the 150-170 MBps range at that point, with 3-4 drives.

Also, it looks like less system CPU is used, but more user. The overall CPU usage dropped as well.

Sorry that I couldn't express the issue properly earlier. I was looking at everything on the box. Actually, it looks like in the process I may have damaged the second socket, so I will most likely need to get a new mobo for the box (I was basically trying to add more RAM to it (through the second CPU) and screwed something up).

So, not sure about you, but this is the first day without many headaches on my side.

Jacek-ghub commented 1 year ago

Well, some degradation has started. We need more than 2 drives basically all the time (say 2.5 drives). At the moment, write speeds are down to ~180 MBps. In bpytop, there is some choppiness in the HD writes. We are 4 hours into the plotting. However, it is holding up much better compared to the old run.

Update 1 (5 hours in): I saw a few HD writes with speeds down to the 130 MBps level, although we are still on 3 HDs. But we are flirting with a potential collapse like before.

Without analyzing this chart too much: the system-level CPU usage keeps growing (brownish-yellow at the bottom). It is too early to say, but user (navy blue in the middle) is slowly shrinking. It could also be argued that iowait is getting a bigger percentage (although that could just be jitter). (screenshot attached)

To put it in perspective, here is the last 12 hours' worth of data; on the left it has the part where the plotting kind of collapsed (HD writes in the range of 60-70 MBps across 4 drives; plotting times increase as the NVMe chokes, sometimes from ~170 secs into the 500-600 sec range (not all, only those that were hosed by slow HD writes)). What is worth noticing is that at that point, the user part of the CPU usage was kind of shot. (screenshot attached)

Actually, checking back on the test output, we are officially in the 120 MBps range. 4 HDs are in play, and the choppiness of HD writes has started. I think it will soon hit those ~60 MBps levels, as I don't think there is any recovery from this point. This is the bpytop output that shows the choppiness: (screenshot attached) Those brownish bars per HD (d1-4) have no gaps (other than when dumps to HD are done) when all is well. Also, when I took that screenshot, d2 was at 23 MBps, and both d1 and d4 were at 66 MBps.

I will wait for it to choke the NVMe (and provide an update) and will try to capture ctrl+c at that point.

Update 2 (6 hours in): Well, as before, once the write speeds are around 150 MBps, it starts going down faster. Right now, the NVMe is choked, and speeds are around 80 MBps (if the plotter reports it right; looking at bpytop, they seem to be lower). So, it looked very promising initially, as it held above 250 MBps for longer, but overall it actually degraded at roughly the same speed (6 h in and the plotter was done).

Here is the CPU chart, which shows a steady increase in system CPU usage (brownish at the bottom), a reduction in user-level percentage (navy blue), and increased fluctuation in iowait (dark red at the top). (screenshot attached)

Although, the whole time, plots were created in ~175 secs. When the NVMe started choking, and there were fewer HD writes, they actually improved, down to 165 secs or so (as expected: no more, or very few, NVMe reads).

So, at the moment, it looks to me that on older systems (a single E5-2600 v2, 256 GB RAM, HDs locally attached; a Dell T7610, in my case) the cuda-plotter cannot be left to plot unattended. Most likely, if the system is dual (multi) CPU, the extra RAM may be instrumental in at least slowing down this degradation, if not removing it completely (I think that in your test, your remote plot-sink acted as a local RAM extension, so you have not seen this problem).

Hopefully, something from what I've collected can help to better understand this issue, and we will have a fix soon.

Jacek-ghub commented 1 year ago

Well, here you go.

Output from the plotter:

[P1] Table 2 took 7.742 sec, 4294823798 entries, 16789648 max, 66723 tmp, 4.1333 GB/s up, 6.58748 GB/s down
[P1] Table 3 took 9.541 sec, 4294503292 entries, 16788964 max, 66697 tmp, 5.03075 GB/s up, 8.90894 GB/s down
Flushing to disk took 22.25 sec
Started copy to /mnt/d1/mmx/plots/plot-mmx-k32-c8-2023-02-12-22-16-46233253a4386805cc1bcf99fb552691b15aa036613a978df71c46ad8f20a0b6.plot
[P1] Table 4 took 12.79 sec, 4293907263 entries, 16786162 max, 66589 tmp, 6.25421 GB/s up, 9.30416 GB/s down
^Cbull@bull:~/mmx$

Not a graceful exit.

Here is what gdb thinks about it:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fb14dce25a7 in __GI___wait4 (pid=-1, stat_loc=0x7fff67fdb5b0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
(gdb)
(gdb) continue
Continuing.

Program received signal SIGINT, Interrupt.
0x00007fb14dce25a7 in __GI___wait4 (pid=-1, stat_loc=0x7fff67fdb5b0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
(gdb) continue
Continuing.
[Inferior 1 (process 47229) exited with code 0202]

By the way:

  1. the plotter runs on Ubuntu
  2. putty is the terminal (on Win)
  3. plot-sink properly responds to ctrl+c (so this is not exactly a putty / win issue)
  4. I ran xRDP to Ubuntu, started the plotter in the shell, and it hard aborted after getting ctrl+c (same as in putty)
  5. that box has no keyboard, mouse, or monitor, but I can set that up if needed (using my trusted kbd ;) )
madMAx43v3r commented 1 year ago

Your increased system usage might be RAM fragmentation; check the kernel threads via top -H.

Regarding the trapping: (screenshot attached)

There's not much to it; it works for me, and so far nobody else has this problem...

Jacek-ghub commented 1 year ago

Could you add a line to print that "num_plots" before you check that condition in the next build?

As for that RAM fragmentation, what can I do about it?

madMAx43v3r commented 1 year ago

Could you add a line to print that "num_plots" before you check that condition in the next build?

It's the -n argument that you provide.

As for that RAM fragmentation, what can I do about it?

There's a million knobs you can tune in the kernel; I'm not an expert on it...

Jacek-ghub commented 1 year ago

Could you add a line to print that "num_plots" before you check that condition in the next build?

It's the -n argument that you provide.

Sorry, what I meant to say is: could you print that value right before handling that ctrl+c? My feeling is that for some reason that if statement may be failing there. (I am running it with "-n -1", so we know what it should be.)

We have seen in gdb that ctrl+c was trapped, and we know that it is handled in plot-sink (most likely the same code), so one of the possibilities for why the plotter is not processing it is that the if statement is evaluating to false on that condition.

madMAx43v3r commented 1 year ago

num_plots would have to be 1 or 0 to make the if fail, in which case no plot or only 1 plot would be created.

Jacek-ghub commented 1 year ago

I follow that, and it works for me as you described, so we are good here.

I would just like to confirm that this is not some localized problem that is preventing the proper handling of ctrl+c in my case.

Or actually, could you add that print line inside that if statement? That would tell us whether ctrl+c was captured by the plotter but the handler was not executed for some reason.

madMAx43v3r commented 1 year ago

It just occurred to me I should check the return value like this: image

Let's try this: cuda_plot_signal_check.zip

Jacek-ghub commented 1 year ago

Thank you. I have another run going right now (I specified -t and -d to be the same and am using a script with mv to do the xfr). I am not sure what to expect (rather nothing), but as I said, I am out of bullets. I will give it a shot after that.

Adding an extra else statement there (to the outer if) gives you an opportunity to say "Thank you for using ..." I would not pass on that opportunity.

Maybe instead of having just those separate inner if statements, you could do if / else if / else, and in the final else just print "signal not handled" or something like that. One extra line, not invoked too often, but giving extra feedback to whoever triggered it.

A kind of dumb question: why isn't that leading if statement just:

if (1 != num_plots) {
  ...
}

Do you expect someone to set that n value to something in between 0 and 1, or do you modify num_plots so it can just happen to end up at such a value?

Basically, seeing that code for the first time (when you posted it previously), I thought that you modify that value, so I was making rather dumb assumptions.

madMAx43v3r commented 1 year ago

you can specify -n 0, that's why

Jacek-ghub commented 1 year ago

Oh yes, that clearly shows that we see only what we want to see. I didn't think that 0 was actually an option.

OK, I ran the modified code; here is the output from gdb when the ctrl+c happened:

[Thread 0x7f3177fff000 (LWP 637239) exited]
[Thread 0x7f31777fe000 (LWP 637240) exited]
[Thread 0x7f31767fc000 (LWP 637241) exited]
[New Thread 0x7f31767fc000 (LWP 637256)]
[New Thread 0x7f31f0ffd000 (LWP 637257)]
[New Thread 0x7f31777fe000 (LWP 637258)]
[New Thread 0x7f3175ffb000 (LWP 637259)]
[New Thread 0x7f31f1fff000 (LWP 637260)]
[New Thread 0x7f31f17fe000 (LWP 637261)]
[New Thread 0x7f3177fff000 (LWP 637262)]
[New Thread 0x7f31757fa000 (LWP 637263)]

Thread 1 "cuda_plot_k32-c" received signal SIGINT, Interrupt.
__futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x555777eb1b64) at ./nptl/futex-internal.c:57
57      in ./nptl/futex-internal.c
(gdb)
Continuing.
[Thread 0x7f31767fc000 (LWP 637256) exited]
[Thread 0x7f31f0ffd000 (LWP 637257) exited]
[Thread 0x7f31777fe000 (LWP 637258) exited]
[Thread 0x7f3175ffb000 (LWP 637259) exited]
[Thread 0x7f31f1fff000 (LWP 637260) exited]
[Thread 0x7f31f17fe000 (LWP 637261) exited]
[Thread 0x7f3177fff000 (LWP 637262) exited]
[Thread 0x7f31757fa000 (LWP 637263) exited]
[New Thread 0x7f3176ffd000 (LWP 637288)]

Thread 1 "cuda_plot_k32-c" received signal SIGPIPE, Broken pipe.
__GI___libc_write (nbytes=114, buf=0x5557776cf0d0, fd=1) at ../sysdeps/unix/sysv/linux/write.c:26
26      ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) continue
Continuing.
Couldn't get registers: No such process.
(gdb) [Thread 0x7f3176ffd000 (LWP 637288) exited]
[Thread 0x7f31f759d000 (LWP 635896) exited]
[Thread 0x7f31f7fff000 (LWP 635895) exited]
[Thread 0x7f31fc99d000 (LWP 635894) exited]
[Thread 0x7f32054f4000 (LWP 635889) exited]
[Thread 0x7f31fd19e000 (LWP 635893) exited]
[New process 635889]

Program terminated with signal SIGPIPE, Broken pipe.
The program no longer exists.
madMAx43v3r commented 1 year ago

received signal SIGPIPE, Broken pipe

That would be the issue, let me look into it...

madMAx43v3r commented 1 year ago

Try latest linux version, I'm ignoring SIGPIPE now

Jacek-ghub commented 1 year ago

OK, downloading now.

Jacek-ghub commented 1 year ago

I ran the following test:

  1. k32 / C7, -n 8
  2. a single HD as -d (to get a small backlog on the NVMe)
  3. ^C at the 5th plot after P1/T2 (3 plots sitting on the NVMe)
  4. there was no output after that anymore
  5. the plotter continued working to the end, finishing all 8 plots and xfring them to the HD
  6. once finished, the output moved back to the prompt

plotter output:

Plot Name: plot-mmx-k32-c7-2023-02-15-11-43-9eafd43860a4015fbe69c1ff21da944a9315c4d018857b1d13191e50971632dd
[P1] Setup took 1.098 sec
[P1] Table 1 took 3.855 sec, 4294967296 entries, 16790192 max, 66666 tmp, 0 GB/s up, 8.81978 GB/s down
[P1] Table 2 took 8.169 sec, 4294897985 entries, 16789982 max, 66640 tmp, 3.91725 GB/s up, 6.24314 GB/s down
^Cbull@bull:~/mmx$

gdb output:

[New LWP 82441]
[New LWP 82442]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x558cb8ff40f4) at ./nptl/futex-internal.c:57
57      ./nptl/futex-internal.c: No such file or directory.
(gdb) continue
Continuing.
[Thread 0x7f63acc8f000 (LWP 82435) exited]
[Thread 0x7f63ad490000 (LWP 82436) exited]
...
[New Thread 0x7f633ffff000 (LWP 104171)]
[New Thread 0x7f633effd000 (LWP 104172)]
...
Thread 1 "cuda_plot_k32-c" received signal SIGINT, Interrupt.
__futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x558cb8fc70f4) at ./nptl/futex-internal.c:57
57      in ./nptl/futex-internal.c
(gdb) continue
Continuing.
[Thread 0x7f633d7fa000 (LWP 103787) exited]

Thread 656 "cuda_plot_k32-c" received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f63ae492000 (LWP 103788)]
__GI___libc_write (nbytes=33, buf=0x558cb880a190, fd=1) at ../sysdeps/unix/sysv/linux/write.c:26
26      ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) continue
Continuing.
[New Thread 0x7f633d7fa000 (LWP 107045)]
[Thread 0x7f63ae492000 (LWP 103788) exited]
[Thread 0x7f63297fe000 (LWP 104165) exited]
...
[Thread 0x7f63c8ffd000 (LWP 81428) exited]
[Thread 0x7f63af7fe000 (LWP 81434) exited]
[New process 81428]
[Inferior 1 (process 81428) exited normally]
(gdb)