Open chenshi3 opened 3 days ago
The compilation process is also slow. I suspect it is an NFS issue, and we will try to fix it ASAP.
I'm unable to start VSCode on the cluster, so I'm using SFTP to update code.
I used `nfsstat` and found that the dominant operations were:
- open_noat (43%)
- lock (24%)
- locku (24%)
However, I checked `iostat` and found that the I/O load was not high. A possible reason is that we have too many `vscode-server` threads, which could lock many files.
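To quantify how much of the NFS traffic is lock-related, the `nfsstat` shares can be summed; this is a minimal awk sketch that uses the percentages reported above as a captured sample, not live data:

```shell
# Sum the lock-related NFS operation shares from a captured
# nfsstat breakdown (sample values are the percentages above).
awk '
/^lock/ { locks += $2 }   # matches both "lock" and "locku" lines
END { printf "lock-related ops: %d%%\n", locks }
' <<'EOF'
open_noat 43
lock 24
locku 24
EOF
```

With the figures above this reports 48% of operations as lock-related, consistent with the suspicion that file locking (rather than raw disk I/O) is the bottleneck.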
Still super slow, for both compiling and using Vim.
Here are some experimental facts that may be useful for tracing bugs.
1. Test results on reading and writing using `dd` on different machines
The client machine (10.26.200.1)
```
[121090184@node01 ~]$ dd if=/dev/zero of=~/testfile bs=1M count=512 oflag=direct
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 7.33619 s, 73.2 MB/s
[121090184@node01 ~]$ dd if=~/testfile of=/dev/null bs=1M count=512 iflag=direct
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 5.07156 s, 106 MB/s
```
Another cluster machine, not used for the course (10.26.200.13)
```
$ dd if=~/testfile of=/dev/null bs=1M count=512 iflag=direct
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1.04301 s, 515 MB/s
$ dd if=/dev/zero of=~/testfile bs=1M count=512 oflag=direct
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1.20259 s, 446 MB/s
```
The two machines show very different speeds: roughly 73 MB/s write / 106 MB/s read on the course client versus 446 MB/s write / 515 MB/s read on the other machine.
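As a sanity check, throughput is just bytes copied divided by elapsed time; recomputing from the byte counts and timings `dd` printed above reproduces the reported rates (dd uses decimal MB, i.e. 10^6 bytes):

```shell
# Recompute MB/s from the byte count and elapsed seconds dd reported:
# 536870912 bytes in 7.33619 s (client) vs 1.20259 s (other node).
awk 'BEGIN {
  bytes = 536870912
  printf "client write: %.1f MB/s\n", bytes / 7.33619 / 1e6
  printf "other  write: %.1f MB/s\n", bytes / 1.20259 / 1e6
}'
```

This prints 73.2 MB/s and 446.4 MB/s, matching dd's own summary lines, so the roughly 6x gap between the machines is real and not a reporting artifact.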
2. Network latency: connecting to the NFS server is fine (<1 ms).
3. Client communication

```
          op/s   rpc bklog
        107.39        0.00

read:   ops/s    kB/s    kB/op  retrans   avg RTT (ms)  avg exe (ms)
        0.330  10.835   32.861  0 (0.0%)         2.309        26.689
write:  ops/s    kB/s    kB/op  retrans   avg RTT (ms)  avg exe (ms)
        0.450  77.069  171.390  0 (0.0%)         5.744      1455.707
```
The average execution time for writes is about 1455 ms, which is very high.
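A quick filter can flag any operation whose average execution time is out of line; this sketch runs awk over the per-operation figures above (replayed as sample input), with an arbitrary 100 ms threshold chosen just for illustration:

```shell
# Flag NFS operations whose avg exe time ($7) exceeds 100 ms.
# Columns: op ops/s kB/s kB/op retrans avgRTT avgexe
awk '$7 > 100 { printf "%s avg exe %.0f ms exceeds threshold\n", $1, $7 }' <<'EOF'
read: 0.330 10.835 32.861 0 2.309 26.689
write: 0.450 77.069 171.390 0 5.744 1455.707
EOF
```

Only the write line is flagged: the average RTT for writes is small (5.744 ms), so the 1455 ms is spent on the client side (queueing/locking), not on the wire.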
4. I/O performance on client
```
[121090184@node01 ~]$ iostat -x 1
Linux 3.10.0-862.el7.x86_64 (node01)  09/18/2024  _x86_64_  (40 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.82   0.00     0.33     0.01    0.00  97.84

Device: rrqm/s wrqm/s  r/s  w/s rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda       0.00   0.26 0.01 7.72  0.67 339.25    87.93     0.03  3.28    0.45    3.29  0.11  0.09
dm-0      0.00   0.00 0.01 6.96  0.67 339.25    97.49     0.03  3.70    0.46    3.70  0.13  0.09
dm-1      0.00   0.00 0.00 0.00  0.00   0.00     8.18     0.00 47.80    0.22   54.49  2.49  0.00
```
This shows that local disk I/O on the client works well.
5. Log traced on `rm build` (`/nfsmnt/121090184/CUHKSZ-CSC4005/project1/rm_trace.log`)
*line 197*

```
30229 unlinkat(8, "cmake_clean.cmake", 0) = 0
30229 unlinkat(8, "build.make", 0) = 0
```
`unlinkat(8, "build.make", 0)` took about 10 minutes to finish.
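To make the slowdown easy to compare across filesystems, the deletion can be reproduced and timed on a throwaway tree; this is a sketch, and the scratch path `/tmp/rmtest` is an assumed location, not from the trace above (run it once under `/tmp` and once under the NFS mount):

```shell
# Build a throwaway tree of small files and time how long rm -r takes.
# /tmp/rmtest is an assumed scratch path; substitute a path on the NFS
# mount (e.g. under /nfsmnt) to compare against local disk.
dir=/tmp/rmtest
mkdir -p "$dir"
for i in $(seq 1 200); do touch "$dir/file_$i"; done   # 200 small files

start=$(date +%s%N)
rm -r "$dir"
end=$(date +%s%N)
echo "rm -r took $(( (end - start) / 1000000 )) ms"
```

On a healthy filesystem this finishes in milliseconds; on the affected NFS mount each `unlinkat` apparently stalls, which is consistent with lock contention rather than disk throughput.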
I suggest checking on the NFS server for details.
When running `rm -r build` in our cluster environment, the deletion takes considerably longer than expected (10 minutes or even more).