open-power / HTX

Apache License 2.0

HTX is crippled by a large number of disks and/or multipath disks. #115

Open dougmill-ibm opened 7 years ago

dougmill-ibm commented 7 years ago

When HTX is preparing, during the "su - htx", it appears to be going through the disks running "parted" (I see console messages indicating that various disk partition tables are being re-read, which is a known symptom of a deficiency in "parted"). With a large disk enclosure, possibly with multipath, it seems to be looping and touching the same disk over and over. It would appear that whatever algorithm is being used to discover/configure disks does not scale well at all, or has a bug in it.

It did eventually finish, but took over 20 minutes. There were about 130 disks on the system. There were some multipath disks, part of a 106-disk enclosure.

It also appeared to re-touch many of the disks when HTX started (Running), in spite of all those disks being disabled (halted) in the exerciser.

preeti-dhir commented 7 years ago

Hi Doug, we run the parted command (parted -mls) during setup time (su - htx) to figure out which disks, or which of their partitions, HTX can use for testing. For multipath, we run parted and kpartx again on each such device to figure out the mpath device and the partitions on it that can be used for testing. These steps are necessary for us to check the availability of each device. Now, if the number of disks on the system is large, scaling is a challenge because of the known parted issue. We will look into how we can optimize this further.
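For reference, the probing described above might look roughly like the sketch below. This is my reading of the comment, not code from HTX itself; the device paths and the per-device flag combination are assumptions.

```shell
#!/bin/sh
# Illustrative sketch of the setup-time discovery described above:
# one machine-readable "parted -mls" pass over all disks, then a targeted
# parted + kpartx query per multipath device. Paths are assumptions.
if command -v parted >/dev/null 2>&1; then
    parted -mls 2>/dev/null           # list every disk and its partitions
fi
for mp in /dev/mapper/mpath*; do
    [ -e "$mp" ] || continue          # skip if no multipath devices present
    parted -ms "$mp" print 2>/dev/null           # partitions on this mpath
    command -v kpartx >/dev/null 2>&1 && \
        kpartx -l "$mp"               # partition mappings kpartx would add
done
```

With ~130 disks, even one extra parted pass per device multiplies quickly, which matches the looping behavior reported above.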

During HTX run time we also run parted on the disks again. This is by design: if someone changes the system configuration between "su - htx" and "htx" (i.e. the start of the test), we might unknowingly run on a system disk or a boot partition and corrupt it. To avoid this, we run parted once more before actually starting the test.

dougmill-ibm commented 7 years ago

Hi Preeti, yeah, I have run into the parted issues before, too. We know that fdisk does not have the same problems, but there are rumors (which I have not been able to confirm) that fdisk is going to be removed. If fdisk is indeed going away, another approach would be to simply use the ioctl to read the partition table, although I'm not familiar enough with the structure of the HTX code to know whether using C is practical at that point.

But, as you also note, the one thing parted does do is make us aware of just how many times the partition tables are being read. I understand wanting to make sure things haven't changed since the last check, but just within the "su - htx" I see many, many calls to parted for the same disk. This seems like a place for optimization. Even a shell script could optimize some of this, I think, by using the "-nt" test on the block device against the last-captured partition data file.
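The "-nt" idea could be sketched as a small shell function like the one below. The cache directory and file naming are hypothetical, not anything HTX currently does.

```shell
#!/bin/sh
# Hedged sketch of the "-nt" caching idea: re-run parted for a disk only
# when the block device node is newer than the last captured listing.
cache_dir=${TMPDIR:-/tmp}/htx_part_cache
mkdir -p "$cache_dir"

refresh_partitions() {
    dev=$1
    cache="$cache_dir/$(basename "$dev").parts"
    # "-nt" is true when $dev is newer than $cache, or $cache is missing,
    # so the expensive parted call is skipped for unchanged disks.
    if [ "$dev" -nt "$cache" ]; then
        parted -ms "$dev" print > "$cache" 2>/dev/null || true
    fi
    cat "$cache"
}
```

Repeated calls for the same unchanged device then cost one `cat` instead of a full parted probe.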

preeti-dhir commented 7 years ago

Thanks, Doug, for the input. But right now we don't have the infrastructure to use an ioctl; that would need a lot of change on our side, and we need to take a call on it. For now, I am planning to remove the parted invocation at a couple of places during su - htx time where I think it might not be required. I will provide you a patch for testing.

dougmill-ibm commented 4 years ago

Note, the latest HTX still has severe scaling issues. I was starting up HTX on a system with 2x69 disks (2 paths) in a SAS enclosure, plus 4x5 FC disks, plus 4 local disks. HTX took over 45 minutes just to complete the "su - htx". Based on the messages, HTX is doing something fundamentally "wrong" in this area - I keep seeing the NVMe drive "repartition" continually during that time. How many times does HTX need to read the partition table of the same drive? It should only need to access each drive once. I seem to recall stumbling across some script/config file that was calling parted without specifying a disk - which is clearly unscalable, since that causes parted to scan all disks on each invocation.