open-mpi / hwloc

Hardware locality (hwloc)
https://www.open-mpi.org/projects/hwloc

I/O device support #6

Closed ompiteam closed 10 years ago

ompiteam commented 10 years ago

Updated TODO-list:

hwloc_obj_t hwloc_get_path_obj(hwloc_topology_t topo, const char *path);
hwloc_obj_t hwloc_get_fd_obj(hwloc_topology_t topo, int fd);

(The latter may return a network device or a disk device, depending on whether the descriptor is a socket or a file. Hmm, and what about NFS-mounted files?)

ompiteam commented 10 years ago

Imported from trac issue 5. Created by bgoglin on 2009-09-24T01:01:40, last modified: 2011-04-05T17:17:38

ompiteam commented 10 years ago

Trac comment by bgoglin on 2009-09-25 16:50:04:

Using libpci to scan all devices is fairly easy actually. We can build the busid string from there, and then read /sys/bus/pci/devices/%s/{local_cpus,numa_node}. I have some code to fill the following kind of structure:
{{{
struct hwloc_iodev {
  char busid; /* ::. */
  char *name; /* obtain from pci.ids, or NULL */
  unsigned short vendor_id, device_id, device_class;
  hwloc_cpuset_t local_cpus; /* mask of procs nearby */
  unsigned numanode_osindex; /* numa node nearby */
}
}}}
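The sysfs lookup described above could be sketched as follows. This is only an illustration: the helper names `read_pci_attr` and `read_pci_numa_node` are hypothetical, and the `root` parameter (normally `/sys/bus/pci/devices`) is an assumption added so the code can be pointed at a fake tree. Parsing the `local_cpus` mask into a cpuset is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one sysfs attribute of a PCI device into buf.
 * `root` is normally "/sys/bus/pci/devices". Returns 0 on success, -1 on
 * error (missing file, empty file, or path too long). */
int read_pci_attr(const char *root, const char *busid,
                  const char *attr, char *buf, size_t len)
{
    char path[512];
    FILE *f;
    int n;

    assert(root && busid && attr && buf);
    if (len == 0)
        return -1;
    n = snprintf(path, sizeof(path), "%s/%s/%s", root, busid, attr);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                      /* formatting error or truncation */
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';     /* strip the trailing newline */
    return 0;
}

/* Hypothetical helper: NUMA node near the device, or -1 if it cannot be
 * read. The kernel itself also reports -1 when it has no NUMA information. */
int read_pci_numa_node(const char *root, const char *busid)
{
    char buf[32];
    int node;

    if (read_pci_attr(root, busid, "numa_node", buf, sizeof(buf)) < 0)
        return -1;
    if (sscanf(buf, "%d", &node) != 1)
        return -1;
    return node;
}
```

The same `read_pci_attr` call with `"local_cpus"` returns the raw mask string, which would still need to be converted into a cpuset.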

numanode_osindex is probably useless, but I need to check that the kernel never sets numa_node without setting local_cpus properly. We could group all objects that have the same local_cpus/numanode_osindex, create doubly-linked lists of them, and attach the head of each list to the lowest hwloc_obj that covers local_cpus. So each hwloc_obj_t would have two new fields: {{{ hwloc_iodev_t first_iodev, last_iodev; }}} (ABI change: I wonder if we should reserve 16 bytes now to prepare for the future). Each hwloc_iodev would also have a "rank" within this list and a pointer to its "parent" hwloc_obj.
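A minimal sketch of that attachment scheme, under simplifying assumptions: a 64-bit mask stands in for `hwloc_cpuset_t`, the `struct obj` and `struct iodev` types and the function names are hypothetical, and `local_cpus` is assumed to be non-empty (an empty mask is covered by everything and would end up on the first leaf).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for hwloc_cpuset_t (assumption: at most 64 processors). */
typedef unsigned long long cpumask_t;

struct obj;

struct iodev {
    cpumask_t local_cpus;       /* mask of procs nearby */
    unsigned rank;              /* position within the parent's list */
    struct obj *parent;         /* object the device is attached to */
    struct iodev *prev, *next;  /* doubly-linked list of devices */
};

struct obj {
    cpumask_t cpuset;
    struct obj *first_child, *next_sibling;
    struct iodev *first_iodev, *last_iodev;
};

/* Return the lowest object under `root` whose cpuset covers `cpus`,
 * or NULL if even `root` does not cover it. */
struct obj *lowest_covering(struct obj *root, cpumask_t cpus)
{
    struct obj *child;

    if ((root->cpuset & cpus) != cpus)
        return NULL;
    for (child = root->first_child; child; child = child->next_sibling) {
        struct obj *found = lowest_covering(child, cpus);
        if (found)
            return found;
    }
    return root;                /* no child covers it, so root is lowest */
}

/* Append `dev` to the device list of the lowest covering object.
 * Returns 0 on success, -1 if no object covers the device. */
int attach_iodev(struct obj *root, struct iodev *dev)
{
    struct obj *o;

    assert(root && dev);
    o = lowest_covering(root, dev->local_cpus);
    if (!o)
        return -1;
    dev->parent = o;
    dev->prev = o->last_iodev;
    dev->next = NULL;
    dev->rank = o->last_iodev ? o->last_iodev->rank + 1 : 0;
    if (o->last_iodev)
        o->last_iodev->next = dev;
    else
        o->first_iodev = dev;
    o->last_iodev = dev;
    return 0;
}
```

Devices sharing the same `local_cpus` end up on the same list in insertion order, so the rank is simply the position at which each device was appended.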

We only gather a linear list of objects; we don't gather the actual PCI hierarchy (I don't think we care about it). We could filter on the device class to only keep GPUs, NICs, and other HPC-related stuff so that lstopo remains interesting.

Once the hwloc core has this, we can add some specific helpers such as "open an IB NIC near this cpuset" or "tell me the cpuset near this ibv_device that I just opened" (not hard to implement). We'll need the same info for CUDA, but we still haven't had feedback from its developers about this.

ompiteam commented 10 years ago

Trac comment by bgoglin on 2009-09-28 01:48:07:

What's a properly set cpuset for a device? Do you want to add fake OS numbers to each device when discovering them? (Note for the implementers: modify the cpuset of all objects covering this device when adding this fake OS number.)

Also, on which level would these objects be stored? Are we breaking the rule that currently puts only objects with same type on the same level? Or do we add a new depth just for pci devices (and one for pci devices)? (Note to implementers: we'd have to make sure we don't put those below PROC.) And I guess we'll need a way to return "not comparable" from hwloc_compare_types().

By the way, GPUs will be inside sockets in the near future, it's not only about machines and NUMA nodes :)

ompiteam commented 10 years ago

Trac comment by sthibaul on 2009-09-28 13:05:13:

> What's a properly set cpuset for a device? Do you want to add fake OS numbers to each device when discovering them?

No, I meant cpu_set being the set of CPUs near the device.

> Also, on which level would these objects be stored? Are we breaking the rule that currently puts only objects with same type on the same level?

Well, I've never assumed this in my code actually :)

> And I guess we'll need a way to return "not comparable" from hwloc_compare_types().

Yes.

> By the way, GPUs will be inside sockets in the near future, it's not only about machines and NUMA nodes :)

Right, and it's all the more interesting to be able to express that, i.e. yes, have not only cache/cores objects in sockets.

ompiteam commented 10 years ago

Trac comment by bgoglin on 2009-09-28 13:22:14:

> > Also, on which level would these objects be stored? Are we breaking the rule that currently puts only objects with same type on the same level?
>
> Well, I've never assumed this in my code actually :)

I think we should have such a rule, otherwise things may become a huge mess if we ever break it.

And I think we should also document all such rules somewhere, for instance along with the ones about PROC being at the bottom, SYSTEM being at the top, cpusets not intersecting between children, cpusets possibly being empty IIRC (for empty NUMA nodes and devices?), ... Maybe put all this near hwloc_topology_check() and extend what this function actually checks.
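As an illustration of what such a documented rule could look like once turned into a check, here is a small sketch of the "cpusets not intersecting between children" rule. The `struct node` type and `check_cpusets` name are hypothetical, a 64-bit mask stands in for `hwloc_cpuset_t`, and the extra test that a child's cpuset is included in its parent's is an assumption that the thread does not state. Empty cpusets are accepted, matching the remark about empty NUMA nodes.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long cpumask_t;   /* stand-in for hwloc_cpuset_t */

struct node {
    cpumask_t cpuset;                   /* may be empty (0) */
    struct node *first_child, *next_sibling;
};

/* Return 1 if, everywhere below `n`, sibling cpusets are pairwise disjoint
 * and each child's cpuset is included in its parent's; 0 otherwise. */
int check_cpusets(const struct node *n)
{
    cpumask_t seen = 0;
    const struct node *c;

    assert(n != NULL);
    for (c = n->first_child; c; c = c->next_sibling) {
        if (c->cpuset & seen)
            return 0;                   /* overlaps an earlier sibling */
        if (c->cpuset & ~n->cpuset)
            return 0;                   /* assumed rule: child within parent */
        seen |= c->cpuset;
        if (!check_cpusets(c))
            return 0;
    }
    return 1;
}
```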

> > And I guess we'll need a way to return "not comparable" from hwloc_compare_types().
>
> Yes

We need to change hwloc_compare_types() as soon as possible then, it is supposed to return <0, 0 or >0 only for now. Otherwise we'll break the ABI when adding devices in post-1.0.

ompiteam commented 10 years ago

Trac comment by sthibaul on 2009-09-28 13:33:05:

Oops, reading again:

> Are we breaking the rule that currently puts only objects with same type on the same level?

No, I don't mean that. I mean another level for PCI buses, another level for GPUs, another level for network boards, etc., but without any strict inclusion order with respect to the levels enclosing CPUs, i.e. a PCI bus could sit alongside sockets in a machine, or alongside NUMA nodes in a machine. A GPU could sit alongside caches+cores in a socket.

> And I think we should also document all such rules

Yes.

> We need to change hwloc_compare_types() as soon as possible then, it is supposed to return <0, 0 or >0 only for now. Otherwise we'll break the ABI when adding devices in post-1.0.

We can use INT_MAX, INT_MAX-1, etc. as special values (#defined to some HWLOC macro of course).
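The special-value idea could be sketched like this. Every name here (`TYPE_UNORDERED`, `enum obj_type`, `compare_types`) is hypothetical and chosen for illustration; this is not the hwloc API, and the list of types is an assumption. The sign convention assumed is that a negative result means the first type usually contains the second.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical sentinel for "these two types cannot be ordered". */
#define TYPE_UNORDERED INT_MAX

/* Hypothetical type list: CPU-side types first, in containment order,
 * followed by I/O types that have no place in that order. */
enum obj_type { T_SYSTEM, T_MACHINE, T_NODE, T_SOCKET, T_CACHE, T_CORE,
                T_PROC, T_PCI_DEV, T_OS_DEV };

/* Rank in the CPU-side containment order, or -1 for I/O types. */
static int type_order(enum obj_type t)
{
    return t <= T_PROC ? (int)t : -1;
}

/* <0 if `a` usually contains `b`, >0 for the opposite, 0 if equal,
 * TYPE_UNORDERED if the two types are not comparable. */
int compare_types(enum obj_type a, enum obj_type b)
{
    int oa = type_order(a), ob = type_order(b);

    if (a == b)
        return 0;
    if (oa < 0 || ob < 0)
        return TYPE_UNORDERED;
    return oa - ob;
}
```

Since the sentinel is itself greater than zero, a caller has to test for it before interpreting the sign of the result.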

ompiteam commented 10 years ago

Trac comment by bgoglin on 2009-09-29 01:41:12:

Actually, maybe hwloc_compare_types() could also return 0 for non-comparable types. People would be advised to check whether the type values are indeed equal when hwloc_compare_types() returns 0.

ompiteam commented 10 years ago

Trac comment by sthibaul on 2009-10-22 11:40:47:

OS devices (e.g. eth0, ide0, hda, sda, etc.) should probably be yet other kinds of objects: a RAID PCI device may have several disks, an Ethernet board may have several net devices, etc. This can be seen e.g. in

/sys/bus/pci/devices//ide0//block/
/sys/bus/pci/devices//net/
/sys/bus/pci/devices//host/target//block/
/sys/bus/pci/devices//drm/

(could use glob() to get these)
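A minimal sketch of the glob() approach, assuming a POSIX system. The helper name `list_matches` is hypothetical, and the wildcard pattern in the usage note below is an assumption, since the thread does not spell out the exact patterns.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <glob.h>
#include <stdio.h>

/* Print every path matching `pattern` and return the number of matches,
 * 0 if nothing matches, or -1 on any other glob() error. */
int list_matches(const char *pattern)
{
    glob_t g;
    size_t i;
    int rc, n;

    assert(pattern != NULL);
    rc = glob(pattern, 0, NULL, &g);
    if (rc == 0) {
        for (i = 0; i < g.gl_pathc; i++)
            printf("%s\n", g.gl_pathv[i]);
        n = (int)g.gl_pathc;
    } else if (rc == GLOB_NOMATCH) {
        n = 0;
    } else {
        n = -1;
    }
    globfree(&g);
    return n;
}
```

For instance, `list_matches("/sys/bus/pci/devices/*/net/*")` would be one way to list network interfaces per PCI device on Linux, assuming that directory layout.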

ompiteam commented 10 years ago

Trac comment by sthibaul on 2009-10-29 14:43:10:

We could also provide functions like:

hwloc_obj_t hwloc_get_path_obj(hwloc_topology_t topo, const char *path);
hwloc_obj_t hwloc_get_fd_obj(hwloc_topology_t topo, int fd);

(The latter may return a network device or a disk device, depending on whether the descriptor is a socket or a file. Hmm, and what about NFS-mounted files?)

ompiteam commented 10 years ago

Trac comment by bgoglin on 2011-03-29 13:33:19:

big TODO update

ompiteam commented 10 years ago

Trac comment by bgoglin on 2011-04-05 17:17:08:

Fixed in r3381.

ompiteam commented 10 years ago

Trac comment by bgoglin on 2011-04-05 17:17:38:

move to v1.3