sakjain92 / Fractional-GPUs

Splits a single Nvidia GPU into multiple partitions with complete compute and memory isolation (with respect to performance) between the partitions

latest nvidia driver 418.87.00 support #1

Open tjdhc3889 opened 5 years ago

tjdhc3889 commented 5 years ago

Hi,

I am trying to port your nvidia-uvm driver modifications from the 390 NVIDIA driver to 418.87.00, but fgpu_server reports that it cannot get the color info. I added some printk calls inside the nvidia-uvm driver and found that the get-color ioctl is never issued at all. After adding more printk calls, it looks as if the ioctl is being hijacked by the MPS server rather than issued by fgpu_server itself: when fgpu_server tries to issue UVM_GET_DEVICE_COLOR_INFO, MPS hijacks it and issues UVM_UNMAP_EXTERNAL_ALLOCATION and UVM_FREE.

Since MPS is closed source, I don't know why this happens. Did you encounter a similar issue when doing this work on the 390 driver?

sakjain92 commented 5 years ago

The UVM_GET_DEVICE_COLOR_INFO ioctl is issued to the file /dev/nvidia-uvm by persistent/memory.cu. As far as the 390 and 418 versions are concerned, this file is used exclusively by nvidia-uvm.

Relevant code:

#define NVIDIA_UVM_DEVICE_NAME          "nvidia-uvm"
static int __init uvm_init(void)
{
    NvBool allocated_dev = NV_FALSE;
    // The various helper init routines will create their own minor devices, so
    // we only need to create space for them here.
    int ret = alloc_chrdev_region(&g_uvmBaseDev,
                              0,
                              NVIDIA_UVM_NUM_MINOR_DEVICES,
                              NVIDIA_UVM_DEVICE_NAME);
    if (ret != 0) {
        UVM_ERR_PRINT("alloc_chrdev_region failed: %d\n", ret);
        goto error;
    }
    allocated_dev = NV_TRUE;

    ret = uvm8_init(g_uvmBaseDev);

    if (ret != 0) {
        UVM_ERR_PRINT("uvm init failed: %d\n", ret);
        goto error;
    }
    ....

int uvm8_init(dev_t uvm_base_dev)
{
    bool initialized_globals = false;
    bool added_device = false;
    bool initialized_tools = false;
    int ret = -ENODEV;
    dev_t uvm_dev = MKDEV(MAJOR(uvm_base_dev), NVIDIA_UVM_PRIMARY_MINOR_NUMBER);
    NV_STATUS status;

    status = uvm_global_init();
    if (status != NV_OK) {
        UVM_ERR_PRINT("uvm_global_init() failed: %s\n", nvstatusToString(status));
        goto error;
    }
    initialized_globals = true;

    uvm_init_character_device(&g_uvm_cdev, &uvm_fops);
    ret = cdev_add(&g_uvm_cdev, uvm_dev, 1);
    if (ret != 0) {
        UVM_ERR_PRINT("cdev_add (major %u, minor %u) failed: %d\n", MAJOR(uvm_dev), MINOR(uvm_dev), ret);
        goto error;
    }
    ....

static const struct file_operations uvm_fops =
{
    .open            = uvm_open,
    .release         = uvm_release,
    .mmap            = uvm_mmap,
    .unlocked_ioctl  = uvm_unlocked_ioctl,
#if NVCPU_IS_X86_64 && defined(NV_FILE_OPERATIONS_HAS_COMPAT_IOCTL)
    .compat_ioctl    = uvm_unlocked_ioctl,
#endif
    .owner           = THIS_MODULE,
};

static long uvm_unlocked_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    switch (cmd)
    {
        case UVM_DEINITIALIZE:
            return 0;

        ....
        UVM_ROUTE_CMD_STACK(UVM_GET_DEVICE_COLOR_INFO,          uvm_api_get_device_color_info);

I didn't encounter this issue with the 390 version. You can try changing the number of this specific ioctl:

#define UVM_GET_DEVICE_COLOR_INFO                                   UVM_IOCTL_BASE(2042)

The last IOCTL is

#define UVM_IS_8_SUPPORTED                                            UVM_IOCTL_BASE(2047)

And the one before that, which is the last one used by the vanilla NVIDIA code in 418.88, is

#define UVM_POPULATE_PAGEABLE                                         UVM_IOCTL_BASE(71)

So anything from 72-2046 is available.
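The range above can be written down as a small compile-and-run check. This is only a sketch: it assumes that on Linux UVM_IOCTL_BASE(i) expands to the plain number i, and the helper name uvm_nr_unused_by_vanilla is hypothetical, not something from the driver or from FGPU. It only tells you whether a number collides with the vanilla ioctls listed above, not whether another FGPU ioctl already uses it.

```c
#include <stdbool.h>

/* Assumption: on Linux the UVM headers define UVM_IOCTL_BASE(i) as plain i. */
#define UVM_IOCTL_BASE(i) (i)

/* Values quoted in this thread. */
#define UVM_POPULATE_PAGEABLE     UVM_IOCTL_BASE(71)   /* last regular vanilla ioctl */
#define UVM_IS_8_SUPPORTED        UVM_IOCTL_BASE(2047) /* last ioctl overall */
#define UVM_GET_DEVICE_COLOR_INFO UVM_IOCTL_BASE(2042) /* FGPU addition */

/* Hypothetical helper: true if nr lies strictly between the two vanilla
 * ioctls above, i.e. in the 72..2046 range that vanilla code leaves unused. */
bool uvm_nr_unused_by_vanilla(unsigned int nr)
{
    return nr > UVM_POPULATE_PAGEABLE && nr < UVM_IS_8_SUPPORTED;
}
```

For example, uvm_nr_unused_by_vanilla(UVM_GET_DEVICE_COLOR_INFO) is true, while 71 and 2047 are rejected. If you renumber the ioctl, the userspace definition used by persistent/memory.cu has to change to the same value.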

But I think it is unlikely that MPS is hijacking the ioctl; something else is probably happening. I think MPS is mostly a separate kernel module (which I think shouldn't override the ioctl handler of /dev/nvidia-uvm). Also, if there is a userspace component of MPS, it shouldn't trap system calls like ioctls (it can if it has the right permissions, but I don't think it does). Make sure your updated kernel driver is loaded (and not the vanilla NVIDIA driver, which doesn't have the UVM_GET_DEVICE_COLOR_INFO ioctl).

BTW, when get_device_color_info() fails on the line ret = ioctl(g_uvm_fd, IOCTL_GET_DEVICE_COLOR_INFO, &params);, what are the values of ret and errno? Compare them with the Linux error codes (https://mariadb.com/kb/en/library/operating-system-error-codes/). You can also put a printk in uvm_unlocked_ioctl.
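A minimal userspace sketch of capturing ret and errno around the call is below. The names checked_ioctl and format_ioctl_result are hypothetical helpers, not part of FGPU; in memory.cu the arguments would be g_uvm_fd, IOCTL_GET_DEVICE_COLOR_INFO and &params.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: writes a one-line description of an ioctl result
 * into buf. ioctl() signals failure by returning -1 and setting errno. */
int format_ioctl_result(char *buf, size_t len, int ret, int saved_errno)
{
    if (ret != -1)
        return snprintf(buf, len, "ioctl ok");
    return snprintf(buf, len, "ioctl ret=%d errno=%d (%s)",
                    ret, saved_errno, strerror(saved_errno));
}

/* Hypothetical wrapper: issues the ioctl and records errno immediately,
 * before any other library call has a chance to overwrite it. */
int checked_ioctl(int fd, unsigned long cmd, void *params,
                  char *buf, size_t len)
{
    int ret = ioctl(fd, cmd, params);
    int saved_errno = errno;

    format_ioctl_result(buf, len, ret, saved_errno);
    return ret;
}
```

Printing buf after a failed call gives both values in one line. Linux drivers conventionally return ENOTTY for a command they do not recognise, though a particular driver may return something else such as EINVAL, so the errno value alone does not prove which module handled the call.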

If you want, I can try to help you out if you can give me access to your machine. You can communicate with me offline (sakjain92@gmail.com)

BTW, why are you trying to update the device driver from 390 to 418? (I guess 390 should be sufficient, or is there a feature in the new driver version that you want?) And what is your use case for FGPU? (Just curious.)

proxion7 commented 1 year ago

I am porting this code to the NVIDIA open device driver (515).

The current problem is that when I run the MPS server, the ioctl call in memory.cu reaches the nvidia module's nvidia_ioctl instead of the nvidia-uvm module's unlocked_ioctl.

It appears to be the same problem tjdhc3889 had. It looks like NVIDIA changed the behavior of MPS after the 390 driver.

Forwarding the call from nvidia_ioctl to unlocked_ioctl does not work either, because the two functions live in different modules.

Any ideas or solutions to these problems? Is there any information about the inner workings of MPS or the source code?
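One way to narrow this down from userspace is to check which character device the file descriptor actually refers to just before the ioctl is issued. The sketch below compares the major number of the fd with the major registered under a given name in /proc/devices. It assumes the UVM module registers itself as "nvidia-uvm" (as in the alloc_chrdev_region excerpt earlier in the thread); all function names are hypothetical and nothing here is specific to MPS.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Parses one "/proc/devices" line of the form "<major> <name>".
 * Returns the major number if the name matches, otherwise -1. */
int major_if_named(const char *line, const char *name)
{
    int maj;
    char dev[64];

    if (sscanf(line, "%d %63s", &maj, dev) != 2)
        return -1;
    return strcmp(dev, name) == 0 ? maj : -1;
}

/* Looks up the character-device major registered under name.
 * Returns -1 if /proc/devices cannot be read or the name is absent. */
int lookup_char_major(const char *name)
{
    FILE *f = fopen("/proc/devices", "r");
    char line[128];
    int found = -1;

    if (!f)
        return -1;
    while (found < 0 && fgets(line, sizeof line, f)) {
        if (strncmp(line, "Block devices:", 14) == 0)
            break; /* character devices are listed before block devices */
        found = major_if_named(line, name);
    }
    fclose(f);
    return found;
}

/* Returns 1 if fd is a character device with the major registered under
 * name, 0 if it is some other file or device, -1 if the check failed. */
int fd_is_char_device_named(int fd, const char *name)
{
    struct stat st;
    int maj = lookup_char_major(name);

    if (fstat(fd, &st) != 0 || maj < 0)
        return -1;
    if (!S_ISCHR(st.st_mode))
        return 0;
    return (int)major(st.st_rdev) == maj;
}
```

Calling fd_is_char_device_named(g_uvm_fd, "nvidia-uvm") right before the ioctl in memory.cu would show whether the descriptor still points at the UVM device. A result of 0 would suggest the descriptor was opened on, or replaced by, a different device, which would be consistent with the call arriving in nvidia_ioctl.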

RavanN700 commented 11 months ago

Hello,

Did you solve this problem? Any update?

Thanks.
