pytorch / glow

Compiler for Neural Network hardware accelerators

[RFC] Device Resident and Specific Tensors #3629

Open nickgg opened 4 years ago

nickgg commented 4 years ago

Device Resident and Specific Tensors

The Tensor class in Glow is used extensively as an abstraction for typed memory on the host and is used in the compiler, runtime interfaces and backend specific execution code. It’s able to serve a wide range of uses in the code by being fundamentally a thin wrapper around an underlying buffer and adding extra functionality through helper functions (such as the TensorHandle).

As we get closer to production with Glow (and especially the runtime components) we have encountered some use cases that our Tensor abstraction does not handle well, or at all. This document addresses a few categories of data that we have no good abstraction for in Glow and need a solution for: Tensors with data that exist on the device but not the host (Device Resident) and Tensors which exist on the host in a device-specific format (e.g. aligned or pinned for a specific accelerator). Currently we handle these cases implicitly (e.g. we just know that this tensor must be aligned correctly), or we transform the Tensor before and after each usage (e.g. in DeviceManager::execute).

The solution we find to handle these issues may affect other open questions about how to transfer data between devices (e.g. P2P) and how to represent non-standard Tensors (e.g. large tensors we can’t fit in Host memory).

This is a large document, but I believe the implementation can be fairly uncomplicated (so feel free to skip to the end).

Use Cases

Training (Device Resident)

When training weights we have a situation where there are mutable tensors which are modified on every execution, but are not inputs or outputs (until the run is complete). Our current DeviceManager implementations cannot handle this optimally, and must transfer the weights tensor from host to device at the start of each execution (minibatch) and back to the host at the end of each execution. This is a substantial overhead which can dominate training speed.

To support training we need a solution where trainable tensors are initialized before the first run and retrieved after the last epoch, but on all runs in between only the inputs (for the forward pass) and outputs (for loss calculation) are transferred to/from the device.

[Figure: training workflow with epochs (green) and minibatches (purple)]

The above example demonstrates a typical training workflow where epochs are marked in green and each minibatch in purple. The weights are initialized at the beginning of the run, and are modified by the backward pass of each minibatch but reading and interpretation of the results is done only at the very end of the run. Every minibatch needs new inputs loaded from the training set, and the user may want to pull down the outputs of each minibatch to do loss calculation on the host.

For this simple (and common) example use case we could encode this information into the graph itself. In fact we may not even need any changes to Glow core to do so, simply mark the weights as constants and manually retrieve them from the card at the end of the training run (assuming the device does not prevent mutation of glow Constants).

We can also handle Tensor location dynamically, by transferring each Tensor to the location we need it at the time that we need it, which would allow us to do the fewest transfers without any special knowledge or annotation required in the user’s graph.

Device Pinned Tensors (Device Specific)

Some vendors’ backend libraries require host memory to be in a particular format before initiating DMA or transfer to the device - most commonly with a defined alignment scheme. Some backends are “helpful” and provide a Host Tensor object that abstracts these details, the consequence being that we must use their allocator for Tensors or suffer an unnecessary copy for each Tensor that is transferred to the device (the NNPI Backend demonstrates a particularly good example of this issue).

The diagram below is a summary of Tensor memory copies done in glow for each request:

[Figure: Tensor memory copies done in Glow for each request]

From this you can see that on every request we copy each input Tensor at least twice (The input copy in Onnxifi is not always necessary but IIRC we do it most of the time currently), and each output tensor three times (including DMA).

The grey sections in bold refer to copies that are done to transform the input Tensor into the format required by the Vendor’s runtime stack. We can eliminate these copies by creating glow::Tensors with the correct format in the Onnxifi layer. Doing this requires two changes to glow:

This may not immediately seem related to the Device Residency example above, but both requirements can be conceptualized around the idea of opaque Device Specific metadata attached to a Tensor.

Colocated Partitions (Device Resident)

After partitioning a graph it’s possible for multiple partitions to be assigned to the same device (e.g. on two devices if we parallelize the inputs to a DAGNode). In this case if the output of one node was the input of another we would not need to move the Tensor between partitions, but we currently transfer all outputs back to the host and all inputs up to the device. This is unnecessary overhead we can remove to improve latency and throughput.

In the example below a network with four inputs and one output tensor is partitioned into three components: two of which are run in parallel on different devices and the third combines their activations and produces the output:

[Figure: network with four inputs partitioned into three components across two devices, with activations and a single output]

Clearly all four inputs must be transferred to a device (1 and 2 to Device 1, 3 and 4 to Device 2), and the Output must be transferred back to the Host if we’re going to do anything with it. The two activations are different: Activation 1 must be transferred from Device 1 to Device 2, either via the Host or directly via a P2P transfer. Activation 2 is interesting in that nothing must be done to it at all, in fact anything we do to Activation 2 is work that we can optimize away.

To support the most efficient flow here we need to be able to leave Activation 2 on the device intentionally at the end of Partition B, and then when executing Partition C know that it’s already loaded so we don’t overwrite this with another Host Tensor. This requires the Glow runtime’s Executor to be able to manipulate Tensors which do not have data present on the Host (ie. just the Tensor metadata).

In the DeviceManager we need to know a) if the Tensor is Device Resident, b) which Device it is resident on, and c) where it actually is on the device (ie. the pointer to pass into execute). We’ll need to synchronize this information from one partition to the next.

This is similar to the first training use case but more complex, because it is not known at compile time what to do with the output activations of each partition. Consider the above example in the case that Partition C is the most expensive operator, and the ideal provisioning strategy is to assign Partition A to Device 1, Partition B to Device 2 and Partition C to both devices. In that case the actual device used to execute Partition C (and therefore what to do about the activations) is decided by the Executor at the time of scheduling C. To support this optimally we will need to transfer the activations at that time (i.e. only when the destination device is known).

Enabling Peer-to-Peer Data Transfer (Device Resident)

Peer-to-Peer transfer is another case where Tensors may be manipulated while resident on a device. If we think about DMA transfers to a device being an operation on a Tensor (or DeviceManager) then we could trigger P2P transfers in a similar way.

Assuming that we can model transferring data to the device as converting a Host Tensor to a Device Resident Tensor, and saving data back from the device as the inverse, then we can extend this to modeling P2P transfers by converting one Device Resident Tensor into a different Device Resident Tensor. It’s worth noting that this approach can only express P2P data transfer (“send” side), not P2P control flow (“receive” side).

Even if we do eventually want the performance, or interoperability, of supporting Send and Receive nodes in the Glow graph, modeling P2P via Device Resident Transfers (which we need a solution for anyway) may allow us to support P2P sooner and provide something workable while we work out the best long-term approach.

Tensor Streaming (Non Resident!)

Another related and potentially overlapping feature we need to consider is the case of Tensors that are not resident on the Host or on any Device. Support for this will be necessary in the case of very large networks where all Tensors cannot fit on the Host at the same time. In this case Glow will need to be able to manipulate Tensors without access to their data, which is similar to when the Tensor is resident on a Device.

We could model this as a Device Resident Tensor where the device is a network device rather than one of the managed accelerators. They would likely share the same kinds of methods (Create, Load, Release) and metadata (Type, location and graph mapping). For most of Glow, which cares only whether the Tensor contents are manipulable, we could handle the different kinds of Device Resident Tensor in the same manner, which could simplify handling code.

There are also some differences between the use cases which may make a combined implementation introduce complexity rather than reduce it:

Existing Runtime Functionality

Before developing an implementation approach it’s worth remembering what Glow can already do, because any approach we take here should not restrict currently supported functionality.

Specifically, Glow supports these features we should take into account:

What other Frameworks Do

XLA

XLA has the TransferManager interface that is subclassed by each backend (somewhat analogous to our DeviceManager, but does not handle execution) which provides a common interface for converting Literal objects (analogous to our Tensor) into ShapedBuffer objects (like a simplified Tensor, just buffer ptr, Type, Backend kind and Device ID).

See:

TVM

In TVM device transfers are part of the DeviceAPI (again similar to DeviceManager, but does not handle execution) which is subclassed by each backend and provides a common interface for copying data from one address with an attached TVMContext to another address with an attached context. Each DLTensor (analogous to our Tensor) holds a DLContext (typedefed to TVMContext) which contains the Device type and Device ID.

In this example all Tensors are necessarily bound to a device and it is up to the caller to manage device tensors.

See:

Pytorch

In Pytorch torch.Device is a required property of torch.Tensor: every Tensor is bound to a specific Device (“cpu”, “cuda:0”, etc). Transferring a Tensor to a device is equivalent to copy constructing a new Tensor with a different Device. Deciding how and when to transfer from host to device and back is up to the user in their program.

See:

Summary

Each framework has a mechanism of denoting the device associated with a Tensor (or Tensor-like object) by additional state within the Tensor. In each case the transfer to device is done via the creation of a new object with a different Device.

There are a few reasons why we may want to consider something different:

Proposed Implementation

My proposal is to separate how and when transfers are done from where Device Residency information is stored: the DeviceManager handles the transfer logic, while Device Residency information is stored in the Tensor directly.

Storing Device Residency Information

After copying a Tensor’s contents to the device we need to store two pieces of information to retrieve it: which device it resides on, and where on that device it is.

DeviceResidencyInfo could look like this:

struct DeviceResidencyInfo {
    DeviceManager *deviceManager{nullptr};
    void *context{nullptr};
};

Then we need to associate that Info with a Tensor and store it. There are a few options here:

With that said, I think the best overall approach is including DeviceResidencyInfo in the Tensor class:

class Tensor {
 private:
  DeviceResidencyInfo residency_;

 public:
  /* \returns true if this Tensor's most up to date data exists on a device. */
  bool isDeviceResident() const;

  /* Sets DeviceResidencyInfo in the Tensor, registering it as existing on a
     device. */
  void setDeviceResidency(const DeviceResidencyInfo& info);

  /* \returns the DeviceResidencyInfo associated with this Tensor; its fields
     may be null. */
  const DeviceResidencyInfo& getDeviceResidency() const;

  /* Removes any associated DeviceResidencyInfo. Note that this does not
     affect the device in any way; in particular, it does not free associated
     memory on the device. */
  bool clearDeviceResidency();
};

It’s worth repeating that only the metadata is in the Tensor object; any logic regarding transfers should exist elsewhere.

Another advantage of using the existing Tensor object to represent Device Resident Tensors is that we can use common infrastructure for pooling and organisation: for example it would allow us to replace the HabanaIOBufferPool with a standard glow::TensorPool which manages Tensors with DeviceResidencyInfo.

Transferring To and From the Device

The DeviceManager is the right place for methods that transfer Tensor data because transfer logic is inherently device specific and the DeviceManager is the runtime’s backend-specific component. It’s possible that we could use what we know about the various backend implementations to implement common code with backend specific hooks, but this would increase complexity - particularly if we don’t get assumptions about future devices right.

I propose the following API:

class DeviceManager {
  ...

  /* Copies the contents of \p tensor to the device. If \p location is set 
     it represents a device specific location for the tensor. */
  bool transferToDevice(Tensor& tensor, void *location = nullptr);

  /* Copies the device buffer associated with \p tensor to the host. If \p
     release is true, frees the device memory */
  bool transferFromDevice(Tensor& tensor, bool release = true);

  /* Releases the device buffer associated with \p tensor without copying
     its contents back to the host; frees the device memory. */
  bool releaseDeviceTensor(Tensor& tensor);

};

These three methods would need to be implemented in every DeviceManager, as the Executor (at least) will need a consistent way of managing Tensor location. In the case of devices that don’t support Resident Tensors or where that concept doesn’t make sense (e.g. the CPU and Interpreter backends), transferToDevice and transferFromDevice can be no-ops.
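
As a rough sketch, a backend with no separate device memory could implement the proposed methods as no-ops (the InterpreterDeviceManager is used here for illustration; the method signatures are the ones proposed above):

// Minimal sketch for a backend with no separate device memory. Nothing is
// recorded in the Tensor, so isDeviceResident() stays false and host-side
// code keeps working unchanged.
bool InterpreterDeviceManager::transferToDevice(Tensor& tensor, void *location) {
  (void)tensor;
  (void)location;
  return true; // No-op: host memory is the device memory.
}

bool InterpreterDeviceManager::transferFromDevice(Tensor& tensor, bool release) {
  (void)tensor;
  (void)release;
  return true; // No-op: the data never left the host.
}

bool InterpreterDeviceManager::releaseDeviceTensor(Tensor& tensor) {
  (void)tensor;
  return true; // Nothing to free.
}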

One decision that is important to be consistent about is what to do with the Tensor’s host memory while it is Device Resident. We could potentially free it, reducing host memory pressure, and reallocate it on the transfer back to the host. I’m not certain the memory win would be worth the CPU cost of the free+malloc (I think it’s not), but I am sure we’d complicate a lot of calling code (specifically unit tests, which expect to be able to verify an output buffer defined before execute). I propose we don’t do anything to the Tensor’s memory in transferToDevice and simply overwrite it in transferFromDevice.

We also need to decide what to do when a user attempts to modify a Tensor while it is Device Resident. It’s convenient for many reasons to store DeviceResidencyInfo in the Tensor, but it’s also awkward that a Device Resident Tensor is not a valid object for almost all of a Tensor’s methods, since the data buffer isn’t available. We could try something like automatically transferring the Tensor back to the Host before operating on it, but that could have unintended consequences. I suggest we just assert in these cases.
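
One way to do that is to guard the raw data accessors. A minimal sketch, assuming the underlying buffer pointer is held in a data_ member:

// Sketch: make host-side access to a Device Resident Tensor fail loudly
// rather than silently reading a stale buffer. Assumes the raw buffer
// pointer is stored in a `data_` member.
char *Tensor::getUnsafePtr() const {
  assert(!isDeviceResident() &&
         "Tensor data is resident on a device; transfer it back first.");
  return data_;
}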

For a more detailed example, take a device that exposes three methods for managing device memory (deviceMalloc, deviceFree and deviceMemcpy) and represents Device Tensors via pointers in the device’s address space. We can implement Device Residency with the following code:

bool ExampleDeviceManager::transferToDevice(Tensor& tensor, void *location) {
  DeviceResidencyInfo info = tensor.getDeviceResidency();

  // If the Tensor is already resident on a device, either move it directly
  // (same backend kind) or pull it back to the host first.
  if (info.deviceManager != nullptr) {
    if (info.deviceManager->getBackendKind() == this->getBackendKind()) {
      if (!location) {
        deviceMalloc(deviceID, &location, tensor.getSizeInBytes());
      }
      DO_P2P_TRANSFER(info.deviceManager->getDeviceId(), info.context,
                      this->getDeviceId(), location);
      // A full implementation would also release the source device buffer.
      tensor.setDeviceResidency({this, location});
      return true;
    }
    // Different backend kind: bring it back to the host and fall through.
    info.deviceManager->transferFromDevice(tensor, /*release*/ true);
  }

  if (!location) {
    deviceMalloc(deviceID, &location, tensor.getSizeInBytes());
  }

  deviceMemcpy(deviceID, location, tensor.getUnsafePtr(),
               tensor.getSizeInBytes(), Device::HostToDevice);
  tensor.setDeviceResidency({this, location});
  return true;
}

bool ExampleDeviceManager::transferFromDevice(Tensor& tensor, bool release) {
  DeviceResidencyInfo info = tensor.getDeviceResidency();
  assert(info.deviceManager == this);

  if (info.context == nullptr) {
    return false;
  }

  deviceMemcpy(deviceID, tensor.getUnsafePtr(), info.context,
               tensor.getSizeInBytes(), Device::DeviceToHost);

  if (release) {
    releaseDeviceTensor(tensor);
  } else {
    tensor.clearDeviceResidency();
  }

  return true;
}

bool ExampleDeviceManager::releaseDeviceTensor(Tensor& tensor) {
  DeviceResidencyInfo info = tensor.getDeviceResidency();
  assert(info.deviceManager == this);

  if (info.context != nullptr) {
    deviceFree(deviceID, info.context);
  }

  tensor.clearDeviceResidency();
  return true;
}

In this example we store the device address pointer in DeviceResidencyInfo.context and use the deviceMemcpy function to move data into and out of that buffer. This concept maps pretty well to how memory is managed in GPU backends, as well as the Habana backend.
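
For illustration, a host-side round trip using this example backend might look like the following sketch (the execution step is elided):

// Sketch of a host-side round trip with the example backend above.
void roundTripExample(ExampleDeviceManager& deviceManager) {
  Tensor input(ElemKind::FloatTy, {1, 1024});
  input.getHandle<float>().clear(1.0f); // fill with ones

  deviceManager.transferToDevice(input); // Host -> Device; input is now resident.
  assert(input.isDeviceResident());

  // ... execute one or more functions that consume the device-side buffer ...

  deviceManager.transferFromDevice(input, /*release*/ true); // Device -> Host.
  assert(!input.isDeviceResident());
}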

When to Transfer Tensors

It would also be valuable to have a common scheme or ruleset for when to transfer Tensors to and from the device. This is complicated, since the optimal thing to do in each situation will depend on both the use case and the details of the device. There are a couple of high-level options here:

[Figure: compile-time "Copy To" / "Copy From" transfers on a two-partition graph]

The diagram above demonstrates this on a graph with two partitions. In our current runtime the “Copy To” and “Copy From” blocks refer to copying all Tensors, but they could be modified to copy only the Tensors we did not want to remain on the device.

There are two complexities here: how to encode this Tensor transfer information at compile time (which is not graph based, at least currently) and what to do in the case where a partition could run on multiple devices (the best thing is probably to always copy to host in that instance, which is suboptimal).

[Figure: transfer logic moved into the Executor]

This would simplify DeviceManagers at the cost of increased complexity in the Executor, which is a bad trade for two reasons: firstly, the Executor is only used with the full HostManager based runtime, which increases the amount of extra work the user must do outside of that environment; secondly, the Executor is not a good fit for any device specific information. An example may be that while we generally want to leave outputs on the device if they are used again there, a certain device may have extremely constrained memory and want to get all outputs off as a rule; this distinction is difficult to achieve systematically with this approach and would require special cases in the Executor code.

[Figure: lazy transfers handled inside the DeviceManager]

Similar to how in the current implementation we loop over all placeholders in the PlaceholderBindings and copy the associated Tensor from the Host to the Device, we can loop over the same map and depending on the current Residency either copy it from Host to Device, from another Device to Host and then to this Device (or initiate a P2P transfer if that is supported) or do nothing. Putting this code in the DeviceManager means we could have common code to do this through DM interfaces which could be overridden in the event a particular backend needs to do something else.

Copying lazily means that we only attempt to copy Tensors that are actually used, and so always do the least amount of work no matter the situation. By allowing the DeviceManager to handle any Residency for input Tensors we also unlock the ability to do the transfer early if it is more efficient to do so. For example, on a device that can do PCI transfers concurrently with execution (which is common) we could transfer the inputs of Partition B to the device while Partition A is executing (or not, depending on what the runtime situation happens to be).

I think the lazy solution provides us the best combination of simplicity, performance and extensibility. The implementation might look like this (assuming P2P is not supported):

// In DeviceManager::execute()
  {
    TRACE_EVENT_SCOPE(context.get(), TraceLevel::RUNTIME, "loadPlaceholders");
    // Copy over placeholders.
    for (auto PH : bindings->pairs()) {
      auto it = symbolTable.find(PH.first->getName());
      if (it == symbolTable.end()) {
        continue;
      }
      auto &symbolInfo = it->second;
      auto *hostTensor = PH.second;
      const DeviceResidencyInfo &info = hostTensor->getDeviceResidency();

      if (!info.deviceManager) {
        // If resident on no device, make it resident on this device.
        transferToDevice(*hostTensor);
      } else if (info.deviceManager != this) {
        // If resident on another device, transfer it via the host (no P2P).
        info.deviceManager->transferFromDevice(*hostTensor);
        transferToDevice(*hostTensor);
      }
      // else do nothing, it's in the right place.
    }
  }

Obtaining Final Outputs

We have concentrated so far on input and activation tensors, with the plan to leave outputs on the device as long as possible. We still need a savePlaceholders stage to pull results down from the device, but potentially only once, at the end of the execution. The implementation would be the inverse of the code above, but perhaps in the Executor or HostManager.

This covers existing behaviour, but one of the key use cases of Device Residency is when you want to keep the outputs of a network on the device between runs (such as weights in the training case). We don’t always want to transfer the outputs back to the host.

In the same way we have the choice to transfer eagerly or lazily:

I like the lazy approach here as well for the same reasons. It makes decisions simpler inside the runtime (by removing decision making) and allows a wider range of configurations.

Take for example a training set up where some results are used every iteration (e.g. forward pass outputs for loss calculation), some results are used every N iterations (e.g. combining certain weights at the end of each epoch in a multi-device training run), and some results are used only at the end of the full run (e.g. the final trained weights). This is complex to express statically which would be needed for the eager approach.

Of course, it’s not strictly necessary that the “Copy-back” flag on Placeholders be static; it could be modified by the caller before each run. However, since Placeholders are held by pointer and used for multiple pipelined (or, in the case of multiple devices, concurrent) iterations, this would create a great deal of extra complexity (at least) and would require synchronized storage inside the Placeholder.

Laziness does introduce a further burden on the user of Glow: they must now consider whether the Tensors they receive back from a run are resident on the Host or a Device. We can’t automatically transfer back to the Host in all cases to support situations like the previous training example, but we can provide helpers to make this easier:

class HostManager {
  ...

  /* Transfers any Device Resident Tensors in \p context back to the host. */
  void ensureOutputsAvailable(ExecutionContext& context) {
    for (const auto& PH : context.getPlaceholderBindings()->pairs()) {
      const DeviceResidencyInfo& info = PH.second->getDeviceResidency();
      if (info.deviceManager != nullptr) {
        info.deviceManager->transferFromDevice(*PH.second);
      }
    }
  }
};

This would make transferring all results back to the host a one-liner (we’d need this call on the ExecutionEngine as well). Since this is likely the most common case (especially for simple examples and unit tests) we could make it the default behavior and add a configuration option to the ExecutionContext to disable the final transfer.
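
For example, the training use case from the beginning of this document could then look roughly like the sketch below. The setTransferOutputsToHost() option and the loadMinibatchInputs() / runTrainingIteration() helpers are hypothetical stand-ins for whatever the final API looks like:

// Rough sketch of a training flow under lazy transfers and the helper above.
void trainNetwork(HostManager& hostManager, ExecutionContext& context,
                  unsigned numEpochs, unsigned minibatchesPerEpoch) {
  // Hypothetical option: don't copy results back to the host after every run.
  context.setTransferOutputsToHost(false);

  for (unsigned epoch = 0; epoch < numEpochs; ++epoch) {
    for (unsigned mb = 0; mb < minibatchesPerEpoch; ++mb) {
      loadMinibatchInputs(context, epoch, mb);    // hypothetical: fill inputs
      runTrainingIteration(hostManager, context); // forward + backward pass
      // Weights stay Device Resident between iterations; only the loss output
      // (if requested) is transferred back for host-side loss calculation.
    }
  }

  // After the last epoch, pull everything (including trained weights) back.
  hostManager.ensureOutputsAvailable(context);
}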

Pinned Tensors

Moving on from Tensors that exist on the device, we have the case of Pinned Tensors: Tensors that are resident on the Host, but in a potentially opaque, Device-specific format. The information we’d need to store for these tensors is very similar to what we need for Resident Tensors:

Clearly, this could map directly to the DeviceResidencyInfo structure defined above. It could increase complexity, since we’ll need additional state to know whether a Device Specific Tensor is actually on the device or not. On the other hand, perhaps the distinction is not useful to us, since we don’t (or can’t) manipulate the contents of pinned Tensors (e.g. because we don’t understand their alignment).
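
One possible shape for that extra state, purely for illustration (this is not part of the proposal above), is a small location enum alongside the existing fields:

// Illustrative only: extend DeviceResidencyInfo so the same structure can
// describe both host-pinned and device-resident Tensors.
enum class TensorLocation {
  Host,          // Ordinary host Tensor, no device involvement.
  HostPinned,    // On the host, but in a device-specific (pinned/aligned) format.
  DeviceResident // Contents live on the device; `context` identifies them.
};

struct DeviceResidencyInfo {
  DeviceManager *deviceManager{nullptr};
  void *context{nullptr};
  TensorLocation location{TensorLocation::Host};
};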

If all Backends need their Pinned Tensors configured from some small set of parameters then we may be able to represent this by a combination of core Glow features (e.g. by adding alignment and stride to Tensors, as most frameworks have). This could be simpler and more powerful in the short term, but adding a new parameter type causes much more churn.

In particular, if a Backend requires Pinned Tensors to be stored with some implementation specific metadata (such as a HostResource or BackendTensorInfo, etc) then an attempt by Glow to manage these Device Specific Tensors will result in duplication, at least in code but also potentially in work on the CPU.

Ultimately the most flexible and future proof infrastructure we can build is the ability for each Backend to allocate and free Pinned Tensors via whatever mechanism they use, and determine whether or not a Tensor is correctly pinned at the time they initiate a DMA transfer.

I propose adding the following to the DeviceManager:

class DeviceManager {
  ...

  /* Allocates a pinned Tensor on the host ready for DMA to this device. */
  Tensor allocatePinnedTensor(TypeRef type);

  /* Releases a pinned host Tensor, freeing its memory to be reused. */ 
  void releasePinnedTensor(Tensor& tensor);
};

Similar to Device Resident Tensors above, these interfaces support an implementation that uses object pooling via the standard glow::TensorPool if that’s what we want to do for that particular DeviceManager.

The second component required to realize the performance win is the ability for the user of the Runtime to allocate Device Specific Tensors up front, avoiding a copy from a standard Host based Tensor into a Pinned Tensor. This could be achieved by a simple interface on the HostManager:

class HostManager {
  ...

  /* Allocates each Placeholder in the \p context's PlaceholderBindings with a
     correctly pinned Tensor for the Function corresponding to \p networkName.
     If a Placeholder is already allocated (but not correctly pinned) its
     contents will be copied to the new Tensor. */
  void allocatePinned(llvm::StringRef networkName, ExecutionContext& context);
};

In this way, the user does not need to know what the specifics of their pinned Tensors are and can trust the Glow runtime to do the right thing under the hood.

The DeviceManager’s transferToDevice will need to handle pinned and unpinned Tensors appropriately of course, but that is already true.
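
A minimal sketch of what that could look like (PinnedDeviceManager, isPinnedForThisDevice() and dmaPinnedToDevice() are hypothetical; the latter is assumed to return the resulting device-side pointer):

// Sketch only: stage unpinned host Tensors through a pinned buffer before DMA.
bool PinnedDeviceManager::transferToDevice(Tensor& tensor, void *location) {
  void *devPtr = nullptr;
  if (isPinnedForThisDevice(tensor)) {
    // Fast path: the host buffer is already in the vendor's required format.
    devPtr = dmaPinnedToDevice(tensor, location);
  } else {
    // Slow path: copy into a correctly pinned Tensor first. This is exactly
    // the extra copy that allocatePinnedTensor() lets callers avoid up front.
    Tensor pinned = allocatePinnedTensor(&tensor.getType());
    pinned.assign(&tensor);
    devPtr = dmaPinnedToDevice(pinned, location);
    releasePinnedTensor(pinned);
  }
  if (!devPtr) {
    return false;
  }
  tensor.setDeviceResidency({this, devPtr});
  return true;
}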

Interaction with Peer-to-Peer

The diagram below lays out three approaches to handling P2P data transfers on an input graph that is partitioned into two pieces (P1 and P2) on two devices:

[Figure: three approaches to P2P data transfer for a two-partition graph (P1 and P2) on two devices]

Note: the size of each area is related primarily to how much text had to go in it; it is not intended to show representative time costs.

The leftmost sequence shows the current approach without P2P (and without Device Residency). The Executor (here as Host) triggers each partition, and inside the DeviceManager we allocate any device tensors, copy Host Tensors to the device, execute the function and finally copy Device Tensors back to the host.

The middle sequence shows a P2P design involving send (and potentially recv) Nodes. In this case both devices allocate and load their Device Tensors at the start of the run, then the P1 Function initiates a data transfer into the previously allocated Device Tensor on Device 2. Execution of P2 may be triggered by the host or directly as a result of the P2P transfer.

The rightmost sequence shows a P2P design using Device Transfers. In this case the DeviceManager does the same thing in each case in its setup/loadTensors phase: for each Tensor it does either a Host-to-Device transfer, a Device-to-Device transfer (P2P) or nothing, depending on the current location of that Tensor. Since we no longer need Tensors to exist on the host the saveTensors phase is omitted, and the Host transfers back any Tensors it needs at the end of the run (ie. the network’s outputs).

I like this last approach (at least in the short/medium term) because it removes the need to distinguish between devices that support P2P and those that don’t: we can just “do the right thing” inside the DeviceManager and everything works out.

Interaction with Tensor Streaming

Similarly to how we could model pinning via a Device Specific component, we could model Tensors that require streaming from the Network as being resident on a special kind of device (that can’t execute graphs, of course).

I don’t have a strong sense of whether this will work, or make sense in our system. Device Residency is a concept that is relevant only at run time, whereas streamed Tensors are currently only a compile time concern. It may be that they never overlap in handling code and so would benefit from a completely separate infrastructure.

Conclusion

By designing good abstractions I believe we can support Device Residency and Device Specificity (pinning) of Tensors with limited changes to core Glow. Regardless of the details, I’m reasonably confident that any approach we take here must be both dynamic and provide the ability to include device specific logic or configuration.

I’ve prototyped a (very) rough version of this approach with Lazy transfers and metadata stored in the Tensor. This is functional and in the DLRM tracing example greatly reduces the number of transfers to and from the device without affecting the result.

Streaming large Tensors and support for Peer to Peer are currently less well understood, and while these changes may give us good tools for implementing that functionality they may also not be a good fit.

opti-mix commented 4 years ago

@nickgg Do you plan more work on this issue or can it be closed?