jonysy / parenchyma

An extensible HPC framework for CUDA, OpenCL and native CPU.

Transfer Matrix #23

Open drahnr opened 7 years ago

drahnr commented 7 years ago

There is a need to handle transfers between devices more easily.

The current approach of syncing from one backend to another is not sufficient and does not scale as more backends are added.

There are two things to think of, each with a fallback (see the sketch after this list):

1. Inter-framework transfers
2. Fallback: Framework A -> Native -> Framework B
3. Inter-device transfers (if the framework does not handle it itself, i.e. CUDA, afaik)
4. Fallback: Framework A / Device A -> Native -> Framework A / Device B
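As a rough illustration of that decision, here is a minimal sketch in Rust; `Framework`, `Device`, `Route`, and `pick_route` are made-up names for the sketch, not parenchyma's actual types:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Framework { Native, Cuda, OpenCl }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Device {
    framework: Framework,
    id: usize,
}

#[derive(Debug, PartialEq)]
enum Route {
    /// The framework (or a registered fast path) moves the data itself.
    Direct,
    /// Fall back to staging through host (Native) memory.
    ViaNative,
}

/// `direct_supported` stands in for a lookup into the transfer matrix.
fn pick_route(src: Device, dst: Device, direct_supported: bool) -> Route {
    // To and from Native is always populated.
    if src.framework == Framework::Native || dst.framework == Framework::Native {
        return Route::Direct;
    }
    // Covers both remaining cases: another framework, or another device
    // within the same framework.
    if direct_supported { Route::Direct } else { Route::ViaNative }
}

fn main() {
    let cuda0 = Device { framework: Framework::Cuda, id: 0 };
    let cuda1 = Device { framework: Framework::Cuda, id: 1 };
    let cl0 = Device { framework: Framework::OpenCl, id: 0 };

    // Inter-framework without a registered fast path: stage through Native.
    println!("{:?}", pick_route(cuda0, cl0, false)); // ViaNative
    // Inter-device within CUDA, which can copy between its devices itself.
    println!("{:?}", pick_route(cuda0, cuda1, true)); // Direct
}
```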

Note that the matrix is supposedly symmetric, but the transfer functions are not identical; read is not write, after all.

Note that this approach scales to new backends very quickly: if a particular transfer becomes a bottleneck, specialized functions can be registered for it. If not, and host memory is sufficient, the default fallback through Native is used (see the sketch below).
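A minimal sketch of the registration idea, assuming a plain map of per-direction transfer functions keyed by framework pair; `TransferMatrix`, `register`, and `transfer` are illustrative names, not parenchyma's API:

```rust
use std::collections::HashMap;

/// Hypothetical transfer function: copy raw bytes from source to destination.
type TransferFn = fn(src: &[u8], dst: &mut Vec<u8>);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Framework { Native, Cuda, OpenCl }

/// Sketch of the matrix: each (from, to) direction holds its own function,
/// so the shape is symmetric but the entries are not.
struct TransferMatrix {
    special: HashMap<(Framework, Framework), TransferFn>,
}

impl TransferMatrix {
    fn new() -> Self {
        TransferMatrix { special: HashMap::new() }
    }

    /// Register a specialized fast path for one direction only.
    fn register(&mut self, from: Framework, to: Framework, f: TransferFn) {
        self.special.insert((from, to), f);
    }

    /// Use a registered fast path if there is one; otherwise default to
    /// staging through host memory (Framework A -> Native -> Framework B).
    fn transfer(&self, from: Framework, to: Framework, src: &[u8], dst: &mut Vec<u8>) {
        match self.special.get(&(from, to)) {
            Some(&f) => f(src, dst),
            None => {
                // Both hops below stand in for real device <-> host copies.
                let staged = src.to_vec();
                dst.clear();
                dst.extend_from_slice(&staged);
            }
        }
    }
}

fn main() {
    let mut matrix = TransferMatrix::new();
    // Pretend CUDA has a registered fast path into OpenCL buffers.
    matrix.register(Framework::Cuda, Framework::OpenCl, |src, dst| {
        dst.clear();
        dst.extend_from_slice(src);
    });

    let src = vec![1u8, 2, 3];
    let mut dst = Vec::new();

    matrix.transfer(Framework::Cuda, Framework::OpenCl, &src, &mut dst); // fast path
    matrix.transfer(Framework::OpenCl, Framework::Native, &src, &mut dst); // default
    println!("{:?}", dst);
}
```

The default branch is what keeps the matrix complete without special-casing every pair; registered entries only replace it where the round trip through host memory actually hurts.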

Note that the entries to and from Native are obviously always populated.

Note that maybe a big framework-level matrix is best suited, and then, if necessary, an inter-device matrix within each framework (see the sketch below).
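One possible shape for that nesting, sketched with assumed names: an outer framework-level map whose cells can carry an optional device-level map.

```rust
use std::collections::HashMap;

type DeviceId = usize;
/// Hypothetical per-direction transfer function (raw bytes for the sketch).
type TransferFn = fn(&[u8], &mut Vec<u8>);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Framework { Native, Cuda, OpenCl }

/// One cell of the outer, framework-level matrix.
#[derive(Default)]
struct FrameworkCell {
    /// Generic transfer for this framework pair (may route through Native).
    default: Option<TransferFn>,
    /// Inner, device-level matrix for fast paths such as CUDA
    /// peer-to-peer copies between two devices of the same framework.
    per_device: HashMap<(DeviceId, DeviceId), TransferFn>,
}

/// Outer matrix keyed by (source framework, destination framework).
type NestedTransferMatrix = HashMap<(Framework, Framework), FrameworkCell>;

fn main() {
    let mut matrix: NestedTransferMatrix = HashMap::new();
    let mut cuda_to_cuda = FrameworkCell::default();
    // A device-level fast path from CUDA device 0 to CUDA device 1.
    cuda_to_cuda.per_device.insert((0, 1), |src, dst| dst.extend_from_slice(src));
    matrix.insert((Framework::Cuda, Framework::Cuda), cuda_to_cuda);
    println!("framework cells: {}", matrix.len());
}
```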

jonysy commented 7 years ago

In addition:

The body of the `SharedTensor::autosync` method contains the logic in question.

@alexandermorozov's original comment:

Backends may define transfers asymmetrically; for example, CUDA may know how to transfer to and from Native backend, while Native may know nothing about CUDA at all. So if the first attempt fails, we change the order and try again.
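In code, the swap-and-retry pattern described there might look roughly like this; the trait and method names below are assumptions made for the sketch, not the actual Sync/autosync API:

```rust
/// Error returned when a backend has no route to the other one.
#[derive(Debug)]
struct NoRoute;

/// Stand-in for a backend-specific memory handle that may or may not know
/// how to talk to other backends.
trait SyncMemory {
    /// Try to pull `other`'s contents into `self`.
    fn sync_in(&mut self, other: &dyn SyncMemory) -> Result<(), NoRoute>;
    /// Try to push `self`'s contents into `other`.
    fn sync_out(&self, other: &mut dyn SyncMemory) -> Result<(), NoRoute>;
}

/// If the destination cannot pull from the source (e.g. Native knows nothing
/// about CUDA), swap the direction and ask the source to push instead
/// (CUDA does know how to write to Native).
fn autosync(src: &dyn SyncMemory, dst: &mut dyn SyncMemory) -> Result<(), NoRoute> {
    match dst.sync_in(src) {
        Ok(()) => Ok(()),
        Err(NoRoute) => src.sync_out(dst),
    }
}

/// Toy backend that can push but not pull, so the retry path is exercised.
struct PushOnly;

impl SyncMemory for PushOnly {
    fn sync_in(&mut self, _other: &dyn SyncMemory) -> Result<(), NoRoute> {
        Err(NoRoute) // this backend does not know how to read from others
    }
    fn sync_out(&self, _other: &mut dyn SyncMemory) -> Result<(), NoRoute> {
        Ok(()) // but it does know how to write into them
    }
}

fn main() {
    let src = PushOnly;
    let mut dst = PushOnly;
    // The first attempt (pull into dst) fails, so autosync retries
    // in the opposite direction and the push from src succeeds.
    autosync(&src, &mut dst).expect("retry in the opposite direction");
}
```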

Removing that would require moving the logic into the Sync implementations, which could increase complexity. Although that's a disadvantage, transferring the responsibility to the frameworks would make adding other frameworks less of a hassle, since the core codebase wouldn't need to be aware of individual frameworks (e.g., transferring from CUDA to OpenCL).

This may be a case of over-engineering, though. Transferring from framework-x to framework-y is rarely, if ever, done.