[NeoML] Remove excess CUDA syncs in layers

Please, merge before:

The idea behind eliminating unnecessary synchronizations for CUDA is that scalar constants can be passed to GPU computation kernels from host memory by value.

One option would be to replace the scalar-constant arguments of math engine methods with plain float or int values. But when an operation uses the result of a previous operation (for example, you often need to multiply by the result of a scalar product), an additional synchronization would be needed to copy that result from device memory to host memory.

To avoid both kinds of synchronization, the wrapper class CScalarPararmeter<T> is used. It has two fields: a scalar constant and a handle (pointer) to device memory. The value of a scalar parameter is stored in exactly one of these fields, never both, depending on which constructor was called. CScalarPararmeter<T> is instantiated for two types: float and int.

All constructors of the CScalarPararmeter<T> wrapper are implicit, so it can be constructed directly either from a scalar constant in host memory or from a handle to device memory. This design avoids compilation errors when merging into OCRT.
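The wrapper described above can be illustrated with a minimal sketch. It is not the NeoML implementation: `DeviceHandle`, `ScalarParam`, and `MultiplyByScalar` are hypothetical names, and a plain host pointer stands in for device memory so the example compiles without CUDA.

```cpp
#include <cassert>

// Hypothetical stand-in for a typed handle to device memory.
// A plain host pointer plays the role of the device pointer here.
template <typename T>
struct DeviceHandle {
    const T* ptr;
};

// Hypothetical sketch of a scalar-parameter wrapper (not the NeoML class).
// The value lives in exactly one of the two fields, chosen by the constructor.
template <typename T>
class ScalarParam {
public:
    // Implicit on purpose: a host constant converts without extra syntax.
    ScalarParam(T value) : value_(value), handle_{nullptr}, onDevice_(false) {}
    // Implicit on purpose: a device handle converts the same way.
    ScalarParam(DeviceHandle<T> handle) : value_(), handle_(handle), onDevice_(true) {}

    bool IsHostValue() const { return !onDevice_; }

    // Resolves the scalar. In a real GPU kernel the dereference would
    // happen on the device, so the host would not need to wait for it.
    T Get() const { return onDevice_ ? *handle_.ptr : value_; }

private:
    T value_;                // used when constructed from a host constant
    DeviceHandle<T> handle_; // used when constructed from a device handle
    bool onDevice_;
};

// Hypothetical math-engine-style routine: scales n elements in place.
inline void MultiplyByScalar(float* data, int n, ScalarParam<float> scale) {
    const float s = scale.Get();
    for (int i = 0; i < n; ++i) {
        data[i] *= s;
    }
}
```

Because both constructors are implicit, a call site can pass either `2.0f` or a `DeviceHandle<float>` to `MultiplyByScalar` without naming the wrapper, which is the property the paragraph above relies on for source compatibility.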

To eliminate unnecessary CUDA synchronizations in the OCRT module as well, every place that uses a scalar constant has to be updated manually: a constant that was previously constructed through a CTypedHandleStackVar<T> should instead be passed directly to the math engine method, which is also more natural and readable.
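The call-site change can be sketched as follows. Everything here is a hypothetical mock rather than NeoML code: `MockDevice` counts host-to-device copies as a stand-in for the transfers this change is meant to avoid, `DeviceStackVar` imitates a scalar kept in device memory, and `ScalarArg` and `ScaleVector` imitate the wrapper and a math engine method.

```cpp
#include <cassert>

// Hypothetical mock device: counts host-to-device copies, which stand in
// for the transfers that the change described above is meant to avoid.
struct MockDevice {
    int transfers = 0;
    float slot = 0.0f; // a single cell of "device memory"
    void CopyToDevice(float v) { slot = v; ++transfers; }
};

// Hypothetical analogue of a scalar variable living in device memory.
class DeviceStackVar {
public:
    explicit DeviceStackVar(MockDevice& dev) : dev_(dev) {}
    void SetValue(float v) { dev_.CopyToDevice(v); }
    const float* Ptr() const { return &dev_.slot; }
private:
    MockDevice& dev_;
};

// Hypothetical scalar argument: holds a host value or a device pointer.
class ScalarArg {
public:
    ScalarArg(float v) : value_(v), ptr_(nullptr) {}
    ScalarArg(const DeviceStackVar& var) : value_(0.0f), ptr_(var.Ptr()) {}
    float Get() const { return ptr_ ? *ptr_ : value_; }
private:
    float value_;
    const float* ptr_;
};

// Hypothetical math-engine-style method: scales n elements in place.
inline void ScaleVector(float* data, int n, ScalarArg scale) {
    const float s = scale.Get();
    for (int i = 0; i < n; ++i) {
        data[i] *= s;
    }
}

// Old style: the constant is first written to device memory.
inline void ScaleOldStyle(MockDevice& dev, float* data, int n, float k) {
    DeviceStackVar scale(dev);
    scale.SetValue(k);           // costs one transfer on the mock
    ScaleVector(data, n, scale); // passed as a device handle
}

// New style: the constant is passed by value, no transfer needed.
inline void ScaleNewStyle(float* data, int n, float k) {
    ScaleVector(data, n, k);
}
```

On this mock, `ScaleOldStyle` increments `transfers` once per call while `ScaleNewStyle` leaves it unchanged, and both produce the same result. The device-handle path remains available for scalars that are themselves results of earlier device computations.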

[NeoML] Remove excess CUDA syncs in layers #1070