lab-cosmo / metatensor

Self-describing sparse tensor data format for atomistic machine learning and beyond
https://docs.metatensor.org
BSD 3-Clause "New" or "Revised" License

`metatensor-operations` utilities to simplify the creation of tensor maps, similar to the `*_like` operations, for tests and playing around #451

Open agoscinski opened 5 months ago

agoscinski commented 5 months ago

The goal is to reduce duplicated code in the tests caused by the creation of tensor maps, for example:

from typing import Optional, Tuple, Union

import numpy as np
import torch


def random_uniform_tensor(
    nb_blocks: int,
    shape: Union[Tuple[int, int], Tuple[int, int, int]],
    nb_gradients_per_block: Optional[int] = None,
    dtype: Optional[Union[np.dtype, torch.dtype]] = None,
    device: Optional[torch.device] = None,
):
    """
    Helper constructor to easily create tensor maps with uniform random
    initialization of the values. The labels are ranges from 0 to the number
    of elements. The names for the keys are ... and for the samples ...

    :param nb_blocks:
        Specifies the number of blocks of the tensor map.
    :param shape:
        Specifies the shape of the values of each block.
    :param nb_gradients_per_block:
        Number of gradients per block; by default no gradients are added.
    :param dtype:
        Defines the backend (numpy or torch) and the dtype of the values
        of the blocks.
    :param device:
        Only relevant for torch tensors: the device on which they live.
    """
    # create the tensor map according to the input
    tensor_map = ...
    return random_uniform_like(tensor_map)

Example usage:

random_uniform_tensor(2, (5, 2), 1)

I also thought about more flexibility, but I think supporting multi-column labels makes it more complicated without much gain in usage. If you need labels with more than one column, you usually have a specific use case in mind and need to specify the exact `Labels`. For reference, the more flexible version would have looked like this:

def random_uniform_tensor(
    nb_blocks: int,
    shape: Union[Tuple[int, int], Tuple[int, int, int]],
    nb_gradients_per_block: Optional[int] = None,
    nb_gradients_per_sample: Optional[int] = None,
    dim_labels_key: int = 1,
    dim_labels_samples: int = 1,
    dim_labels_components: Optional[int] = None,
    dim_labels_properties: int = 1,
    dtype: Optional[Union[np.dtype, torch.dtype]] = None,
    device: Optional[torch.device] = None,
):
    ...

Do you think such an operation would be useful in the first suggested form?

Luthaf commented 5 months ago

I'm not really sold on something like this ... From the example, I feel it makes reading the code a lot harder: as a new contributor, if I see something like `random_uniform_tensor(2, (5, 2), 1)` I have no idea what is happening.

Is the intended use case to make writing tests easier? Or to provide a new operation for everyone to use? In both cases, we already have `block_from_array`, which is not used that much in the existing tests but is already intended as an easier way to wrap data in metatensor.

We could provide a `tensor_from_arrays(keys: Labels, blocks: List[Array]) -> TensorMap` that would do something similar to your suggestion, but I doubt it would be used a lot.
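A minimal sketch of what such a helper could look like, reusing the existing `block_from_array` to generate default labels (the "key" dimension name and the usage example are illustrative assumptions, not a settled API):

from typing import List

import numpy as np

import metatensor
from metatensor import Labels, TensorMap


def tensor_from_arrays(keys: Labels, blocks: List[np.ndarray]) -> TensorMap:
    # wrap each raw array in a TensorBlock with default labels, reusing
    # the existing block_from_array helper, then assemble the TensorMap
    return TensorMap(keys, [metatensor.block_from_array(a) for a in blocks])


# hypothetical usage: two 2D blocks keyed by a single "key" dimension
keys = Labels.range("key", 2)
tensor = tensor_from_arrays(keys, [np.random.rand(5, 2), np.random.rand(3, 2)])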

agoscinski commented 5 months ago

> for tests we need to add more edge cases with samples present in some but not all blocks, etc.

Yes, but that also means that there is some generic construction of a tensor map that every test seems to need. In every subpackage we have a utils.py with some arbitrarily complicated tensor map definition that is documented as a "dummy tensor map". The test-specific edge-case constructions are usually in the test file itself. Something with gradients is needed that can also be used by all subpackages as a base test. I am also okay for now with always using one of the stored tensor maps, e.g. ./python/metatensor-operations/tests/data/qm7-power-spectrum.npz, but the utility to load them in a dispatched way only exists in the metatensor-torch subpackage, so I would need to copy it to metatensor-learn. By having something like this in operations, every subpackage can use it and code redundancy is greatly reduced.
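For illustration, loading that stored file with the numpy-based package is a one-liner with `metatensor.load`; the dispatched (numpy/torch) equivalent is what currently only exists in metatensor-torch:

import metatensor

# load the serialized TensorMap used as shared test data
tensor = metatensor.load(
    "python/metatensor-operations/tests/data/qm7-power-spectrum.npz"
)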

> for users putting their data in metatensor, they will want more control over the samples and components names & values

The current `block_from_array` method and the `tensor_from_arrays` you suggested don't include gradients, and I have found myself several times spending time constructing a tensor map because I wanted to play around with gradients. I think at least one utility operation should simplify the creation of a tensor map with gradients. I am open to other suggestions, but `tensor_from_arrays` does not solve it.
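To illustrate the boilerplate in question, here is a minimal sketch of manually attaching a gradient to a single block (the shapes and the "positions" parameter name are arbitrary choices for this example):

import numpy as np
from metatensor import Labels, TensorBlock

block = TensorBlock(
    values=np.random.rand(4, 3),
    samples=Labels.range("sample", 4),
    components=[],
    properties=Labels.range("property", 3),
)

gradient = TensorBlock(
    values=np.random.rand(4, 3),
    # the first dimension of the gradient samples must be called "sample"
    # and index into the samples of the parent block
    samples=Labels.range("sample", 4),
    components=[],
    properties=Labels.range("property", 3),
)
block.add_gradient("positions", gradient)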

Luthaf commented 5 months ago

So my main point is that if this is only intended for tests, we should not put it in the main API. I'd be fine with something like `metatensor.operations.test_utils.random_tensor` (where one would have to explicitly import `metatensor.operations.test_utils` to access it). I'm not convinced this will be used by a lot of tests, but I'm happy to give it a go!

I feel like this would be plenty for the API:

from typing import List, Optional

from metatensor import TensorMap


def random_tensor(
    *,  # force all arguments to be keyword arguments
    n_blocks: int,
    shape: Optional[List[int]] = None,  # defaults to 8x3 or something
    gradients: Optional[List[str]] = None,
) -> TensorMap:
    ...

I think device and dtype should be handled by `metatensor.to()`.
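A possible implementation sketch of that signature; the label names, the uniform random values, and the single "key" dimension are illustrative assumptions, not a settled design:

from typing import List, Optional

import numpy as np
from metatensor import Labels, TensorBlock, TensorMap


def random_tensor(
    *,
    n_blocks: int,
    shape: Optional[List[int]] = None,
    gradients: Optional[List[str]] = None,
) -> TensorMap:
    if shape is None:
        shape = [8, 3]

    def make_block() -> TensorBlock:
        # one block (or gradient) with uniform random values and range labels
        return TensorBlock(
            values=np.random.rand(*shape),
            samples=Labels.range("sample", shape[0]),
            components=[
                Labels.range(f"component_{i}", size)
                for i, size in enumerate(shape[1:-1], start=1)
            ],
            properties=Labels.range("property", shape[-1]),
        )

    blocks = []
    for _ in range(n_blocks):
        block = make_block()
        for parameter in gradients or []:
            # gradient blocks reuse the same shape and labels as the values
            block.add_gradient(parameter, make_block())
        blocks.append(block)

    return TensorMap(Labels.range("key", n_blocks), blocks)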