viash-io / viash

script + metadata = standalone component
https://viash.io
GNU General Public License v3.0

Add `type: h5ad_file` and `type: h5mu_file` #261

Open rcannood opened 1 year ago

rcannood commented 1 year ago

Adding extra metadata that describes the interface of an h5ad or h5mu file would make it possible to check later on whether a file adheres to that interface.

Right now a viash config doesn't allow for specifying the schema of an h5ad file:

functionality:
  name: dataset_preprocessing
  arguments:
    - name: "--input"
      type: file
      description: "An unprocessed dataset."
      example: "input.h5ad"
    - name: "--output"
      type: file
      description: "A preprocessed dataset"
      example: "output.h5ad"
      direction: output

Example of proposed functionality:

functionality:
  name: dataset_preprocessing
  arguments:
    - name: "--input"
      type: h5ad_file
      description: "An unprocessed dataset."
      example: "input.h5ad"
      slots:
        layers: 
          - type: integer
            name: counts
            description: Raw counts
        obs:
          - type: double
            name: labels
            description: Ground truth cell type labels
          - type: double
            name: batch
            description: Batch information
        uns:
          - type: string
            name: raw_dataset_id
            description: "A unique identifier for the original dataset (before preprocessing)"
    - name: "--output"
      type: h5ad_file
      description: "A preprocessed dataset"
      example: "output.h5ad"
      slots:
        layers: 
          - type: integer
            name: counts
            description: Raw counts
          - type: double
            name: lognorm
            description: Log-transformed normalised counts
        obs:
          - type: double
            name: labels
            description: Ground truth cell type labels
          - type: double
            name: batch
            description: Batch information
        uns:
          - type: string
            name: raw_dataset_id
            description: "A unique identifier for the original dataset (before preprocessing)"
          - type: string
            name: dataset
            description: "A unique identifier for the dataset"

I suggest that these fields have no functional impact for now.
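
As an illustration of the kind of check this metadata could enable later, here is a minimal Python sketch. It assumes the `slots` block has already been parsed into plain dicts and that the contents of a file are summarised as a set of names per slot; `missing_slots` and `OUTPUT_SLOTS` are hypothetical names, not part of Viash. A real gatekeeper would read the names from the h5ad file itself.

```python
from typing import Dict, List, Set

# Slot spec for --output, shaped like the proposed YAML after parsing (assumed shape).
OUTPUT_SLOTS: Dict[str, List[dict]] = {
    "layers": [
        {"type": "integer", "name": "counts"},
        {"type": "double", "name": "lognorm"},
    ],
    "obs": [
        {"type": "double", "name": "labels"},
        {"type": "double", "name": "batch"},
    ],
    "uns": [
        {"type": "string", "name": "raw_dataset_id"},
        {"type": "string", "name": "dataset"},
    ],
}


def missing_slots(spec: Dict[str, List[dict]], available: Dict[str, Set[str]]) -> List[str]:
    """Return 'slot/name' for every required entry in the spec that is not available."""
    missing = []
    for slot, entries in spec.items():
        present = available.get(slot, set())
        for entry in entries:
            # Assumption: entries are required unless they explicitly say otherwise.
            if entry.get("required", True) and entry["name"] not in present:
                missing.append(f"{slot}/{entry['name']}")
    return missing


# Names found in a file that was not preprocessed yet (made-up example data).
found = {
    "layers": {"counts"},
    "obs": {"labels", "batch"},
    "uns": {"raw_dataset_id"},
}
print(missing_slots(OUTPUT_SLOTS, found))
```

This only checks names; checking the declared `type` of each entry would need access to the actual arrays.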

rcannood commented 1 year ago

Initial implementation, reusing existing arguments:

case class H5adFileArgument(..., slots: H5adSlots) extends AbstractFileArgument
case class FileArgument(...) extends AbstractFileArgument
case class IntegerArgument(...) extends BaseArgument[Int]

abstract class AbstractFileArgument extends Argument[Path]
abstract class BaseArgument[Type] extends Argument[Type]

// None of these arguments should have examples, defaults, directions, multiple, multiple_sep defined
case class H5adSlots(
  X: Option[BaseArgument[_]],
  layers: List[BaseArgument[_]],
  obs: List[BaseArgument[_]],
  obsp: List[BaseArgument[_]],
  obsm: Map[String, BaseArgument[_]],
  `var`: List[BaseArgument[_]], // `var` is a reserved word in Scala, hence the backticks
  varp: List[BaseArgument[_]],
  varm: Map[String, BaseArgument[_]],
  uns: List[BaseArgument[_]]
)

This doesn't work, because uns needs to be able to hold lists and data frames, and obsm and varm need to be able to hold data frames.

--

WIP:

// new arguments
abstract class AbstractFileArgument extends Argument[Path]
case class H5adFileArgument(..., slots: H5adSlots) extends AbstractFileArgument
case class FileArgument(...) extends AbstractFileArgument

// helper classes for h5adslots
abstract class H5adValue {
  val `type`: String
  val name: String
  val description: Option[String]
  val required: Boolean // default: true
}
case class H5adIntegerValue(...) extends H5adValue { ... }
case class H5adDoubleValue(...) extends H5adValue { ... }
case class H5adLongValue(...) extends H5adValue { ... }
case class H5adStringValue(...) extends H5adValue { ... }
case class H5adBooleanValue(...) extends H5adValue { ... }
case class H5adDictValue(..., values: List[H5adValue]) extends H5adValue { ... }

case class H5adSlots(
  X: Option[H5adValue],
  layers: List[H5adValue],
  obs: List[H5adValue],
  obsp: List[H5adValue],
  obsm: Map[String, H5adValue],
  `var`: List[H5adValue], // `var` is a reserved word in Scala, hence the backticks
  varp: List[H5adValue],
  varm: Map[String, H5adValue],
  uns: List[H5adValue]
)
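
To illustrate why a value type that can contain other values addresses the limitation noted above, here is a small Python sketch of the same idea (not the Scala implementation). It assumes a nested entry is written with `type: dict` and a `values` list, mirroring `H5adDictValue`; the YAML spelling and the names `SlotValue`, `parse_value` and `leaf_paths` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SlotValue:
    """Python stand-in for H5adValue; `values` is only filled for dict entries."""
    type: str
    name: str
    description: Optional[str] = None
    required: bool = True
    values: List["SlotValue"] = field(default_factory=list)


def parse_value(raw: dict) -> SlotValue:
    """Recursively turn a parsed config entry into a SlotValue tree."""
    children = [parse_value(child) for child in raw.get("values", [])]
    if children and raw["type"] != "dict":
        raise ValueError(f"only dict entries may have values: {raw['name']}")
    return SlotValue(
        type=raw["type"],
        name=raw["name"],
        description=raw.get("description"),
        required=raw.get("required", True),
        values=children,
    )


def leaf_paths(value: SlotValue, prefix: str = "") -> List[str]:
    """List the path of every non-dict entry below (and including) this value."""
    path = f"{prefix}/{value.name}" if prefix else value.name
    if not value.values:
        return [path]
    paths: List[str] = []
    for child in value.values:
        paths.extend(leaf_paths(child, path))
    return paths


# A nested uns entry (made-up example data).
raw_entry = {
    "type": "dict",
    "name": "pca",
    "values": [
        {"type": "double", "name": "variance"},
        {"type": "dict", "name": "params",
         "values": [{"type": "integer", "name": "n_comps"}]},
    ],
}
print(leaf_paths(parse_value(raw_entry)))
```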

rcannood commented 1 year ago

I'm starting to have a lot of components which have arguments like this:

arguments:
  - name: "--output"
    type: file
    direction: output
    description: The output h5ad file.
    example: output.h5ad
    info:
      slots:
        obsm:
          - type: double
            name: X_pca
            description: The resulting PCA embedding.
            required: true
        varm:
          - type: double
            name: pca_loadings
            description: The PCA loadings matrix.
            required: true
        uns:
          - type: double
            name: pca_variance
            description: The PCA variance objects.
            required: true

  - name: "--obsm_embedding"
    type: string
    default: "X_pca"
    description: "In which .obsm slot to store the resulting PCA embedding."

  - name: "--varm_loadings"
    type: string
    default: "pca_loadings"
    description: "In which .varm slot to store the PCA loadings matrix."

  - name: "--uns_variance"
    type: string
    default: "pca_variance"
    description: "In which .uns slot to store the PCA variance objects."

The slots field is used mostly for documentation and to automate gatekeeper components, but manually typing out all of these slot arguments is becoming a hassle. type: h5ad_file could help reduce some of this boilerplate, for example:

arguments:
  - name: "--output"
    type: h5ad_file
    direction: output
    description: The output h5ad file.
    example: output.h5ad
    slots:
      obsm:
        - type: double
          name: --output_embedding
          description: The resulting PCA embedding.
          required: true
          default: X_pca
      varm:
        - type: double
          name: --output_loadings
          description: The PCA loadings matrix.
          required: true
          default: pca_loadings
      uns:
        - type: double
          name: --output_variance
          description: The PCA variance objects.
          required: true
          default: pca_variance

The point is that, by specifying the slots for the --output argument, Viash would automatically add the following arguments: --output_embedding, --output_loadings, --output_variance
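
A rough Python sketch of what this expansion could look like, assuming the config has been parsed into plain dicts. `expand_slot_arguments` is a hypothetical helper, and the generated `type` and `description` values are my own guesses rather than settled behaviour.

```python
from typing import Dict, List


def expand_slot_arguments(argument: Dict) -> List[Dict]:
    """Derive one string argument per slot entry of a file argument."""
    derived = []
    for slot, entries in argument.get("slots", {}).items():
        for entry in entries:
            derived.append({
                "name": entry["name"],
                # The derived argument holds the name of a slot entry, not the data.
                "type": "string",
                "default": entry.get("default"),
                "description": f"Name of the .{slot} entry: {entry.get('description', '')}",
            })
    return derived


# The --output argument from the example above, as parsed dicts.
output_argument = {
    "name": "--output",
    "direction": "output",
    "slots": {
        "obsm": [{"type": "double", "name": "--output_embedding",
                  "description": "The resulting PCA embedding.", "default": "X_pca"}],
        "varm": [{"type": "double", "name": "--output_loadings",
                  "description": "The PCA loadings matrix.", "default": "pca_loadings"}],
        "uns": [{"type": "double", "name": "--output_variance",
                 "description": "The PCA variance objects.", "default": "pca_variance"}],
    },
}

for arg in expand_slot_arguments(output_argument):
    print(arg["name"], "->", arg["default"])
```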

rcannood commented 1 year ago

Comment by @tverbeiren:

I think that makes sense, I'm just a bit worried about the possible confusion: slots versus arguments. They look very similar now.

A suggestion: make the mapping to an argument more explicit (albeit optional):

slots:
  obsm:
    description: The resulting PCA embedding.
    type: double
    maps_to_argument:
      name: --output_embedding
      required: true
      default: X_pca
...

The recurring arguments could also be handled using includes, no? Would that make sense?
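
One possible reading of the optional mapping, sketched in Python: only slot entries that carry `maps_to_argument` produce an argument, and the rest stay documentation-only. This assumes each slot holds a list of entries, as in the earlier comments; `arguments_from_mappings` and the extra `slot` key are hypothetical.

```python
from typing import Dict, List


def arguments_from_mappings(slots: Dict[str, List[dict]]) -> List[dict]:
    """Create arguments only for slot entries with an explicit maps_to_argument."""
    arguments = []
    for slot, entries in slots.items():
        for entry in entries:
            mapping = entry.get("maps_to_argument")
            if mapping is None:
                continue  # documentation-only entry: no argument is generated
            # Keep track of which slot the argument refers to (assumed extra field).
            arguments.append({"type": "string", "slot": slot, **mapping})
    return arguments


slots = {
    "obsm": [{
        "type": "double",
        "description": "The resulting PCA embedding.",
        "maps_to_argument": {"name": "--output_embedding", "required": True, "default": "X_pca"},
    }],
    "uns": [{"type": "double", "description": "The PCA variance objects."}],
}
print(arguments_from_mappings(slots))
```

Keeping the mapping in its own block makes it visible at a glance which slots also act as arguments, which addresses the slots-versus-arguments confusion at the cost of one more level of nesting.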