openproblems-bio / openproblems-v2

Formalizing and benchmarking open problems in single-cell genomics
MIT License

Revising open problem task structure #292

Open mumichae opened 7 months ago

mumichae commented 7 months ago

Is your feature request related to a problem? Please describe.

After looking over the task structure again, I feel it might be restrictive to one type of task. For supervised problems, a train/test (or dataset/solution) split works well as an abstraction of the problem, but for unsupervised tasks it might not fit as nicely. Good examples are the batch integration and spatial decomposition tasks, where a metric (for batch integration) or a method (for spatial decomposition) might need two different inputs that don't quite fit the train/test paradigm.

The current structure:

graph LR
  common_dataset[Common<br/>dataset]:::anndata
  subgraph task_specific[Task-specific workflow]
    dataset_processor[/Dataset<br/>processor/]:::component
    solution[Solution]:::anndata
    masked_data[Dataset]:::anndata
    method[/Method/]:::component
    control_method[/Control<br/>method/]:::component
    output[Output]:::anndata
    metric[/Metric/]:::component
    score[Score]:::anndata
  end
  common_dataset --- dataset_processor --> masked_data & solution
  masked_data --- method --> output
  masked_data & solution --- control_method --> output
  solution & output --- metric --> score

Describe the solution you'd like

A more flexible paradigm that would make it easier for users to conceptualize their workflow.

Describe alternatives you've considered

A solution could be something closer to this:

graph LR
  common_dataset[Common<br/>dataset]:::anndata
  subgraph task_specific[Task-specific workflow]
    dataset_processor[/Dataset<br/>processor/]:::component
    solution[Solution optional]:::anndata
    masked_data[Task input 1, ..., N]:::anndata
    method[/Method/]:::component
    control_method[/Control<br/>method/]:::component
    output[Output]:::anndata
    metric[/Metric/]:::component
    score[Score]:::anndata
  end
  common_dataset --- dataset_processor --> masked_data & solution
  masked_data --- method --> output
  masked_data & solution --- control_method --> output
  solution & output --- metric --> score

Alternatively, we could show multiple task workflows depending on the type of task at hand (supervised, unsupervised, multiple inputs etc.) to allow for different task setups.
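
To make the "Task input 1, ..., N" plus optional solution idea a bit more concrete, here is a rough sketch of how such a task description could be written down. This is not existing openproblems code: `TaskSpec`, its fields, and the example task names are all made up for illustration, and it only assumes plain Python dataclasses.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskSpec:
    """Hypothetical description of the files one task passes around."""

    name: str
    # Files written by the dataset processor and read by methods.
    method_inputs: list[str]
    # Files read by metrics in addition to the method output.
    metric_inputs: list[str] = field(default_factory=list)
    # Ground truth hidden from methods; None for tasks without one.
    solution: Optional[str] = None

    def processor_outputs(self) -> list[str]:
        """All files the dataset processor has to write, without duplicates."""
        files = list(self.method_inputs)
        for name in self.metric_inputs:
            if name not in files:
                files.append(name)
        if self.solution is not None and self.solution not in files:
            files.append(self.solution)
        return files

    def leaks_solution(self) -> bool:
        """True if methods would be able to read the solution."""
        return self.solution is not None and self.solution in self.method_inputs


# Supervised-style task: methods see the dataset, metrics see the solution.
supervised = TaskSpec("supervised_example", ["dataset"], ["solution"], solution="solution")
# Batch-integration-style task: no solution, metrics read the dataset again.
batch_integration = TaskSpec("batch_integration_example", ["dataset"], ["dataset"])
# Spatial-decomposition-style task: two method inputs, one reused by metrics.
spatial = TaskSpec("spatial_example", ["reference", "spatial"], ["reference"])

for spec in (supervised, batch_integration, spatial):
    print(spec.name, spec.processor_outputs(), spec.leaks_solution())
```

With a description like this, the supervised, single-input and multi-input setups are just different values of the same structure rather than separate workflow types.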

Additional context

For the spatial decomposition task, both the method and the metrics require two anndata inputs (one of which we can consider to be the solution). The reference matrix (aka "solution") is used by both the methods and the metrics, which doesn't quite fit the paradigm that the solution should not be seen by the method. Additionally, in the original task there are multiple versions of the reference matrix, which could arguably be considered a separate dataset, part of the data processing, or part of the method. An example workflow would look like this:

graph LR
  common_dataset[Common<br/>dataset]:::anndata
  subgraph task_specific[Task-specific workflow]
    dataset_processor[/Dataset<br/>processor/]:::component
    task_input_1[Reference matrix]:::anndata
    task_input_2[Spatial matrix]:::anndata
    method[/Method/]:::component
    control_method[/Control<br/>method/]:::component
    output[Output]:::anndata
    metric[/Metric/]:::component
    score[Score]:::anndata
  end
  common_dataset --- dataset_processor --> task_input_1 & task_input_2
  task_input_1 & task_input_2 --- method --> output
  task_input_1 & task_input_2 --- control_method --> output
  task_input_1 & output --- metric --> score

Or separating the solution and the second input:

graph LR
  common_dataset[Common<br/>dataset]:::anndata
  subgraph task_specific[Task-specific workflow]
    dataset_processor[/Dataset<br/>processor/]:::component
    task_input_1[Reference matrix]:::anndata
    task_input_2[Spatial matrix]:::anndata
    solution[Solution]:::anndata
    method[/Method/]:::component
    control_method[/Control<br/>method/]:::component
    output[Output]:::anndata
    metric[/Metric/]:::component
    score[Score]:::anndata
  end
  common_dataset --- dataset_processor --> task_input_1 & task_input_2 & solution
  task_input_1 & task_input_2 --- method --> output
  task_input_1 & task_input_2 & solution --- control_method --> output
  solution & output --- metric --> score
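
As a rough sketch of the second variant (solution separated from the two inputs), the components could have signatures like the ones below. Everything here is an assumption for illustration: the function names, the toy nested-list matrices, the choice of true cell-type proportions as the solution, and mean absolute error as the metric. Real components would read and write anndata files.

```python
from typing import List

Matrix = List[List[float]]


def uniform_method(reference: Matrix, spatial: Matrix) -> Matrix:
    """Toy method: give every spot equal proportions of every cell type.

    It only receives the two task inputs, never the solution.
    """
    n_cell_types = len(reference)
    return [[1.0 / n_cell_types] * n_cell_types for _ in spatial]


def true_proportions_control(reference: Matrix, spatial: Matrix, solution: Matrix) -> Matrix:
    """Positive control: allowed to read the solution and simply returns it."""
    return [list(row) for row in solution]


def mean_absolute_error(solution: Matrix, output: Matrix) -> float:
    """Toy metric comparing predicted and true proportions."""
    diffs = [
        abs(s - o)
        for s_row, o_row in zip(solution, output)
        for s, o in zip(s_row, o_row)
    ]
    return sum(diffs) / len(diffs)


# Toy data (assumed shapes).
reference = [[5.0, 0.0], [0.0, 5.0]]  # 2 cell types x 2 genes
spatial = [[5.0, 0.0], [2.5, 2.5]]    # 2 spots x 2 genes
solution = [[1.0, 0.0], [0.5, 0.5]]   # 2 spots x 2 cell types

method_score = mean_absolute_error(solution, uniform_method(reference, spatial))
control_score = mean_absolute_error(
    solution, true_proportions_control(reference, spatial, solution)
)
print(method_score, control_score)
```

In this layout the reference matrix is an ordinary method input, and only the control method and the metric ever touch the solution.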

For batch integration a more intuitive structure would be:

graph LR
  common_dataset[Common<br/>dataset]:::anndata
  subgraph task_specific[Task-specific workflow]
    dataset_processor[/Dataset<br/>processor/]:::component
    task_input_1[Dataset]:::anndata
    method[/Method/]:::component
    control_method[/Control<br/>method/]:::component
    output[Output]:::anndata
    metric[/Metric/]:::component
    score[Score]:::anndata
  end
  common_dataset --- dataset_processor --> task_input_1
  task_input_1 --- method --> output
  task_input_1 --- control_method --> output
  task_input_1 & output --- metric --> score
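
A small sketch of what this means for the component signatures: the metric takes the dataset and the method output, and there is no solution file at all. The toy metric below (fraction of cells whose nearest neighbour comes from another batch) and both toy methods are made up for illustration and are not metrics or methods from the actual task. The dict with `X` and `batch` keys stands in for an anndata object.

```python
import math


def identity_method(dataset):
    """Toy method: return the unintegrated values as the embedding."""
    return [list(cell) for cell in dataset["X"]]


def center_per_batch_method(dataset):
    """Toy method: subtract each batch's mean from its cells."""
    embedding = []
    for cell, batch in zip(dataset["X"], dataset["batch"]):
        members = [c for c, b in zip(dataset["X"], dataset["batch"]) if b == batch]
        means = [sum(values) / len(members) for values in zip(*members)]
        embedding.append([value - mean for value, mean in zip(cell, means)])
    return embedding


def batch_mixing_metric(dataset, output):
    """Toy metric: fraction of cells whose nearest neighbour is from another batch.

    The batch labels come from the dataset itself, so no solution is needed.
    """
    batches = dataset["batch"]
    mixed = 0
    for i, cell in enumerate(output):
        _, nearest = min(
            (math.dist(cell, other), j) for j, other in enumerate(output) if j != i
        )
        mixed += batches[nearest] != batches[i]
    return mixed / len(output)


dataset = {
    "X": [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]],
    "batch": ["a", "a", "b", "b"],
}
unintegrated_score = batch_mixing_metric(dataset, identity_method(dataset))
centered_score = batch_mixing_metric(dataset, center_per_batch_method(dataset))
print(unintegrated_score, centered_score)
```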

rcannood commented 7 months ago

For reference, the Batch Integration task would look like this:

flowchart LR
  file_common_dataset("Common Dataset")
  comp_process_dataset[/"Data processor"/]
  file_dataset("Dataset")
  comp_control_method_embedding[/"Control method (embedding)"/]
  comp_control_method_graaf[/"Control method (graph)"/]
  comp_method_embedding[/"Method (embedding)"/]
  comp_method_feature[/"Method (feature)"/]
  comp_method_graaf[/"Method (graph)"/]
  comp_metric_embedding[/"Metric (embedding)"/]
  comp_metric_feature[/"Metric (feature)"/]
  comp_metric_graaf[/"Metric (graph)"/]
  file_integrated_embedding("Integrated embedding")
  file_integrated_graaf("Integrated Graph")
  file_integrated_feature("Integrated Feature")
  file_score("Score")
  comp_transformer_embedding_to_graaf[/"Embedding to Graph"/]
  comp_transformer_feature_to_embedding[/"Feature to Embedding"/]
  file_common_dataset---comp_process_dataset
  comp_process_dataset-->file_dataset
  file_dataset---comp_control_method_embedding
  file_dataset---comp_control_method_graaf
  file_dataset---comp_method_embedding
  file_dataset---comp_method_feature
  file_dataset---comp_method_graaf
  file_dataset---comp_metric_embedding
  file_dataset---comp_metric_feature
  file_dataset---comp_metric_graaf
  comp_control_method_embedding-->file_integrated_embedding
  comp_control_method_graaf-->file_integrated_graaf
  comp_method_embedding-->file_integrated_embedding
  comp_method_feature-->file_integrated_feature
  comp_method_graaf-->file_integrated_graaf
  comp_metric_embedding-->file_score
  comp_metric_feature-->file_score
  comp_metric_graaf-->file_score
  file_integrated_embedding---comp_metric_embedding
  file_integrated_embedding---comp_transformer_embedding_to_graaf
  file_integrated_graaf---comp_metric_graaf
  file_integrated_feature---comp_metric_feature
  file_integrated_feature---comp_transformer_feature_to_embedding
  comp_transformer_embedding_to_graaf-->file_integrated_graaf
  comp_transformer_feature_to_embedding-->file_integrated_embedding

Instead of what is currently listed in the readme:

flowchart LR
  file_common_dataset("Common Dataset")
  comp_process_dataset[/"Data processor"/]
  file_dataset("Dataset")
  file_solution("Solution")
  comp_control_method_embedding[/"Control method (embedding)"/]
  comp_control_method_graaf[/"Control method (graph)"/]
  comp_method_embedding[/"Method (embedding)"/]
  comp_method_feature[/"Method (feature)"/]
  comp_method_graaf[/"Method (graph)"/]
  comp_metric_embedding[/"Metric (embedding)"/]
  comp_metric_feature[/"Metric (feature)"/]
  comp_metric_graaf[/"Metric (graph)"/]
  file_integrated_embedding("Integrated embedding")
  file_integrated_graaf("Integrated Graph")
  file_integrated_feature("Integrated Feature")
  file_score("Score")
  comp_transformer_embedding_to_graaf[/"Embedding to Graph"/]
  comp_transformer_feature_to_embedding[/"Feature to Embedding"/]
  file_common_dataset---comp_process_dataset
  comp_process_dataset-->file_dataset
  comp_process_dataset-->file_solution
  file_dataset---comp_control_method_embedding
  file_dataset---comp_control_method_graaf
  file_dataset---comp_method_embedding
  file_dataset---comp_method_feature
  file_dataset---comp_method_graaf
  file_solution---comp_metric_embedding
  file_solution---comp_metric_feature
  file_solution---comp_metric_graaf
  comp_control_method_embedding-->file_integrated_embedding
  comp_control_method_graaf-->file_integrated_graaf
  comp_method_embedding-->file_integrated_embedding
  comp_method_feature-->file_integrated_feature
  comp_method_graaf-->file_integrated_graaf
  comp_metric_embedding-->file_score
  comp_metric_feature-->file_score
  comp_metric_graaf-->file_score
  file_integrated_embedding---comp_metric_embedding
  file_integrated_embedding---comp_transformer_embedding_to_graaf
  file_integrated_graaf---comp_metric_graaf
  file_integrated_feature---comp_metric_feature
  file_integrated_feature---comp_transformer_feature_to_embedding
  comp_transformer_embedding_to_graaf-->file_integrated_graaf
  comp_transformer_feature_to_embedding-->file_integrated_embedding

In this specific case, the diagram doesn't look much simpler, even though the structure is: there is no separate solution file, and the metrics read the dataset directly.


As a side note, it would be nice if the subtasks were grouped like this:

flowchart LR
  file_common_dataset("Common Dataset")
  comp_process_dataset[/"Data processor"/]
  file_dataset("Dataset")
  subgraph feature[Feature]
    comp_method_feature[/"Method (feature)"/]
    comp_metric_feature[/"Metric (feature)"/]
    file_integrated_feature("Integrated Feature")
  end
  comp_transformer_feature_to_embedding[/"Feature to Embedding"/]
  subgraph embedding[Embedding]
    comp_control_method_embedding[/"Control method (embedding)"/]
    comp_method_embedding[/"Method (embedding)"/]
    comp_metric_embedding[/"Metric (embedding)"/]
    file_integrated_embedding("Integrated embedding")
  end
  comp_transformer_embedding_to_graaf[/"Embedding to Graph"/]
  subgraph graaf[Graph]
    comp_control_method_graaf[/"Control method (graph)"/]
    comp_method_graaf[/"Method (graph)"/]
    comp_metric_graaf[/"Metric (graph)"/]
    file_integrated_graaf("Integrated Graph")
  end
  file_score("Score")
  file_common_dataset---comp_process_dataset
  comp_process_dataset-->file_dataset
  file_dataset---comp_control_method_embedding
  file_dataset---comp_control_method_graaf
  file_dataset---comp_method_embedding
  file_dataset---comp_method_feature
  file_dataset---comp_method_graaf
  file_dataset---comp_metric_embedding
  file_dataset---comp_metric_feature
  file_dataset---comp_metric_graaf
  comp_control_method_embedding-->file_integrated_embedding
  comp_control_method_graaf-->file_integrated_graaf
  comp_method_embedding-->file_integrated_embedding
  comp_method_feature-->file_integrated_feature
  comp_method_graaf-->file_integrated_graaf
  comp_metric_embedding-->file_score
  comp_metric_feature-->file_score
  comp_metric_graaf-->file_score
  file_integrated_embedding---comp_metric_embedding
  file_integrated_embedding---comp_transformer_embedding_to_graaf
  file_integrated_graaf---comp_metric_graaf
  file_integrated_feature---comp_metric_feature
  file_integrated_feature---comp_transformer_feature_to_embedding
  comp_transformer_embedding_to_graaf-->file_integrated_graaf
  comp_transformer_feature_to_embedding-->file_integrated_embedding
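
One consequence of the two transformers in these diagrams is that a method's output can be scored by the metrics of every subtask further down the chain. A minimal sketch of that lookup, where the dictionary and function name are made up and only the two edges (feature to embedding, embedding to graph) come from the diagrams:

```python
# Edges taken from the diagrams: each transformer converts one output type
# into the next one.
TRANSFORMERS = {"feature": "embedding", "embedding": "graph"}


def reachable_output_types(output_type: str) -> list:
    """Output types (and hence metric groups) available for a method."""
    types = [output_type]
    while types[-1] in TRANSFORMERS:
        types.append(TRANSFORMERS[types[-1]])
    return types


for method_type in ("feature", "embedding", "graph"):
    print(method_type, "->", reachable_output_types(method_type))
```

So a feature-level method can be evaluated with feature, embedding and graph metrics, while a graph-level method can only be evaluated with graph metrics.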