Use labels for Galaxy workflow outputs

seek4science / seek

For finding, sharing and exchanging Data, Models, Simulations and Processes in Science.

http://www.seek4science.org

BSD 3-Clause "New" or "Revised" License

Use labels for Galaxy workflow outputs #904

Closed simleo closed 2 years ago

simleo commented 2 years ago

See this suggestion by @wm75: https://github.com/galaxyproject/iwc/issues/96

fbacall commented 2 years ago

Currently we're picking outputs from each step rather than workflow_outputs.

Is there a description of what workflow_outputs is, and whether it will always be present in a .ga file?

wm75 commented 2 years ago

@fbacall Only workflow_outputs is what the name suggests. outputs is what individual tools produce, but the workflow uses workflow_outputs to select what's considered relevant in its context. So for any workflow-centric approach, workflow_outputs is what should be used, imo. Would you agree, @mvdbeek?

mvdbeek commented 2 years ago

Right, those are the top-level outputs: they're available in reports and as inputs to other subworkflow steps, they're highlighted in the UI, etc.

wm75 commented 2 years ago

> Is there a description of what workflow_outputs is, and whether it will always be present in a .ga file?

Not sure about the first part, but each step in a .ga file will always have workflow_outputs. Its value might be an empty list though if that step does not have any of its outputs marked as workflow_outputs by the creator of the WF.
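To illustrate the point about every step carrying a workflow_outputs list, here is a minimal Python sketch that walks the steps of a parsed .ga document and gathers the entries. The function name `collect_workflow_outputs` and the two-step example workflow are hypothetical, and the example is trimmed to the keys discussed in this thread.

```python
import json


def collect_workflow_outputs(workflow):
    """Return every workflow_outputs entry, tagged with the id of its step."""
    collected = []
    for step_id, step in workflow.get("steps", {}).items():
        # An empty list means no output of this step was marked as a
        # workflow output; "or []" also tolerates a missing or null key.
        for wf_output in step.get("workflow_outputs") or []:
            collected.append({"step_id": step_id, **wf_output})
    return collected


# Hypothetical workflow, not taken from a real .ga file.
example = json.loads("""
{
  "steps": {
    "0": {"id": 0, "type": "data_input", "outputs": [], "workflow_outputs": []},
    "1": {"id": 1, "type": "tool",
          "outputs": [{"name": "outfile", "type": "tabular"}],
          "workflow_outputs": [{"label": "cleaned_header", "output_name": "outfile"}]}
  }
}
""")

found = collect_workflow_outputs(example)
print(found)
```

Steps whose list is empty simply contribute nothing, so no special case is needed for them.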

fbacall commented 2 years ago

What does the output_name of a workflow_output refer to?

wm75 commented 2 years ago

That's just a reference to the step's regular output, i.e. it says: the tool output with this name should become a workflow output.

wm75 commented 2 years ago

I think it's ok if that's parsed/displayed as the Name of the output in workflowhub since there isn't any readily available better alternative.

fbacall commented 2 years ago

OK so label will be the "ID", output_name will be the "Name". Do workflow_outputs ever have a type field like outputs do?

wm75 commented 2 years ago

No, that would just be redundant with the info in outputs.
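The mapping discussed here (label as the "ID", output_name as the "Name", type taken from the step's outputs) could be sketched in Python roughly as follows. The function name `describe_workflow_outputs`, the row layout, and the example step with its "tabular" type are assumptions made for illustration, not SEEK's actual implementation.

```python
def describe_workflow_outputs(step):
    """Build ID/Name/Type rows for one step's workflow outputs."""
    # Index the step's regular outputs by name so types can be looked up.
    types_by_name = {
        out["name"]: out.get("type") for out in step.get("outputs", [])
    }
    rows = []
    for wf_output in step.get("workflow_outputs", []):
        name = wf_output["output_name"]
        rows.append({
            "id": wf_output.get("label"),
            "name": name,
            # Stays None when outputs has no entry with this name.
            "type": types_by_name.get(name),
        })
    return rows


# Hypothetical step, trimmed to the relevant keys.
step = {
    "outputs": [{"name": "outfile", "type": "tabular"}],
    "workflow_outputs": [{"label": "cleaned_header", "output_name": "outfile"}],
}
rows = describe_workflow_outputs(step)
print(rows)
```

Using a dictionary lookup with `.get` means a workflow output without a matching entry under outputs yields a row with an unknown type rather than an error.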

fbacall commented 2 years ago

OK I'm confused then, because it seems sometimes there is no matching entry under outputs for an entry in workflow_outputs, e.g.

{
  "annotation": "Allele Frequency Filter. This is the minimum allele frequency required for variants to be included in the reports.",
  "content_id": null,
  "errors": null,
  "id": 0,
  "input_connections": {},
  "inputs": [
    {
      "description": "Allele Frequency Filter. This is the minimum allele frequency required for variants to be included in the reports.",
      "name": "AF Filter"
    }
  ],
  "label": "AF Filter",
  "name": "Input parameter",
  "outputs": [],
  "position": {
    "bottom": 414.7942708333333,
    "height": 46.3359375,
    "left": -421.578125,
    "right": -271.578125,
    "top": 368.4583333333333,
    "width": 150,
    "x": -421.578125,
    "y": 368.4583333333333
  },
  "tool_id": null,
  "tool_state": "{\"default\": 0.05, \"parameter_type\": \"float\", \"optional\": true}",
  "tool_version": null,
  "type": "parameter_input",
  "uuid": "2e5a5b38-c204-45a2-98e2-e113bce5a14b",
  "workflow_outputs": [
    {
      "label": null,
      "output_name": "output",
      "uuid": "e55d47c5-7ea0-45c4-8844-47bf29b88542"
    }
  ]
}

wm75 commented 2 years ago

Huh, a very good catch! This looks like a bug where a WF input (the step's type is parameter_input) has gotten turned into a WF output. Not sure how that has happened (maybe a Galaxy WF editor bug?), but it should be fixed in the iwc repo. One solution on your side would be to ignore steps with "type": "parameter_input" when looking for workflow_outputs.

mvdbeek commented 2 years ago

Not a bug, inputs are outputs. If you don't want to display them it is fine to skip them.
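Skipping such steps when listing workflow outputs might look like the Python sketch below. The names `SKIPPED_STEP_TYPES` and `displayable_workflow_outputs` are hypothetical, and the two example steps are trimmed versions of what appears in this thread (the AF Filter parameter and step 15 with its cleaned_header output).

```python
# Step types whose workflow_outputs should not be displayed (an assumption
# for this sketch; extend the set if other input types should be hidden).
SKIPPED_STEP_TYPES = {"parameter_input"}


def displayable_workflow_outputs(steps):
    """Return (label, output_name) pairs, ignoring skipped step types."""
    rows = []
    for step in steps.values():
        if step.get("type") in SKIPPED_STEP_TYPES:
            continue
        for wf_output in step.get("workflow_outputs", []):
            rows.append((wf_output.get("label"), wf_output["output_name"]))
    return rows


steps = {
    "0": {
        "type": "parameter_input",
        "label": "AF Filter",
        "outputs": [],
        "workflow_outputs": [{"label": None, "output_name": "output"}],
    },
    "15": {
        "type": "tool",
        "workflow_outputs": [{"label": "cleaned_header", "output_name": "outfile"}],
    },
}
visible = displayable_workflow_outputs(steps)
print(visible)
```

Since inputs are legitimately outputs too, this is purely a display decision; the filter can be dropped without affecting the rest of the parsing.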

fbacall commented 2 years ago

How does this look?

wm75 commented 2 years ago

@fbacall Great!

Another option you may want to consider in the future would be to convert the .ga file to gxformat2 before trying to parse any information.

cat variation-reporting.gxwf.yml

class: GalaxyWorkflow
doc: This workflow takes a VCF dataset of variants produced by any of the variant
  calling workflows in https://github.com/galaxyproject/iwc/tree/main/workflows/sars-cov-2-variant-calling
  and generates tabular lists of variants by Samples and by Variant, and an overview
  plot of variants and their allele-frequencies.
label: 'COVID-19: variation analysis reporting'
tags:
- COVID-19
- covid19.galaxyproject.org
uuid: b08c744d-7c61-4b58-ac5f-4b5886c3c643
inputs:
  AF Filter:
    default: 0.05
    doc: Allele Frequency Filter. This is the minimum allele frequency required for
      variants to be included in the reports.
    optional: true
    position:
      bottom: 414.7942708333333

...

outputs:
  _anonymous_output_1:
    outputSource: AF Filter
  _anonymous_output_2:
    outputSource: DP Filter
  _anonymous_output_3:
    outputSource: DP_ALT Filter
  _anonymous_output_4:
    outputSource: Number of Clusters
  prefiltered_variants:
    outputSource: '6'
  filtered_variants:
    outputSource: '9'
  filtered_extracted_variants:
    outputSource: '10'
  filtered_and_renamed_effects:
    outputSource: 11/outfile_replace
  af_recalculated:
    outputSource: 12/out_file1
  collapsed_effects:
    outputSource: 13/out_file
  highest_impact_effects:
    outputSource: 14/outfile
  cleaned_header:
    outputSource: 15/outfile
  processed_variants_collection:
    outputSource: 16/outfile
  all_variants_all_samples:
    outputSource: 20/outfile
  variants_for_plotting:
    outputSource: 35/list_output_tab

So the WF inputs and outputs are nicely declared up front in that case.

wm75 commented 2 years ago

Minor complications are:

- you don't have the output_name listed for steps with just a single tool output (like '6', '9', '10' in the above)
- it's actually harder to recognize the WF outputs that are in fact input params (in the above it's easy because they are lacking labels - which they shouldn't - so they are the _anonymous_outputs)
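Both complications can be seen in a small Python sketch that parses outputSource values like the ones in the gxformat2 listing above. The helper names `split_output_source` and `is_anonymous` are hypothetical, and splitting on the first "/" is an assumption that would misfire if a step or input label itself contained a slash.

```python
def split_output_source(output_source):
    """Split 'step/output_name' into its parts; output_name is None if absent."""
    step, sep, output_name = output_source.partition("/")
    return step, (output_name if sep else None)


def is_anonymous(output_id):
    """True for unlabeled outputs, which get the _anonymous_output_ prefix."""
    return output_id.startswith("_anonymous_output_")


# A few entries copied from the outputs section shown earlier.
outputs = {
    "_anonymous_output_1": {"outputSource": "AF Filter"},
    "prefiltered_variants": {"outputSource": "6"},
    "cleaned_header": {"outputSource": "15/outfile"},
}

parsed = {
    output_id: split_output_source(entry["outputSource"])
    for output_id, entry in outputs.items()
    if not is_anonymous(output_id)
}
print(parsed)
```

The prefix check only catches input parameters that happen to be unlabeled. If they were given labels, they would have to be recognized some other way, for example by comparing the outputSource against the keys of the inputs section.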

wm75 commented 2 years ago

One more related thing I just spotted on workflowhub:

[Screenshot from 2022-02-18 13-19-51]

In the steps section you're repeating the Input Params. So like for the workflow outputs, you would probably want to ignore steps with "type": "parameter_input".
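For the steps listing, the same idea could be sketched in Python as below. The function name `tool_steps` is hypothetical. The set also includes "data_input" and "data_collection_input", which is an assumption that goes beyond the "parameter_input" type named in this thread, on the reasoning that dataset inputs are likewise already shown under Inputs.

```python
# Step types treated as workflow inputs (assumption: only "parameter_input"
# is named in the thread; the two dataset input types are added here).
INPUT_STEP_TYPES = {"data_input", "data_collection_input", "parameter_input"}


def tool_steps(steps):
    """Return (id, name, tool_id) rows for non-input steps, ordered by id."""
    rows = []
    for step_id, step in sorted(steps.items(), key=lambda item: int(item[0])):
        if step.get("type") in INPUT_STEP_TYPES:
            continue
        rows.append((step_id, step.get("name"), step.get("tool_id")))
    return rows


# Trimmed example steps; keys are the string ids used in a .ga file.
steps = {
    "22": {"type": "tool", "name": "Filter", "tool_id": "Filter1"},
    "0": {"type": "parameter_input", "name": "Input parameter", "tool_id": None},
    "6": {
        "type": "tool",
        "name": "SnpSift Filter",
        "tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/snpsift/snpSift_filter/4.3+t.galaxy1",
    },
}
listing = tool_steps(steps)
print(listing)
```

Sorting on the integer value of the key keeps step 6 ahead of step 22, which a plain string sort would not.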

fbacall commented 2 years ago

Trying a different approach of converting native > gxformat2 > CWL, parsing that, then supplementing the steps because otherwise they're pretty bare:

Inputs

ID	Name	Description	Type
AF Filter	n/a	Allele Frequency Filter. This is the minimum allele frequency required for variants to be included in the reports.	float
DP Filter	n/a	Depth Filter. This is the minimum depth of all alignments at a variant site.	int
DP_ALT Filter	n/a	Depth Filter for variant allele. This is the minimum depth of alignments supporting a variant.	int
Number of Clusters	n/a	Number of Clusters to use in Variant Frequency Plot.	int
Variation data to report	n/a	Variation data in VCF format. Can be the output of any of the workflows in https://github.com/galaxyproject/iwc/tree/main/workflows/sars-cov-2-variant-calling	array containing File
gene products translations	n/a	A custom tabular file mapping NCBI RefSeq Protein identifiers as used by snpEff version 4.5covid19 to their commonly used names. Can be obtained from https://doi.org/10.5281/zenodo.4555734	File

Steps

ID	Name	Description
6	SnpSift Filter	toolshed.g2.bx.psu.edu/repos/iuc/snpsift/snpSift_filter/4.3+t.galaxy1
7	Compose text parameter value	toolshed.g2.bx.psu.edu/repos/iuc/compose_text_param/compose_text_param/0.1.1
8	Compose text parameter value	toolshed.g2.bx.psu.edu/repos/iuc/compose_text_param/compose_text_param/0.1.1
9	SnpSift Filter	toolshed.g2.bx.psu.edu/repos/iuc/snpsift/snpSift_filter/4.3+t.galaxy1
10	SnpSift Extract Fields	toolshed.g2.bx.psu.edu/repos/iuc/snpsift/snpSift_extractFields/4.3+t.galaxy0
11	Replace column	toolshed.g2.bx.psu.edu/repos/bgruening/replace_column_by_key_value_file/replace_column_with_key_value_file/0.2
12	Compute	toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/1.6
13	Datamash	toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.1.0
14	Replace	toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3
15	Replace	toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3
16	Replace	toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3
17	Collapse Collection	toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/5.1.0
18	Compute	toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/1.6
19	Compute	toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/1.6
20	Replace	toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3
21	Datamash	toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.1.0
22	Filter	Filter1
23	Datamash	toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.1.0
24	Join	toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_easyjoin_tool/1.1.2
25	Datamash	toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.1.0
26	Datamash	toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.1.0
27	Datamash	toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.1.0
28	Join	toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_easyjoin_tool/1.1.2
29	Join	toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_easyjoin_tool/1.1.2
30	Cut	Cut1
31	Join	toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_easyjoin_tool/1.1.2
32	Cut	Cut1
33	Replace	toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3
34	Cut	Cut1
35	Split file	toolshed.g2.bx.psu.edu/repos/bgruening/split_file_to_collection/split_file_to_collection/0.5.0
36	Variant Frequency Plot	toolshed.g2.bx.psu.edu/repos/iuc/snpfreqplot/snpfreqplot/1.0+galaxy3
37	Sort	toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_sort_header_tool/1.1.1
38	Sort	toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_sort_header_tool/1.1.1

Outputs

ID	Name	Description	Type
_anonymous_output_1	n/a	n/a	File
_anonymous_output_2	n/a	n/a	File
_anonymous_output_3	n/a	n/a	File
_anonymous_output_4	n/a	n/a	File
af_recalculated	n/a	n/a	File
all_variants_all_samples	n/a	n/a	File
by_variant_report	n/a	n/a	File
cleaned_header	n/a	n/a	File
collapsed_effects	n/a	n/a	File
combined_variant_report	n/a	n/a	File
filtered_and_renamed_effects	n/a	n/a	File
filtered_extracted_variants	n/a	n/a	File
filtered_variants	n/a	n/a	File
highest_impact_effects	n/a	n/a	File
prefiltered_variants	n/a	n/a	File
processed_variants_collection	n/a	n/a	File
variant_frequency_plot	n/a	n/a	File
variants_for_plotting	n/a	n/a	File