nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

task.index does not match the task.id when accessed in resourceLabels initialization closure #4455

Open dougnukem opened 10 months ago

dougnukem commented 10 months ago

Bug report

I'm trying to add resourceLabels to each process that record a task ID, so that I can associate each labeled resource with a particular workflow task/process in my trace log / execution.

nextflow.config


def formatResourceLabelValue(value) {
  // Filter map to only be valid Google Cloud label values
  // https://cloud.google.com/compute/docs/labeling-resources#requirements
  // Keys and values can contain only lowercase letters, numeric characters, underscores, and dashes. 
  // All characters must use UTF-8 encoding, and international characters are allowed.
  // Values can be empty, and have a maximum length of 63 characters.
  value = value.toString()
  value = value.toLowerCase()
  value = value.replaceAll(~/:/, "_")
  // remove anything that is not a lowercase letter, digit, dash, or underscore
  value = value.replaceAll(~/[^a-z0-9\-_]/ , "")
  // truncate to 63 characters
  return value.take(63)
}

process {
  // apply to ALL processes/jobs in the workflow
  resourceLabels = {
    def taskResourceLabels = [:]

    // Add per task details like name, id, index
    taskResourceLabels["nf-workflow-process"] = formatResourceLabelValue(task.process)
    taskResourceLabels["nf-workflow-task-index"] = formatResourceLabelValue(task.index)
    taskResourceLabels["nf-workflow-task-hash"] = formatResourceLabelValue(task.hash)

    // task.id is null in this context?
    // println("task.id: " + task.id)
    // println("task.hash: " + task.hash)

    return taskResourceLabels
  }
}

When this is used in a process.resourceLabels closure (e.g. to tag cloud resources with the workflow task index/id), it appears that task.index is set to the index of the process being run, NOT the task_id.

The documentation, under "Process implicit variables", states:

The following variables are implicitly defined in the task object of each process:

...

hash The task unique hash ID

index The task index (corresponds to task_id in the execution trace)

But it appears the task.index corresponds to the index of the task that's being run (it happens to be the same ID if the workflow contains only sequential steps).

It also appears that task.id is null in this context?

e.g.

trace.txt

task_id hash    native_id   name    status  exit    submit  duration    realtime    %cpu    peak_rss    peak_vmem   rchar   wchar
4   f7/722432   17158399944915787383    processA (2)    COMPLETED   0   2023-10-29 15:01:17.990 3m 25s  1.2s    1290.5% 9 MB    16.9 MB 955.4 MB    359.4 MB

The labels set on the task / process are as follows:

nf-workflow-process=processA
nf-workflow-task-index=2
nf-workflow-task-hash=f7722432ee497744632e0ee2c234310c

Expected behavior and actual behavior

Expectation is that either:

task.hash could be used, but the hash shown in the trace table appears to be truncated (is there a way to make the trace output the full hash?).

bentsherman commented 10 months ago

Task index and id are not the same: the index is the order within the process, whereas the id is the order within the entire pipeline.

Also, the task hash will be null because it hasn't been computed yet.

lishengting commented 8 months ago

> Task index and id are not the same, the index is the order within the process whereas the id is the order within the entire pipeline
>
> Also, the task hash will be null because it hasn't been computed yet

@bentsherman hi, what do you mean by "within the process"? There should be only one index for each process, right?

bentsherman commented 8 months ago

In other words, every task belongs to the pipeline run but also to a particular process. If you have two processes FOO and BAR and they each generate one task, and let's say the FOO task is generated first, both tasks will have task.index of 1 because they were the first tasks generated within their process. But the FOO task will have an id of 1 and the BAR task an id of 2, because the FOO task was generated first across the entire pipeline run.
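
For illustration, the FOO/BAR example above can be mimicked with two counters in plain Groovy. This is only a sketch of the counting scheme as described in this thread, not Nextflow's actual implementation; `TaskCounter`, `nextId`, and `nextIndex` are hypothetical names invented for the example.

```groovy
// Illustrative simulation only; this is NOT Nextflow's implementation.
// TaskCounter, nextId and nextIndex are hypothetical names.
class TaskCounter {
    int nextId = 0                        // one counter shared by the whole pipeline run
    Map<String, Integer> nextIndex = [:]  // one counter per process name

    Map create(String process) {
        nextId += 1
        nextIndex[process] = (nextIndex[process] ?: 0) + 1
        return [process: process, id: nextId, index: nextIndex[process]]
    }
}

def counter = new TaskCounter()

// FOO generates one task first, then BAR generates two tasks.
def tasks = ['FOO', 'BAR', 'BAR'].collect { counter.create(it) }

tasks.each { println "${it.process}: id=${it.id} index=${it.index}" }
```

Here the FOO task and the first BAR task both get index 1 but ids 1 and 2, and the second BAR task gets index 2 with id 3. That is the same kind of mismatch as `processA (2)` appearing with task_id 4 in the trace above.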