nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.72k stars 624 forks source link

Using custom objects with paths #2085

Open nh13 opened 3 years ago

nh13 commented 3 years ago

Now that we can pass around custom (value/metadata) classes and have them cached, it would be wonderful if Nextflow could recognize Path members/properties so they are localized and available for the process that uses the class. For example:

import nextflow.io.ValueObject

@ValueObject
class SampleData {
    /** The sample name */
    String name
    /** The path to the read one FASTQ */
    Path r1
    /** The path to the read two FASTQ */
    Path r2
}

There could be a way annotate r1 and r2 (or return the list of paths etc) for nextflow to localize them.

mjhipp commented 3 years ago

I wonder if it would be possible by using a different type as a member of the ValueObject? I am not very familiar with the source code, so I assume this wouldn't actually work, but something like:

import nextflow.io.ValueObject
import nextflow.processor.TaskPath

@ValueObject
class SampleData {
    /** The sample name */
    String name
    /** The path to the read one FASTQ */
    TaskPath r1
    /** The path to the read two FASTQ */
    TaskPath r2
}

Or maybe nextflow.file.FileHolder?

pditommaso commented 3 years ago

Would just String be enough for paths? internal classes are not meant to be used in pipeline scripts.

nh13 commented 3 years ago

I'd love to have things properly typed (eg. Path).

pditommaso commented 3 years ago

In principle Path should work as well. What's the problem with it?

nh13 commented 3 years ago

Does nextflow inspect the object for members of type Path and localize already?

pditommaso commented 3 years ago

What do you mean for localize?

nh13 commented 3 years ago

If I pass in a SampleData class in the script, the r1

process foo {
  input:
     val sample // Type: SampleData
     path fasta
  output:
    ???
  script:
    """
    bwa mem ${fasta} ${sample.r1} ${sample.r2}
    """
}
mjhipp commented 3 years ago

Files being localized from the working directory to the analysis node (mainly for cloud workflows)

pditommaso commented 3 years ago

we are entering in a minefield .. 😄 no, the file staging (ie localisation) is determined by the path declaration. The pattern you are suggesting is interesting but not supported.

ryankelley-lm commented 3 years ago

There's a variety of situations where this capability would make it less error prone to deal with collections of genomic files, including VCF (+ index), BAM (+ index), FASTA (+ index). In the above example, you could imagine additional behaviors such as "is_paired" to allow scripts to handle single- or paired-end data in a more expressive way. Is this something that would completely break the process handling design or is there an opportunity for a user contribution in this area?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

nh13 commented 2 years ago

Still relevant, here’s what I do:

import groovy.transform.Immutable
import nextflow.io.ValueObject
import nextflow.util.KryoHelper

@ValueObject
@Immutable(copyWith=true, knownImmutables = ['path'])
class Metadata {
    static { 
        // Register this class with the Kryo framework that serializes and deserializes objects
        // that pass through channles. This allows for caching when this object is used.
        KryoHelper.register(Metadata)
    }

    Path path
}
edmundmiller commented 2 years ago

@nh13 Do you have an example of a pipeline you've used that in?

nh13 commented 2 years ago

@Emiller88 not in one I can share publicly, happy to help with a small hello world example.

edmundmiller commented 2 years ago

@Emiller88 not in one I can share publicly, happy to help with a small hello world example.

I figured :disappointed:. Not sure if that was you offering to write it or answer questions.

I'm guessing you toss the KryoHelper class in a groovy file in lib/. I guess I'm stuck on the output: ??? piece as well?

nh13 commented 2 years ago

I was offering, but it’s not high on my list.

ryankelley-lm commented 2 years ago

@nh13 , I was hoping that a fully worked example wouldn't be necessarily be needed if it is just a question of how to structure the "output" line in the process definition such that the custom class will be appropriately instantiated by the framework, but perhaps that is wishful thinking.

Also, one additional request for clarification, are you saying in the approach that you use above that process results are appropriately cached and any suitably annotated paths in the custom class are also localized?

nh13 commented 2 years ago
  1. KryoHelper is part of nextflow, so no need to include it other than the import at the top of the custom class groovy file (yes, put the custom groovy file in /lib)
  2. Something like this should work:
    input:
    tuple val(meta), path(in_bam)

    output:
    tuple(val(meta), path("${meta.name()}.out.txt")         , emit: txt)
edmundmiller commented 2 years ago

@nh13 Okay I threw together an example

N E X T F L O W  ~  version 22.04.0
Launching `main.nf` [elated_bell] DSL2 - revision: 4a32a3cac6
BUG! exception in phase 'semantic analysis' in source unit 'Script_097d04d3' The lookup for CustomObject caused a failed compilation. There should not have been any compilation from this call.
ryankelley-lm commented 2 years ago

@Emiller88 , I suspect compilation of your custom class is failing on resolution of the "Path" type (needs to be imported?). Maybe can strip down example to use native types as a first pass.

edmundmiller commented 2 years ago

@Emiller88 , I suspect compilation of your custom class is failing on resolution of the "Path" type (needs to be imported?). Maybe can strip down example to use native types as a first pass.

This was it. I've pushed up a working test, but the class just replaces the map, so it's not really doing much.

ryankelley-lm commented 2 years ago

@nh13 , I'm assuming that in your previous example, the described workaround only allows for definition of the custom class, but not any associated localization of the files? If that is the case, I tried an alternate approach that would require some modification to the underlying Nextflow framework. Briefly, the custom class must be treated as a path input and provides a collection of associated paths that must be staged. While this addresses the file staging need, none of the custom attributes are accessible. For that, there is an additional interface generated, CustromFileAttributesProvider that the custom class can implement. If available, this interface method is used to define additional properties on the object exposed during script generation.

The changes to Nextflow are summarized in a PR here, https://github.com/ryankelley-lm/nextflow/pull/1, with an associated example leveraging the interface at https://github.com/ryankelley-lm/hello-custom-objects-1/tree/custom-object-download

but perhaps you were able to achieve this in a simpler way?

@pditommaso , the PR above is a rough first pass, but do you think the general concept is something that could be reworked into an acceptable PR?

CC @Emiller88

nh13 commented 2 years ago

@ryankelley-lm if you use a path it will work just fine for local paths. I’m not sure how or if remote files will work.

pditommaso commented 2 years ago

I think to handle this properly it's required to introduce a new input/output qualifier record as an alternative to tuple.

A tentative syntax could be as shown below:

@ValueObject
class SampleData {
    String name
    Path r1
    Path r2
}

process foo {
  output: 
  record SampleData(name: 'some_name', r1: '*_1.bam', r2: '*_2.bam')

  '''
  some_command 
  '''
}

process bar {
  input: 
  record data: SampleData()

  """
  other_command --sample $data.name --reads $data.r1 $data.r2
  """  
}

workflow {
  foo | bar
}

The idea is to continue to have tuple-like semantics both for the input and output declaration to have the ability to use custom record types

ryankelley-lm commented 2 years ago

@pditommaso , Thanks for the feedback. In the above, would "record" be an alternative to "tuple" or "val"? When you say that "record" would be an alternative to "tuple", do you imagine that you would be packaging multiple different custom object instances into a single "record" line, in the same way that multiple instances are packaged into a "tuple"? Or is a each "record" a single object instance, like with a "val"?

At some point, would you still require some contract to be implemented by the custom class, such that the framework knows what paths the custom object is dealing with?

ryankelley-lm commented 2 years ago

@pditommaso , or is the idea that a "record" is essentially a named tuple, such that the framework knows each of the data members and for path, can handle them appropriately?

ryankelley-lm commented 2 years ago

FWIW, I believe most of the desired behavior here is achievable with the current release of Nextflow if you

The idea is that Nextflow will see your custom object as a tuple and treat it accordingly. In your own scripts, you can use the specific behaviors of your custom object. Since Nextflow knows how to stage tuple appropriately, all of the staging will be handled correctly. This idea is captured in the updated example here, https://github.com/ryankelley-lm/hello-custom-objects-1/tree/custom-object-download

nextflow.enable.dsl = 2

process STEPONE_PROCESS {
    input:
    tuple val(myname), path(r1), path(r2)

    output:
    tuple val("${myname}"), path("output1.txt"), path("output2.txt"), emit: paths

    script:
    """
    cat ${r1} ${r2} >> output1.txt
    cat ${r2} ${r1} >> output2.txt
    """
}

workflow STEPONE_WORKFLOW {
    take: workflowInput
    main:
        processResult = STEPONE_PROCESS(workflowInput).map {
            name, path1, path2 ->
                return new CustomObject(name, path1, path2) 
        }
    emit:
        processResult
}
import groovy.transform.Immutable
import nextflow.io.ValueObject
import nextflow.util.KryoHelper
import java.nio.file.Path
import nextflow.util.ArrayBag
import java.util.List
import nextflow.file.FileHolder
import java.util.Map

class CustomObject extends ArrayBag {
    static {
        // Register this class with the Kryo framework that serializes and deserializes objects
        // that pass through channles. This allows for caching when this object is used.
        KryoHelper.register(CustomObject)
    }

    String name

    Path r1

    Path r2

    CustomObject(String name, Path r1, Path r2)
    {
        super()
        this.name = name
        this.r1 = r1
        this.r2 = r2
        this.target.add(name)
        this.target.add(r1)
        this.target.add(r2)
    }

    public String toString()
    {
        return this.name + " " + this.r1.toString() + " " + this.r2.toString()
    }
}