openwdl / wdl

Workflow Description Language - Specification and Implementations
https://www.openwdl.org/
BSD 3-Clause "New" or "Revised" License

Enable glob for input files #474

Open kirkgrubbs1 opened 3 years ago

kirkgrubbs1 commented 3 years ago

I have a program that wants me to input a fairly large list of files. So far I've been doing this by literally listing them out in array format (e.g. Array[File] files = ["a.fna", "b.fna", "c.fna", "d.fna", etc...]). I was thinking this would be much easier if I could use something like the output statement, e.g. Array[File] files = glob("*.fna").

I would also appreciate any suggestions on how to avoid literally listing out all elements for arrays of files. I'm new, so I'm guessing I'm missing something.

jdidion commented 3 years ago

Are you hard-coding the file array in your WDL, or passing it in as an input? I suggest doing the latter, and you can write a script that creates the input JSON for you.
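One way to sketch that script: a small Python helper that globs the files and writes a Cromwell-style inputs JSON. The input name `my_workflow.files` is hypothetical and must match the fully qualified name of the Array[File] input in your WDL.

```python
import glob
import json

# Hypothetical input name; replace with the fully qualified name
# of the Array[File] input declared in your workflow.
INPUT_NAME = "my_workflow.files"

def build_inputs(pattern):
    """Collect files matching a glob pattern into an inputs dict."""
    return {INPUT_NAME: sorted(glob.glob(pattern))}

if __name__ == "__main__":
    # Writes an inputs JSON that can be passed to the engine, e.g.
    # `cromwell run my_workflow.wdl --inputs inputs.json`.
    with open("inputs.json", "w") as fh:
        json.dump(build_inputs("*.fna"), fh, indent=2)
```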

Another option is to use the Directory type, but that requires using the development version of WDL, which isn't universally supported yet.
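For reference, a Directory input in the development spec looks roughly like this (task and input names are hypothetical; it requires an engine that supports `version development`):

```wdl
version development

task count_records {
  input {
    Directory fasta_dir  # hypothetical input name
  }
  command <<<
    # the engine localizes the whole directory; ~{fasta_dir} is its local path
    cat ~{fasta_dir}/*.fna | grep -c "^>"
  >>>
  output {
    Int n_records = read_int(stdout())
  }
}
```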

kirkgrubbs1 commented 3 years ago

Writing a script to create the input JSON is the route I'm currently going about, but I just thought it was a bit cumbersome.

I did see the Directory type, but haven't gotten to updating to the development version yet.

kirkgrubbs1 commented 3 years ago

I guess my main issue with this is that it seemingly makes the language inconsistent. A function for gathering outputs from the disk would logically seem to apply to inputs as well, so that you can gather inputs from the disk too.

I don't know, maybe it's just me. I also understand that it's rare for a program to require gathering different sets of thousands of files each run.

patmagee commented 3 years ago

You have to remember that globbing on disk during a task is a very different operation than globbing inputs from an arbitrary file system which may or may not be on disk. WDL was created first and foremost to be a distributed execution language for running workflows at scale in cloud environments. While it also transfers easily to a local HPC cluster or computer, the language semantics largely reflect the cloud focus of its design.

There are situations where globbing input files would simply not work. For example, if files are coming from arbitrary HTTP URLs, what would a glob actually mean? Most HTTP endpoints don't know what globbing is, and unless it's an FTP server the contents may not actually be discoverable (i.e. DRS APIs from GA4GH don't have a list functionality). If files are coming from a cloud bucket, the globbing semantics can differ depending on which cloud you are on. And if you are running on a proprietary system (e.g. DNAnexus), what does globbing mean when interacting with their file system?

While globbing inputs may seem like it's a good idea, imo there are too many situations where this wouldn't work to really make this a feature.

As @jdidion pointed out, the Directory type was introduced in an effort to solve some of these issues. There are certainly challenges with how the directory type is implemented as is, and it may not be stable moving forward.

kirkgrubbs1 commented 3 years ago

I think we may have gotten off on the wrong foot, and that's probably due to some miscommunication on my part. I don't mean to be or sound mean or critical or anything. I just thought it might be an easy enhancement since globbing is already implemented for another function.

As far as the limitation to use on a local disk, if glob for output is understood to work only "on disk", why would glob for input be expected to work for remote files? Admittedly, I come from a biology background and as such I mainly just bang on the keyboard until the machine does what I need it to, but glob is not a function I would use to gather remote data sources.

jdidion commented 9 months ago

I think this is better handled by the engine. For example, miniwdl can take input files on the command line, and if it doesn't already turn a glob into an array of files, that would be a reasonable request.

sknaack commented 8 months ago

Do I understand correctly that this is still an issue in WDL? I'm trying to pass an input path to a directory in my .wdl code and generate input file arrays from it using a pattern string in an intelligent way. In my case, a few file arrays of ~12 files each are required input for my workflow. This is primarily to simplify the required input information in the .json (giving an input directory path vs. tedious arrays of long file names). Is using glob for inputs from a given directory still not supported within WDL? Or is there a specific version of WDL, or a workaround, that resolves this? It's surprising that something this commonsense isn't yet supported, since glob is used quite effectively within WDL tasks, so it can't be too difficult to implement. Thanks in advance for any details. If it helps, I'm running my .wdl code with Cromwell.

jdidion commented 8 months ago

WDL is a standard for writing workflows that are (ideally) agnostic of the platform where they run. Thus, it doesn't assume that your input files are in the same directory (or even on the same system) as where the workflow is being executed, nor does it assume that individual tasks will execute in the same environment as the workflow. So, writing a glob expression in your workflow may work fine when running the workflow locally, but it would fail if someone tried to run it e.g. using AWS Batch.

A utility to simplify the specification of large numbers of inputs is a great idea, but it's better implemented by individual WDL implementations or in a wrapper script.

sknaack commented 8 months ago

I understand the concern about local/batch running and not hardcoding paths, but I only mean to run glob on a directory given by a variable from an input .json, wherever the workflow is run. In some cases that may be a directory of data copied from S3 onto an EC2 instance in a batch mode, or a locally available directory if run from the command line. Either way, defining an input array of files for a task with glob is a commonsense functionality to have; that this is already supported for output arrays of files from tasks illustrates this by analogy. Perhaps one solution is to start with a small task that copies a data directory from an S3 path and globs the files into arrays as outputs of that first task, to pass forward to succeeding workflow steps.
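That workaround can be sketched as a staging task that copies the directory locally and then uses glob() in its output section, which is where the spec does define it. The task name, input name, and S3 prefix are hypothetical, and it assumes the task's runtime environment has the AWS CLI and credentials available:

```wdl
version 1.0

task stage_and_glob {
  input {
    String s3_prefix  # hypothetical, e.g. "s3://my-bucket/data/"
  }
  command <<<
    # copy the remote directory into the task's working directory
    aws s3 cp --recursive "~{s3_prefix}" staged/
  >>>
  output {
    # glob() is defined for task outputs, so the staged files
    # can be gathered into an Array[File] here
    Array[File] fasta_files = glob("staged/*.fna")
  }
}
```

Downstream calls can then consume stage_and_glob.fasta_files as an ordinary Array[File].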

markjschreiber commented 3 months ago

Currently in the development spec there isn't any mechanism to turn a Directory into an Array[File]. There is also no mechanism to scatter over a Directory. I think being able to declare an Array[File] using a glob would be very useful. It could also make the Directory type redundant, because you could just have something like:

Array[File] inFiles = "/path/to/directory/*"

Not needing a Directory type removes the need to be able to map it into something that can be scattered.