nfdi4plants / CWLGenerator

MIT License
3 stars 0 forks source link

Tests. #1

Closed Brilator closed 2 years ago

Brilator commented 2 years ago

Hey, I've set up a minimal example to test the tool with a simple R script, one input, one output (folder):

https://git.nfdi4plants.org/brilator/cwl_generator_test

Unfortunately, I'm running into an error. Which probably comes from naming the output twice: once as an input and once during the ? What is the name you want to assign to the output (e.g. results)?, which allows the option - If your output directory is specified under inputs: /$(inputs.NameOfTheInput).

Brilator commented 2 years ago

Details and log in the repo's README. I couldn't find @louperelo and @ZimmerA in the DataHub (registration -> https://register.nfdi4plants.org/registration). Let me know if I can add you.

ZimmerA commented 2 years ago

@Brilator You can add me now, I will take a look at it!

ZimmerA commented 2 years ago

Okay, I think a good way to approach this is to first think about how the working output file would have to look like. The first issue I found was that we are supplying the "calculate_means.R" script to the base command but CWL cannot find it because it does not exist in the CWL environment.

There are 2 Possible solutions for this:

- class: InitialWorkDirRequirement
  listing:
    - class: File
      location: calculate_means.R

Regarding CWLGenerator, the first option would be the simplest and would not require any changes.

The second issue is that we are supplying out_folder as a Directory. This means we cannot access it in glob without doing any extra-work (staging the directory). Instead, we would have to use outputEval

outputs:
- id: outdir
  type: Directory
  outputBinding:
    outputEval: $(inputs.out_folder)

But this does not feel very idiomatic, since it will place the files in the specified directory and not create a separate output folder. So I would sugest against this.

It would be simpler if we specified the output as a string. If we do however, we need to make sure that the R script creates the directory before writing to it. Alternatively we could create an InitialWorkDirRequirement that creates the folder for us.

Again, I would suggest going with option a) because this would require another edge case for the CWLGenerator which would make the questioning even more confusing (suggestions welcome).

So for me, the "perfect" output file would look like this:

#!/usr/bin/env cwl-runner
cwlVersion: v1.2.0-dev1
class: CommandLineTool
inputs:
- id: r_script
  type: File
  inputBinding:
    position: 0
- id: in_file
  type: File
  inputBinding:
    position: 1
- id: out_folder
  type: string
  inputBinding:
    position: 2

outputs:
- id: outdir
  type: Directory
  outputBinding:
    glob: $(inputs.out_folder)

baseCommand:
- Rscript

The Job file for this would be:

r_script:
  class: File
  location: ./calculate_means.R
in_file:
  class: File
  location: input/expression_table.txt
out_folder: ./output

The "perfect" output is already achievable with CWLGenerator in the current state (We do have to improve the question phrases though).

CWLGenerator!
? What is the basecommand to run your tool (e.g. samtools view). Rscript
? How many inputs does your tool have? 3
Configuring Input 1
? What is the name you want to assign to the input (e.g. threads)? r_script
? What is the command prefix for your input (e.g. -t). Press Enter if there is none. 
? Does your input consist of multiple elements (e.g. a list of files or directories) No
? What is the type of the input (Directory/File/String/Number)? When your input consists of multiple files only specify the type of a single element. File
Configuring Input 2
? What is the name you want to assign to the input (e.g. threads)? in_file
? What is the command prefix for your input (e.g. -t). Press Enter if there is none. 
? Does your input consist of multiple elements (e.g. a list of files or directories) No
? What is the type of the input (Directory/File/String/Number)? When your input consists of multiple files only specify the type of a single element. File
Configuring Input 3
? What is the name you want to assign to the input (e.g. threads)? out_folder        
? What is the command prefix for your input (e.g. -t). Press Enter if there is none. 
? Does your input consist of multiple elements (e.g. a list of files or directories) No
? What is the type of the input (Directory/File/String/Number)? When your input consists of multiple files only specify the type of a single element. string
? Do you know the location of your output? The location is required as a Glob pattern.
You will be asked to specify it later for each output.
If the location is unknown, the entire working directory will be returned. Yes        
? How many outputs does your tool have?
A result folder for example is a single output. 1
Configuring Output 1
? What is the name you want to assign to the output (e.g. results)?
Warning: Do not repeat a name already assigned for an input. outdir
? Does your output consist of multiple elements (e.g. a list of files or directories) Yes
? What is the type of the output (Directory/File/stdout)? When your output consists of multiple files only specify the type of a single element. Directory
? Where is your output located?Examples:
- An output directory in the working directory: /myOutputDir
- An output directory in a subfolder of the working directory: /*/myOutputDir
- Files of the same type in the working directory: /*.txt
- Files of the same type in a subfolder of the working directory: /*/*.txt
- If your output directory is specified under inputs: /$(inputs.NameOfTheInput) /$(inputs.out_folder)
? Do you need additional requirements to run your workflow? No
? What is the name of your tool (without spaces)?
This will be the name of the .cwl file calculate_means

PS: I think there is an error in the R script on line 42: write.table(dfr, file = paste0(out_folder, "expression_withMeans.txt"), row.names=F, sep="\t", quote=F) should be: write.table(dfr, file = paste0(out_folder, "/expression_withMeans.txt"), row.names=F, sep="\t", quote=F) (I added a "/" before "expression_withMeans" because directories usually don't end with a "/")

Sorry for the wall of text. Let me know if anything is not clear!

ZimmerA commented 2 years ago

Actually, we do have one bug here! Pressing Enter on What is the command prefix for your input (e.g. -t). Press Enter if there is none. Should not yield prefix: "" in the inputBinding but instead yield nothing.

Brilator commented 2 years ago

should be: write.table(dfr, file = paste0(out_folder, "/expression_withMeans.txt"), row.names=F, sep="\t", quote=F) (I added a "/" before "expression_withMeans" because directories usually don't end with a "/")

Well, I defined the out_folder as output/. But you're right, your way is safer.

Brilator commented 2 years ago

Actually, we do have one bug here! Pressing Enter on What is the command prefix for your input (e.g. -t). Press Enter if there is none. Should not yield prefix: "" in the inputBinding but instead yield nothing.

Is this the reason it would still fail with your above "protocol"?

Brilator commented 2 years ago

Ah, sorry never mind. It did work after pulling your latest version.

ZimmerA commented 2 years ago

Awesome!