nextflow-io / nf-schema

Functionality for working with pipeline and sample sheet schema files in Nextflow pipelines
https://nextflow-io.github.io/nf-schema/
Apache License 2.0

Validation of Azure directory-paths causes issues #16

Closed apetkau closed 5 months ago

apetkau commented 11 months ago

I am trying to run a pipeline within Azure and use an Azure path for the output directory (e.g., az://12345/outputs/). However, the validation of this path causes issues:

Nov-03 16:19:26.907 [main] DEBUG nextflow.cli.Launcher - Operation aborted
nextflow.validation.SchemaValidationException: The following invalid input values have been detected:

* --outdir: 'az://12345/outputs/' is not a directory, but a file (az://12345/outputs/)

    at nextflow.validation.SchemaValidator.validateParameters(SchemaValidator.groovy:399)
    at nextflow.validation.SchemaValidator.validateParameters(SchemaValidator.groovy)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
...

This seems to be related to outdir being defined with the directory-path format in the nextflow_schema.json file:

"outdir": {
    "type": "string",
    "format": "directory-path",
    "description": "The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.",
    "fa_icon": "fas fa-folder-open"
},

If the directory-path format is removed (or replaced with file-path), the pipeline successfully executes.

I am wondering if someone could help out with this? I am using the most recent version of nf-validation (1.1.1). Thanks.

nvnieuwk commented 11 months ago

Hi @apetkau

Thanks for reporting this!

We use Nextflow's built-in functionality to test the existence of all files. Since the problem is probably in the Nextflow codebase, it will take some work to track down. For now, I would suggest using the path format for all your files, since that doesn't check whether the value is a file or a directory. I'll have a look at it when I have some time.
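The suggested workaround can be sketched as a small script. This is only an illustration, not part of nf-validation: the helper name `relax_directory_formats` is hypothetical, and the nested `definitions` layout of the example schema is an assumption based on typical nf-core schemas. It swaps every `directory-path` format for `path` in a schema that has already been loaded into a dict.

```python
import copy


def relax_directory_formats(schema, replacement="path"):
    """Return a copy of `schema` with every "format": "directory-path" replaced.

    Hypothetical helper for illustration; it walks the whole structure so it
    does not depend on where the parameter definitions are nested.
    """
    relaxed = copy.deepcopy(schema)

    def walk(node):
        if isinstance(node, dict):
            if node.get("format") == "directory-path":
                node["format"] = replacement
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(relaxed)
    return relaxed


# Example schema fragment (the "definitions" nesting is an assumption).
schema = {
    "definitions": {
        "input_output_options": {
            "properties": {
                "outdir": {
                    "type": "string",
                    "format": "directory-path",
                    "description": "The output directory where the results will be saved.",
                }
            }
        }
    }
}

relaxed = relax_directory_formats(schema)
print(relaxed["definitions"]["input_output_options"]["properties"]["outdir"]["format"])
```

The original dict is left untouched, so the relaxed copy can be written to a separate file while the strict schema stays in version control.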

apetkau commented 11 months ago

Thanks so much @nvnieuwk. That makes sense.

adamrtalbot commented 9 months ago

Hi @apetkau, do you know which version of Nextflow and nf-validation you used? It should be in the logs.

apetkau commented 9 months ago

Hello @adamrtalbot. This is with Nextflow 23.04.2 and nf-validation 1.1.3. Thanks.

adamrtalbot commented 9 months ago

OK, with Nextflow 23.10.0 and nf-validation 1.1.3 we saw the problem the other way around: if the path did not include a trailing slash, nf-validation complained that it was not a directory. That is closer to the truth, but still not accurate.

From nf-core megatests:

Pulling nf-core/fetchngs ...
 downloaded from https://github.com/nf-core/fetchngs.git
Launching `https://github.com/nf-core/fetchngs` [azure_fetchngs_small] DSL2 - revision: 04ee5031a4 [master]
Downloading plugin nf-validation@1.1.3
WARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`

------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/fetchngs v1.11.0-g04ee503
------------------------------------------------------
Core Nextflow options
  revision                  : master
  runName                   : azure_fetchngs_small
  launchDir                 : /mnt/resource/batch/tasks/workitems/nf-workflow-5GJmv03MWYhTOs/job-1/nf-workflow-5GJmv03MWYhTOs/wd
  workDir                   : /work/work/fetchngs/work-f794ea3cb15147c339cae6225c82a408834597a3
  projectDir                : /.nextflow/assets/nf-core/fetchngs
  userName                  : root
  profile                   : test
  configFiles               : 

Input/output options
  input                     : ${projectDir}/tests/sra_ids_test.csv
  outdir                    : az://work/fetchngs/results-test-f794ea3cb15147c339cae6225c82a408834597a3

Institutional config options
  config_profile_name       : Test profile
  config_profile_description: Minimal test dataset to check pipeline function

Max job request options
  max_cpus                  : 2
  max_memory                : 6.GB
  max_time                  : 6.h

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/fetchngs for your analysis please cite:

* The pipeline
  https://doi.org/10.5281/zenodo.5070524

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/nf-core/fetchngs/blob/master/CITATIONS.md
------------------------------------------------------
ERROR ~ ERROR: Validation of pipeline parameters failed!

 -- Check 'nf-5GJmv03MWYhTOs.log' file for details
The following invalid input values have been detected:

* --input: the file or directory '${projectDir}/tests/sra_ids_test.csv' does not exist.
* --outdir: 'az://work/fetchngs/results-test-f794ea3cb15147c339cae6225c82a408834597a3' is not a directory, but a file (az://work/fetchngs/results-test-f794ea3cb15147c339cae6225c82a408834597a3)

The fix: 0fb6002be0fa2f1dfe9a448e7f5b69dc6f7eef04

apetkau commented 9 months ago

Oh, interesting. We will try to test this out with Nextflow 23.10.0 then and see what happens. Thank you.

zeehio commented 8 months ago

Thanks for looking into this!

I was having this same issue on an Azure storage account without hierarchical namespaces, using Nextflow 23.04.1. I should update Nextflow and try again, but I believe we will find consistent results.

I'd like to add my two cents to the issue anyway:

My storage account has hierarchical namespaces disabled. I'm no expert, but I've seen that in such storage accounts it is possible to create a file AND a folder with the same name. For instance, you can create two files, the first one at some/file.txt and the second at some/file.txt/insidefolder.txt. To my understanding, the reason behind this is that folders do not actually exist, and the path is just metadata assigned to a file.
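A minimal sketch of that behaviour, using a plain Python set as a stand-in for a flat blob namespace. This is a toy model and not the Azure SDK; the function name `classify` is made up for illustration. A name counts as a "file" if a blob with exactly that name exists, and as a "directory" if any blob name starts with that name plus a slash.

```python
def classify(blob_names, path):
    """Report whether `path` looks like a file, a directory, or both.

    Toy model of a flat namespace: there are no real folders, only blob
    names that may contain "/" characters.
    """
    key = path.rstrip("/")
    is_file = key in blob_names
    is_directory = any(name.startswith(key + "/") for name in blob_names)
    return {"file": is_file, "directory": is_directory}


# The two blobs from the example above can coexist in a flat namespace.
blobs = {"some/file.txt", "some/file.txt/insidefolder.txt"}

print(classify(blobs, "some/file.txt"))  # both a file and a directory
print(classify(blobs, "some"))           # directory only
print(classify(blobs, "missing"))        # neither
```

In this model a path that has no blobs under it yet, such as a fresh output location, is neither a file nor a directory, which is why a strict "is this a directory?" check is hard to answer for it.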

Therefore, in my opinion, appending / at the end of the path would be required for Azure storage directories.

I am now curious about what would happen with the path some/file.txt when the storage container is mounted locally on a Linux system using blobfuse2. I bet some error happens, otherwise it would be both a file and a directory... who knows...

apetkau commented 8 months ago

I forgot to update this issue with our tests with Nextflow 23.10.0, sorry about that. Everything works after upgrading Nextflow. That is, a path with a trailing / in Azure (like az://path/to/dir/) is successfully validated as a directory for a Nextflow pipeline. Thanks so much for the suggestion.
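For anyone scripting their launches, the trailing-slash convention can be applied before the value is passed to `--outdir`. The helper below is a hypothetical sketch (the name `azure_outdir` is not from any Nextflow tooling); it only touches `az://` values and leaves everything else unchanged.

```python
def azure_outdir(outdir):
    """Append a trailing slash to az:// paths that lack one.

    Hypothetical helper: non-Azure values are returned unchanged.
    """
    if outdir.startswith("az://") and not outdir.endswith("/"):
        return outdir + "/"
    return outdir


print(azure_outdir("az://path/to/dir"))
print(azure_outdir("az://path/to/dir/"))
print(azure_outdir("/local/results"))
```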

@zeehio thanks so much for your input. I did not know you could have the same name be interpreted as both a file and a folder, but it makes sense. It also makes sense then that folders need to have an appended / in Azure storage.

adamrtalbot commented 8 months ago

OK, I wrote a script to iterate through some options. Currently, the nf-validation, fetchngs and Nextflow versions don't seem to make a difference: all of them except the latest seem to have this error.

However, the version of nf-azure does seem to work. I tried with the following versions:

There's possibly a subtle interaction with Nextflow version as well which might be worth looking into, but I figured if nf-azure is the cause I will focus on that until I identify the specific version that causes the problem.

Reminder, nf-azure version * outdir with/without slash. Results:

| nf-azure version | with slash | without slash |
|---|---|---|
| 1.0.1 | :white_check_mark: | :x: |
| 1.3.3 | :white_check_mark: | :x: |
| 1.4.0 | :white_check_mark: | :x: |

I suspect the problem is with the nf-azure plugin not handling the pseudo-directories in Azure correctly. I will look into it and raise an issue there.

So in summary, the current workaround is to use the latest stable version of everything, with a slash in the outdir. If you do this it should work:

apetkau commented 8 months ago

Wow. Awesome. Thanks so much for all your help @adamrtalbot in looking into this 😄

adamrtalbot commented 5 months ago

As of Nextflow v23.10.1, nf-validation 1.1.3 and nf-azure 1.3.3 this problem has popped back up.

Easiest solution would be to ignore file or directory validation for Azure and GCP storage since it's fake anyway.
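Roughly, that idea could look like the sketch below. It is written in Python purely to illustrate the logic and is not nf-validation code: the function name `should_check_path_type` and the set of schemes are assumptions (`az` is the scheme used in this issue, `gs` is assumed here for Google Cloud Storage).

```python
from urllib.parse import urlparse

# Assumption: URI schemes whose "directories" are only name prefixes.
CLOUD_SCHEMES = {"az", "gs"}


def should_check_path_type(value):
    """Return False for object-storage URIs, True for everything else.

    Hypothetical guard: a validator would call this before deciding
    whether to run its file-versus-directory check.
    """
    scheme = urlparse(value).scheme.lower()
    return scheme not in CLOUD_SCHEMES


print(should_check_path_type("az://bucket/outputs"))
print(should_check_path_type("gs://bucket/outputs"))
print(should_check_path_type("/local/results"))
```

The trade-off is that a genuinely wrong value, such as an existing blob passed as an output directory, would no longer be caught for these schemes.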

Reproduce it with:

nextflow run nf-core/sarek -r 3.14.0 -profile test --outdir az://bucket/outputs

This uses the versions mentioned above.

adamrtalbot commented 5 months ago

Using the edge release (v24.03.0-edge) does not fix this.

It's possibly because Nextflow explicitly checks for a directory attribute, which is probably not true if the path does not exist: https://github.com/nextflow-io/nextflow/blob/019eb86c2c0169f18f115a0924dcdf3cb958f981/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzPath.groovy#L120-L127
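The same effect is easy to see on a local filesystem, which may help when reasoning about the cloud case. The sketch below is an analogy only and makes no claim about how AzPath behaves: both function names are hypothetical, and the error text is copied from the log earlier in this thread. An "is it a directory?" test returns false for a path that does not exist, so a check that treats "not a directory" as "is a file" rejects an output directory that simply has not been created yet.

```python
import tempfile
from pathlib import Path


def naive_directory_error(path):
    """Hypothetical check that treats 'not a directory' as 'is a file'."""
    if not Path(path).is_dir():
        return f"'{path}' is not a directory, but a file ({path})"
    return None


def stricter_directory_error(path):
    """Hypothetical check that only complains about paths that exist."""
    p = Path(path)
    if not p.exists():
        return None  # an output directory may legitimately be created later
    if not p.is_dir():
        return f"'{path}' is not a directory, but a file ({path})"
    return None


with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "outputs"  # never created
    print(naive_directory_error(missing))
    print(stricter_directory_error(missing))
```

The first call prints a "not a directory, but a file" message even though nothing exists at that path; the second returns `None` because it checks for existence first.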

adamrtalbot commented 5 months ago

OK here's the line in nf-schema: https://github.com/nextflow-io/nf-schema/blob/252c714a49210318bb152d0726a04b5299d5c881/plugins/nf-schema/src/main/nextflow/validation/CustomEvaluators/FormatFilePathPatternEvaluator.groovy#L43

Which calls this method in nf-azure: https://github.com/nextflow-io/nextflow/blob/019eb86c2c0169f18f115a0924dcdf3cb958f981/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzPath.groovy#L96-L98

Which is set here: https://github.com/nextflow-io/nextflow/blob/019eb86c2c0169f18f115a0924dcdf3cb958f981/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzPath.groovy#L76-L88

@pditommaso I'm not familiar with the code here. Is there any way to tell a file and a directory apart in Nextflow? Or should we handle it on the nf-schema side?