nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
fusion filesystem fails on google-batch/lifesciences with private repositories #3770

Closed Puumanamana closed 1 year ago

Puumanamana commented 1 year ago

Bug report

I've encountered issues using the fusion filesystem on google-batch (and google-lifesciences) for private GCP artifact repositories. I understand the support is relatively recent (23.02.1-edge), but I'm still putting it out there in case it can help prevent bugs later. The fusion filesystem (along with wave containers) seems to work fine when using public container images, but fails on private ones. With the same config, switching off fusion (fusion.enabled = false) makes the run successful. In case it matters, setting scratch=true or false (as recommended in the docs) didn't affect the issue.

After seeing error code 14 for GLS, I tried enabling the Service Control API without success.

Expected behavior and actual behavior

Expected: No error

Steps to reproduce the problem

// main.nf

nextflow.enable.dsl = 2

process P1 {
    input:
    path f

    output:
    path "*"

    script:
    """
    echo finished > log1.out
    """
}

workflow {
    ch = file("test.txt")
    P1(ch)
}

// nextflow.config

google {
    project = [GCP PROJECT]
    region = "us-central1"
    batch {
        bootDiskSize = 50.GB
        serviceAccountEmail = [SERVICE ACCOUNT]
        spot = true
    }
}

process {
    executor  = "google-batch"
    container = "us-docker.pkg.dev/rome-pipeline-engine/nxf-container-repo/l1em:master_fix-se"
    disk      = "50.GB"
    scratch   = true
}

fusion {
    enabled = true
}

wave {
    enabled = true
}

Program output

google-batch:

Error executing process > 'P1'

Caused by:
  Process `P1` terminated with an error exit status (null)

Command executed:

  echo finished > log1.out

Command exit status:
  null

  Discarding device blocks: done
  Creating filesystem with 98304000 4k blocks and 24576000 inodes
  Filesystem UUID: dcbca287-79ed-4b9d-bd88-276dc3fc50ea
  Superblock backups stored on blocks:
  See 'd32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968

  Allocating group tables: done
  Writing inode tables: done
  Creating journal (262144 blocks): done
  Writing superblocks and filesystem accounting information: done

Command error:
  Error response from daemon: received unexpected HTTP status: 500 Internal Server Error
  mke2fs 1.46.5 (30-Dec-2021)
  Unable to find image 'wave.seqera.io/wt/786240688d5d/rome-pipeline-engine/nxf-container-repo/l1em:master_fix-se' locally
  docker: Error response from daemon: received unexpected HTTP status: 500 Internal Server Error.
  See 'docker run --help'.

Work dir:
  gs://nf-tower-public/scratch/ad/fc75b81163c333522e2b89eeef641e

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

google-lifesciences:

Error executing process > 'P1'

Caused by:
  Process `P1` terminated with an error exit status (14)

Command executed:

  echo finished > log1.out

Command exit status:
  14

Command output:
  (empty)

Command error:
  Execution failed: generic::unavailable: pulling image: docker pull: retry budget exhausted (10 attempts): running ["docker" "pull" "wave.seqera.io/wt/defb516ac1b9/rome-pipeline-engine/nxf-container-repo/l1em:master_fix-se"]: exit status 1 (standard error: "Error response from daemon: received unexpected HTTP status: 500 Internal Server Error\n")

Work dir:
  gs://nf-tower-public/scratch/32/3fab234827905fb30ef452c69c786c

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
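The failing image references in both outputs share one shape: wave.seqera.io/wt/<token>/<repository>:<tag>, where the part after the token matches the container set in nextflow.config without its registry host. A minimal Python sketch of that check follows. The layout is inferred only from the error messages in this report, not from Wave documentation, and the helper names `split_wave_image` and `same_repository` are hypothetical. It also assumes the reference always carries a tag.

```python
def split_wave_image(ref: str) -> dict:
    """Split a reference of the assumed form
    wave.seqera.io/wt/<token>/<repository>:<tag> into its parts.

    The layout is an assumption inferred from the error output above.
    """
    host, marker, token, rest = ref.split("/", 3)
    if marker != "wt":
        raise ValueError(f"not a Wave token reference: {ref}")
    # Assumes a tag is present; the last ':' separates repository and tag.
    repository, _, tag = rest.rpartition(":")
    return {"host": host, "token": token, "repository": repository, "tag": tag}


def same_repository(wave_ref: str, original_ref: str) -> bool:
    """Check that the Wave reference names the same repository and tag
    as the original image, ignoring the original registry host."""
    wave = split_wave_image(wave_ref)
    _, _, original_path = original_ref.partition("/")
    return original_path == f"{wave['repository']}:{wave['tag']}"


# Values copied from the google-batch output and nextflow.config above.
wave_ref = "wave.seqera.io/wt/786240688d5d/rome-pipeline-engine/nxf-container-repo/l1em:master_fix-se"
original = "us-docker.pkg.dev/rome-pipeline-engine/nxf-container-repo/l1em:master_fix-se"

print(split_wave_image(wave_ref)["token"])
print(same_repository(wave_ref, original))
```

If the check passes, the requested image name is the expected one, which points at the pull through Wave rather than at the container name in the config.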

Also, if it helps, here's the google batch log:

$ gcloud batch jobs describe [JOB ID]
allocationPolicy:
  instances:
  - policy:
      disks:
      - deviceName: fusion
        newDisk:
          diskInterface: NVMe
          sizeGb: '375'
          type: local-ssd
      machineType: n1-standard-1
      provisioningModel: SPOT
  labels:
    batch-job-id: [JOB ID]
  location:
    allowedLocations:
    - regions/us-central1
    - zones/us-central1-a
    - zones/us-central1-b
    - zones/us-central1-c
    - zones/us-central1-f
  serviceAccount:
    email: [SERVICE ACCOUNT]
createTime: '2023-03-17T14:23:41.170099214Z'
logsPolicy:
  destination: CLOUD_LOGGING
name: [JOB ID]
status:
  runDuration: 0s
  state: FAILED
  statusEvents:
  - description: Job state is set from QUEUED to SCHEDULED for job [JOB ID].
    eventTime: '2023-03-17T14:23:47.051652878Z'
    type: STATUS_CHANGED
  - description: Job state is set from SCHEDULED to FAILED for job [JOB ID].
    eventTime: '2023-03-17T14:25:34.929412246Z'
    type: STATUS_CHANGED
  taskGroups:
    group0:
      counts:
        FAILED: '1'
      instances:
      - bootDisk:
          image: projects/batch-custom-image/global/images/family/batch-cos-stable-official
          sizeGb: '64'
          type: pd-ssd
        machineType: n1-standard-1
        provisioningModel: SPOT
        taskPack: '1'
taskGroups:
- name: [JOB ID]/taskGroups/group0
  parallelism: '1'
  taskCount: '1'
  taskSpec:
    computeResource:
      bootDiskMib: '51200'
      cpuMilli: '1000'
      memoryMib: '2000'
    runnables:
    - container:
        commands:
        - /usr/bin/fusion
        - bash
        - /fusion/gs/nf-tower-public/scratch/18/1694ef0c3bf50997b204d221939e7a/.command.run
        imageUri: wave.seqera.io/wt/342428e8630e/rome-pipeline-engine/nxf-container-repo/l1em:master_fix-se
        options: --privileged
        volumes:
        - /tmp:/tmp:rw
      environment:
        variables:
          FUSION_TAGS: '[.command.*|.exitcode|.fusion.*](nextflow.io/metadata=true),[*](nextflow.io/temporary=true)'
          FUSION_WORK: /fusion/gs/nf-tower-public/scratch/18/1694ef0c3bf50997b204d221939e7a
    volumes:
    - deviceName: fusion
      mountPath: /tmp
uid: [JOB ID]
updateTime: '2023-03-17T14:25:34.929412246Z'

Environment

pditommaso commented 1 year ago

Google LS does not support fusion. Regarding Batch, have you provided the private registry credentials via Tower?

Puumanamana commented 1 year ago

Yes, the TOWER_ACCESS_TOKEN environment variable is set (and it works since it runs without fusion enabled)

pditommaso commented 1 year ago

Can you please include the container name as it is specified in the config?

Puumanamana commented 1 year ago

I updated the post to include it

pditommaso commented 1 year ago

Are you using Google Artifact Registry or Container Registry?

Puumanamana commented 1 year ago

Artifact registries

pditommaso commented 1 year ago

Can you please re-enter the credentials on tower.nf?

Puumanamana commented 1 year ago

Just did (in my personal credentials, I deleted the container registry credentials for the private artifact registry and re-added it). Same error for now.

pditommaso commented 1 year ago

I think I've found the problem. We may release a patch by Monday

Puumanamana commented 1 year ago

Great, thank you!

pditommaso commented 1 year ago

Unfortunately, I'm still unable to replicate the issue. Can you please try to rerun it at your convenience?

Puumanamana commented 1 year ago

Still not working, I tried a few things:

I'll let you know if I find anything else.

Also, I don't know if it makes sense to do that, but I also tried using the local executor with fusion enabled (and a GS URI as work directory), and I had the same error.

Here's the .nextflow.log for that if it helps:

Mar-17 20:58:11.382 [FileTransfer-1] DEBUG nextflow.file.FilePorter - Copying foreign file /home/cedric/sandbox/troubleshoot/fusion/test.txt to work dir: gs://nf-tower-public/scratch/stage-bef7c30d-a2da-467b-9072-5f7d75582448/a1/7afc82dfaf1939258ad565a586d949/test.txt
Mar-17 20:58:11.662 [Actor Thread 3] DEBUG i.s.wave.plugin.config.WaveConfig - Wave strategy not specified - using default: [container, dockerfile, conda]
Mar-17 20:58:11.666 [Actor Thread 3] DEBUG io.seqera.wave.plugin.WaveClient - Wave server endpoint: https://wave.seqera.io
Mar-17 20:58:11.702 [Actor Thread 3] DEBUG io.seqera.wave.plugin.WaveClient - Wave request container config: https://fusionfs.seqera.io/releases/v2.1-amd64.json
Mar-17 20:58:11.892 [Actor Thread 3] DEBUG io.seqera.wave.plugin.WaveClient - Wave container config response: [200] {
  "layers": [
    {
      "location": "https://fusionfs.seqera.io/releases/pkg/2/1/6/fusion-amd64.tar.gz",
      "gzipDigest": "sha256:782f50229060010f4f8e8bb6c52822f3fc95dafef0ca742128998a307a1db0d3",
      "gzipSize": 13522690,
      "tarDigest": "sha256:382627a7a78ba495481489b036a99798e3c6245433c29685b0efdcc4b39740f1",
      "skipHashing": true
    }
  ]
}

Mar-17 20:58:11.939 [Actor Thread 3] DEBUG io.seqera.wave.plugin.WaveClient - Wave request: https://wave.seqera.io/container-token; attempt=1 - request: SubmitContainerTokenRequest(towerAccessToken:eyJ0aWQiOiA2OTgyfS45NTJmZjEwMjhmNjg3NTJkMWJjZmIxNTYyMDg4NmU2ZmQ3YTQ2Yjdl, towerRefreshToken:null, towerWorkspaceId:44413759927279, towerEndpoint:https://api.tower.nf, containerImage:us-docker.pkg.dev/rome-pipeline-engine/nxf-container-repo/l1em:master_fix-se, containerFile:null, containerConfig:ContainerConfig(entrypoint:null, cmd:null, env:null, workingDir:null, layers:[ContainerLayer[location=https://fusionfs.seqera.io/releases/pkg/2/1/6/fusion-amd64.tar.gz; tarDigest=sha256:382627a7a78ba495481489b036a99798e3c6245433c29685b0efdcc4b39740f1; gzipDigest=sha256:782f50229060010f4f8e8bb6c52822f3fc95dafef0ca742128998a307a1db0d3; gzipSize=13522690]]), condaFile:null, containerPlatform:null, buildRepository:null, cacheRepository:null, timestamp:2023-03-17T20:58:11.933374743Z, fingerprint:c6f62794090f039a74d43721d0a5ac6e)
Mar-17 20:58:12.519 [Actor Thread 3] DEBUG io.seqera.wave.plugin.WaveClient - Wave response: statusCode=200; body={"containerToken":"52d8b11047ee","targetImage":"wave.seqera.io/wt/52d8b11047ee/rome-pipeline-engine/nxf-container-repo/l1em:master_fix-se","expiration":"2023-03-18T18:58:12.464668808Z"}
Mar-17 20:58:12.976 [Task submitter] DEBUG n.executor.local.LocalTaskHandler - Launch cmd line: docker run -i -e "FUSION_WORK=/fusion/gs/nf-tower-public/scratch/75/15b877bc26e72b9f97b182a2243050" -e "FUSION_TAGS=[.command.*|.exitcode|.fusion.*](nextflow.io/metadata=true),[*](nextflow.io/temporary=true)" --rm --privileged wave.seqera.io/wt/52d8b11047ee/rome-pipeline-engine/nxf-container-repo/l1em:master_fix-se /usr/bin/fusion bash '/fusion/gs/nf-tower-public/scratch/75/15b877bc26e72b9f97b182a2243050/.command.run'
Mar-17 20:58:12.978 [Task submitter] INFO  nextflow.Session - [75/15b877] Submitted process > P1
Mar-17 20:58:18.228 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 1; name: P1; status: COMPLETED; exit: 125; error: -; workDir: gs://nf-tower-public/scratch/75/15b877bc26e72b9f97b182a2243050]
Mar-17 20:58:18.235 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=P1; work-dir=gs://nf-tower-public/scratch/75/15b877bc26e72b9f97b182a2243050
  error [nextflow.exception.ProcessFailedException]: Process `P1` terminated with an error exit status (125)
Mar-17 20:58:18.299 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs://nf-tower-public/scratch/75/15b877bc26e72b9f97b182a2243050/.command.out
Mar-17 20:58:18.302 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'P1'

Caused by:
  Process `P1` terminated with an error exit status (125)

Command executed:

  echo finished > log1.out

Command exit status:
  125

Command output:
  (empty)

Command error:
  Unable to find image 'wave.seqera.io/wt/52d8b11047ee/rome-pipeline-engine/nxf-container-repo/l1em:master_fix-se' locally
  docker: Error response from daemon: received unexpected HTTP status: 500 Internal Server Error.
  See 'docker run --help'.

pditommaso commented 1 year ago

I think we made some progress. You may want to give it another try.

Puumanamana commented 1 year ago

Awesome, it works now!

pditommaso commented 1 year ago

Excellent! It was a problem with a URI redirect that used a relative path, issued by Google Artifact Registry.
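For readers unfamiliar with this class of problem: an HTTP redirect may carry a relative reference in its Location header, and whoever follows the redirect has to resolve it against the URL that was originally requested. The following is a minimal Python sketch of that resolution step only. It is not the actual fix, and the URLs and the helper names `is_relative` and `resolve_redirect` are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def is_relative(location: str) -> bool:
    """True when the Location value has no scheme or host of its own."""
    parts = urlsplit(location)
    return not parts.scheme and not parts.netloc


def resolve_redirect(request_url: str, location: str) -> str:
    """Resolve a redirect Location header against the requested URL.

    An absolute Location is returned unchanged; a relative one inherits
    the scheme and host of the request.
    """
    return urljoin(request_url, location)


# Hypothetical URLs, for illustration only.
request_url = "https://registry.example.com/v2/my-project/my-repo/blobs/some-digest"
location = "/downloads/my-project/blob?token=xyz"

print(is_relative(location))
print(resolve_redirect(request_url, location))
```

A client or proxy that forwards the relative value as-is, instead of resolving it first, ends up requesting a URL with no host, which is one plausible way for a pull to fail only on registries that answer with relative redirects.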