nih-cfde / submission-workflow

Clarification on success/failure of submission tool #53

Closed ACharbonneau closed 3 years ago

ACharbonneau commented 3 years ago

I think this might need to be in the documentation? Or maybe it's something that can be fixed/changed?

The normal workflow as far as I can tell is:

  1. submit package in tool
  2. tool reports initial success and tells me to check status later
  3. I get an email from Globus with success or failure
  4. I get an email from cfde-submission@nih-cfde.org with success or failure
  5. my package shows up in the portal

But now I'm hitting a failure mode that just stops my submission between steps 3 and 4. I thought it was a completely silent error, because I had gotten a success email from Globus and then nothing else happened. But apparently, if I check cfde-submit status after a couple of hours, it does report a failure. As a user, I would not (and did not) think to check the tool again after getting a success email; I only checked cfde-submit status later when David prompted me. So I think users encountering this will also assume that, since they received a 'transfer succeeded' email, the problem is somewhere after the submission tool.

I'm also trying to understand what the difference is between all of these success/failure checkpoints, since there are clearly more of them than I thought, and they report differently. As a user, if my submission fails between steps 3 and 4, I'd still expect to get a failure email from cfde-submission@nih-cfde.org, but I don't. So what triggers that email? Are there other possible fail points?

Specific failure: This was data for a new DCC, and the endpoint hadn't been updated to have directories for them or to have the group permissions. That produces this behavior:

On submission, the tool seems to submit properly:

DEBUG:cfde_submit.client:Using local Globus Connect Personal Endpoint '04b54a52-e93d-11ea-9f05-0aba3c43875b'
Started DERIVA ingest flow
Your dataset has been submitted
You can check the progress with: cfde-submit status

And I get an email from Globus that my transfer has succeeded:

[image: Globus transfer success email]

and my status in the tool says:

Status of CFDE Submission (Flow ID 85a60149-8c65-4149-8a5a-3a5ce48de5a8)
This instance ID: d1939d86-6779-4702-be69-e793ac891764

This Flow is still in progress.
Current Flow Step: DerivaIngest

But then my transfer doesn't ever show up in the portal, and a couple hours later if I check the status in the tool again I get:

(cfdesubmit) (base) amandas-mbp-2:~ amanda$ cfde-submit status
Running on service 'staging'
Running on service 'staging'

Status of CFDE Submission (Flow ID 85a60149-8c65-4149-8a5a-3a5ce48de5a8)
This instance ID: d1939d86-6779-4702-be69-e793ac891764

This Flow has failed.
Current Flow Step: CheckDerivaIngest
Submission Flow failed: {
    "code": "FlowFailed",
    "description": "Error on state 'CheckDerivaIngest': processing path '$.UserState.DerivaIngestResult.details.error'",
    "details": {
        "exception": "States.Runtime",
        "json_path": "$.UserState.DerivaIngestResult.details.error",
        "state_name": "CheckDerivaIngest"
    },
    "error": "FlowFailed",
    "time": "2021-02-19T00:42:24.671000+00:00"
}

The package never goes to the portal, and the user never receives a cfde-submission@nih-cfde.org email.
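
For anyone who wants to catch this case without remembering to re-check by hand, here is a minimal sketch that classifies the text printed by cfde-submit status. The two marker strings are copied from the output above; the function names are hypothetical, and the wording of a successful status isn't shown in this thread, so anything else is reported as unknown.

```python
import subprocess

# Marker strings copied from the `cfde-submit status` output quoted above.
IN_PROGRESS_MARKER = "This Flow is still in progress."
FAILED_MARKER = "This Flow has failed."


def classify_status(status_text: str) -> str:
    """Map the text printed by `cfde-submit status` to a coarse state."""
    if FAILED_MARKER in status_text:
        return "failed"
    if IN_PROGRESS_MARKER in status_text:
        return "in_progress"
    # The success wording isn't shown in this thread, so don't guess at it.
    return "unknown"


def current_status() -> str:
    """Run `cfde-submit status` (must be installed) and classify its output."""
    result = subprocess.run(
        ["cfde-submit", "status"],
        capture_output=True,
        text=True,
        check=False,  # classify the output even if the exit code is non-zero
    )
    return classify_status(result.stdout + result.stderr)
```

Calling current_status() a few hours after submitting would have returned "failed" here, even though the Globus email reported a successful transfer.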

NickolausDS commented 3 years ago

The five steps above are a good description of the workflow. I think if I could boil down the central problem, it's related to this:

Specific failure: This was data for a new DCC, and the endpoint hadn't updated to have directories for them or to have the group permissions.

That's correct, we don't fully automate the process for provisioning new DCCs. Currently, it still requires a developer to run a command to ensure submitters in the new DCC group can run the flow. I think you got this error because you are a tester with access to multiple DCCs. Normally, running a dataset submission for a DCC which hasn't been set up yet should result in an immediate failure when running cfde-submit run (Note: this model assumes users only have access to one DCC at a time).

We talked about automating the onboarding process for DCCs for the Action Provider (in Slack), but it's still currently a (one step) manual process. Looks like David already ran through and re-deployed the latest to staging, so this should be fixed. The latest problem seems to be the one here: https://github.com/nih-cfde/submission-workflow/issues/55

One thing strikes me from your post about steps 3 and 4: Globus can report transfer success, but the ingest can still fail. Is it confusing for the user to see two emails? It seems like the first email for transfer gives a false sense of successful submission completion. We can disable step 3 to fix that. Note that when not using GCP (Globus Connect Personal), users won't ever see emails from Globus Transfer.

ACharbonneau commented 3 years ago

I think you got this error due to being a tester and having access to multiple DCCs. Normally, running a dataset submission for a DCC which hasn't been setup yet should result in an immediate fail on running cfde-submit run (Note: This model assumes users only have access to one DCC at a time).

This isn't a good assumption. While I'm for sure a weird user because I have access to every DCC group, I'm only weird because my number is so high. Our use cases specifically call for users who are part of more than one DCC, and the moment I start onboarding people we will have at least 2 or 3 people who have permissions for more than one DCC. Off the top of my head, Avi Maayan and Daniel Clark are at both LINCS and IDG. There are 2-3 modeling people who work at both SPARC and HuBMAP. Being in multiple DCCs simultaneously is a known use case.

Note that when not using GCP, users won't ever see emails from Globus Transfer.

I didn't realize that.

One thing strikes me from your post when talking about steps 3 and 4. Globus can report transfer success, but the ingest can still fail. Is it confusing for the user to see two emails? It seems like the first email for transfer gives a false sense of successful submission completion. We can disable step 3 to fix that.

I don't think it's the first email that was the problem, but the lack of a second email. I usually get an email telling me that it failed, but this time it for sure failed, but didn't email me.

NickolausDS commented 3 years ago

Our use cases specifically call for users who are part of more than one DCC...

This is good to know. As long as we run a deployment to onboard a new DCC before folks attempt to submit to it, it shouldn't be a problem. However, it does raise a permissions issue.

@karlcz, does Deriva protect against users submitting to the 'wrong' DCC if they don't have access? Currently, the Action Provider only prevents users from submitting if they're not in any submitters group, but it will allow a submitter with access to some DCC to submit to the 'wrong' DCC (by supplying a --dcc-id for a DCC they shouldn't have access to).
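
To make the gap concrete, here is a minimal sketch contrasting the two checks. The group table, user names, and function names are made up for illustration; this is not the Action Provider's code, and real memberships live in Globus groups.

```python
# Hypothetical submitter groups, keyed by DCC id (illustration only).
SUBMITTER_GROUPS = {
    "dcc-a": {"alice", "bob"},
    "dcc-b": {"alice"},
}


def in_any_submitters_group(user: str) -> bool:
    """The coarse check described above: is the user a submitter anywhere?"""
    return any(user in members for members in SUBMITTER_GROUPS.values())


def may_submit_to(user: str, dcc_id: str) -> bool:
    """The stricter check: is the user a submitter for this specific DCC?"""
    # An unprovisioned DCC has no group yet, so nobody may submit to it.
    return user in SUBMITTER_GROUPS.get(dcc_id, set())
```

With this data, bob passes the coarse check but fails may_submit_to("bob", "dcc-b"), which is the 'wrong DCC' case I'm worried about.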

I don't think it's the first email that was the problem, but the lack of a second email. I usually get an email telling me that it failed, but this time it for sure failed, but didn't email me.

If it's related to https://github.com/nih-cfde/submission-workflow/issues/55, it should send you an email noting failure, but it looks like another bug prevented it.

karlcz commented 3 years ago

The validate_dcc_id() and ingest() routines both enforce that the submitting_user is authorized to be a submitter for the submitting_dcc.

NickolausDS commented 3 years ago

Looking back on this issue, the original problem was submission errors due to a DCC not being provisioned, which David fixed. Secondly, I had a worry about authorization, but Karl put that to rest in the comment above.

@ACharbonneau Do we have any further action items with this Issue?

ACharbonneau commented 3 years ago

I don't think it's the first email that was the problem, but the lack of a second email. I usually get an email telling me that it failed, but this time it for sure failed, but didn't email me.

If it's related to #55, it should send you an email noting failure, but it looks like another bug prevented it.

I think that as long as the email bug is fixed we're good

NickolausDS commented 3 years ago

I think that as long as the email bug is fixed we're good

I checked this one over and found another potential problem where the action provider returns input which puts the flow in a state in which it can't send emails. The email bug should now be fixed in these rare circumstances.
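
The States.Runtime error earlier in this thread names the path $.UserState.DerivaIngestResult.details.error, which suggests (my reading, not confirmed here) that the flow tried to read an error field that wasn't present in the result. A defensive lookup along these lines avoids that kind of failure. This is a sketch in Python, not the flow definition itself, and lookup_path and failure_reason are hypothetical names.

```python
from typing import Any, Optional


def lookup_path(document: Any, dotted_path: str) -> Optional[Any]:
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def failure_reason(user_state: dict) -> str:
    """Return the ingest error if present, else a generic message for the email."""
    error = lookup_path(user_state, "DerivaIngestResult.details.error")
    if error is None:
        # Still produce something to send, rather than failing on the lookup.
        return "Ingest failed, but no error detail was returned."
    return str(error)
```

The point is that a missing error field still yields a message to send, so the failure email goes out either way.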

ACharbonneau commented 3 years ago

I think this is working now. I'll open a new issue if I see it again.