scitran / core

RESTful API
https://scitran.github.io
MIT License

Replacement of a file should trigger gear rule #1081

Closed ryansanford closed 6 years ago

ryansanford commented 6 years ago

A file was replaced with updated contents via the upload/reaper endpoint, resulting in the measurements and info fields being cleared. This replacement happened after the dicom-mr-classifier gear completed. The result was that the info and measurements fields were left empty, and the gear rule in place did not fire a new job to reconstitute their contents. Also, the generated nifti files are not in sync with the latest dicom.

ryansanford commented 6 years ago

@gsfr @lmperry Any unintended consequences with this change that should be considered?

The consequence of having mismatched niftis and unclassified dicoms impacted at least 4 studies out of 76 for a project at one of our sites. I'd like to move quickly on a resolution.

kofalt commented 6 years ago

Offhand I cannot think of any big ones, other than the obvious: jobs that refer to a file that no-longer / still-does exist would be confusing, etc. But making that reference unambiguous is more work than the straightforward fix described. Probably best left for another day.

kofalt commented 6 years ago

Notably, there should be no expectation that the same gears, or gear rules, will fire from the replacement file. For all intents & purposes it is a new file that silently invalidates everything downgraph.

nagem commented 6 years ago

@ryansanford to ensure we are on the same page:

Currently, when a file on an acquisition is added, has its metadata modified, or is replaced, we evaluate gear rules against the entire container both before and after the change and generate the lists of potential jobs that could be spawned from the "before" container and the "after" container. Comparing the two, we enqueue only the set of jobs that exist in the "after" container but not the "before". This is to prevent endless loops from container-has rules.

After this change, we will do the same, but if the modification is a replace of a file, we will enqueue jobs that exist only in the "after" list as well as jobs that use the replaced file as an input, regardless of whether they are in the "before" list.
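The before/after comparison described above could be sketched roughly as follows. This is an illustrative sketch only: the `Job` shape and the `jobs_to_enqueue` function are assumptions, not the actual scitran API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Minimal stand-in for a potential job spawned by a gear rule."""
    gear: str
    inputs: frozenset  # file names the job would consume


def jobs_to_enqueue(before_jobs, after_jobs, replaced_file=None):
    """before_jobs/after_jobs: sets of Job evaluated against the container
    before and after the modification. Returns the jobs to actually enqueue."""
    # Only jobs that appear after the change but not before, to prevent
    # endless loops from container-has rules.
    new_jobs = set(after_jobs) - set(before_jobs)
    if replaced_file is not None:
        # A replaced file counts as "new": re-enqueue any job that consumes
        # it, even if an identical job already existed before the change.
        new_jobs |= {j for j in after_jobs if replaced_file in j.inputs}
    return new_jobs
```

With no replace, identical before/after lists enqueue nothing; replacing a file re-enqueues every job that takes it as an input.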

nagem commented 6 years ago

Also, as part of this we should formalize the behavior around replacing files. #878 speaks to that discussion. There we only consider a file uploaded with a different hash to be a replace, but this ticket considers any replace to be a "new" file.

ryansanford commented 6 years ago

@nagem Sounds correct in the normal case.

Does this open a new door for an endless loop if a gear replaces one of its input files? If so, are there any reasonable approaches to avoid that?

nagem commented 6 years ago

I do not know of any theoretical gears that would replace their own inputs; it seems counterintuitive to the reproducibility idea behind gears. They might set metadata/attribute information, but replacing the file object seems unintended. I can write additional code that rejects this kind of action so we stop an endless-loop scenario, but it will be an adjustment to what is "allowed" by gears, or at least a formalization of something that had previously been assumed but not noted.

kofalt commented 6 years ago

If you set a gear rule for X, that consumes X, that produces X, you get what you deserve... I would not worry about infinite loop handling, IMHO.

Also, outright stealing my pseudocode from the previous ticket:

```python
def replace(new, old):
    if new.hash == old.hash:
        # Do literally nothing, including no modified update.
        return 304  # Not Modified
    else:
        # Delete all classification / info / whatever fields; store new.
        return 200
```

nagem commented 6 years ago

If you set a gear rule for X, that consumes X, that produces X, you get what you deserve... I would not worry about infinite loop handling, IMHO.

It would be simple to fail on this behavior; why not add it? I'd rather 400/500 than cause a situation that spawns jobs continuously until someone notices.
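The rejection being proposed could look roughly like the sketch below. The function and argument names are assumptions for illustration; the real check would live wherever engine uploads are validated.

```python
# Hypothetical guard: reject an engine upload whose files would replace
# one of the producing job's own inputs, instead of silently enqueueing
# more jobs and risking an endless loop.

def check_no_self_replace(job_inputs, uploaded_names, existing_names):
    """Raise if any uploaded file would replace one of the job's inputs.

    job_inputs     -- file names the job consumed
    uploaded_names -- file names in the upload
    existing_names -- file names already on the target container
    """
    # An upload only "replaces" when the name already exists.
    replaced = set(uploaded_names) & set(existing_names)
    offending = replaced & set(job_inputs)
    if offending:
        # Surface a 400-style error rather than spawning jobs continuously.
        raise ValueError("gear would replace its own input(s): "
                         + ", ".join(sorted(offending)))
```

A gear writing `scan.nii.gz` next to its `scan.dcm` input passes; a gear re-uploading `scan.dcm` itself is rejected.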

nagem commented 6 years ago

# do literally nothing, including no modified update; return 304 Not Modified

We cannot return a 304 if the user uploads multiple files at once: some may replace files with identical hashes, some with different hashes, and some may be entirely new files.

kofalt commented 6 years ago

It would be simple to fail on this behavior; why not add it? I'd rather 400/500 than cause a situation that spawns jobs continuously until someone notices.

Up to you; I feel rather strongly that gears are black boxes and we shouldn't make assumptions about them. This does seem like a rather obvious case where it wouldn't be intended, but we've been wrong on that front before.

nagem commented 6 years ago

Using hash to determine replace as discussed in #878 is one understanding, I want to make sure the context around this ticket doesn't assume a different understanding.

Let's say there are 2 simple rules:

  • Rule 1: When X is present, run Gear 1 that produces Y
  • Rule 2: When Y is present, run Gear 2 that produces Z

If Gear 1 is rerun and produces the same Y (via hash), Rule 2 will not be reevaluated and Gear 2 will not run.

What if Gear 1 is modified to produce more metadata but uploads the same Y (via hash)? Gear 2 will not run and Y's metadata would not be updated.

Is that what you would expect @ryansanford?

kofalt commented 6 years ago

That would agree with my understanding. It is not feasible to do a metadata comparison.

nagem commented 6 years ago

The other option would be that any upload with the same name as an existing file is considered a replace regardless of hash match, including a replace of all metadata.

gsfr commented 6 years ago

The reaper has unavoidable edge cases in which it will upload the same files more than once. We generally wouldn't want to re-trigger an entire pipeline in those cases.

Could we not include the metadata in file replacement evaluations?

ryansanford commented 6 years ago

Using hash to determine replace as discussed in #878 is one understanding, I want to make sure the context around this ticket doesn't assume a different understanding.

Let's say there are 2 simple rules:

  • Rule 1: When X is present, run Gear 1 that produces Y
  • Rule 2: When Y is present, run Gear 2 that produces Z

If Gear 1 is rerun and produces the same Y (via hash), Rule 2 and Gear 2 will not run.

What if Gear 1 is modified to produce more metadata but uploads the same Y (via hash)? Gear 2 will not run and Y's metadata would not be updated.

Is that what you would expect @ryansanford?

Yes, that is what I would expect.

Let me know if these assumptions are off...

nagem commented 6 years ago

We will implement "replace" as a hash comparison. That should alleviate the reaper issue with sending the same file multiple times. If the reaper does send a different file, we would and should rerun gears.

Eventually, with object storage providers, we might not have easy access to the same hash we use internally if the upload is not streamed through the API. To prevent missing hashes, we will allow clients to upload via a signed URL only when we trust that they will provide us with an accurate hash after upload completion (for example, an engine). This should allow all planned work to continue while still maintaining a dependency on accurate hashes at the time of upload.
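The settled behavior can be sketched as follows. The hash algorithm, function name, and the trusted-client path (standing in for the signed-URL case) are assumptions for illustration, not the actual implementation.

```python
import hashlib

# Illustrative sketch of the settled behavior: a same-name upload is a
# replace only when the hash differs; an identical hash is a no-op, which
# absorbs the reaper re-sending the same file.

def decide_replace(existing_hash, new_bytes=None, client_hash=None):
    """Return 'noop' for an identical hash, 'replace' otherwise.

    new_bytes   -- upload streamed through the API; we compute the hash.
    client_hash -- hash reported by a trusted client (e.g. the engine)
                   after a signed-URL upload that bypassed the API.
    """
    if new_bytes is not None:
        new_hash = hashlib.sha256(new_bytes).hexdigest()
    elif client_hash is not None:
        new_hash = client_hash  # trusted only for whitelisted clients
    else:
        raise ValueError("no upload data or trusted hash provided")
    return "noop" if new_hash == existing_hash else "replace"
```

Re-uploading identical bytes is a no-op; different bytes trigger the replace path (cleared metadata, rules re-evaluated).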

I believe this topic is "Settled" and I will begin work now.