scitran / core

RESTful API
https://scitran.github.io
MIT License
18 stars 18 forks source link

Mark failed job output #1015

Closed ambrussimon closed 6 years ago

ambrussimon commented 6 years ago

Resolves #968

Notes/TBDs:

Review Checklist

codecov-io commented 6 years ago

Codecov Report

Merging #1015 into master will increase coverage by 0.07%. The diff coverage is 91.07%.

@@            Coverage Diff             @@
##           master    #1015      +/-   ##
==========================================
+ Coverage    90.7%   90.78%   +0.07%     
==========================================
  Files          50       50              
  Lines        6802     6869      +67     
==========================================
+ Hits         6170     6236      +66     
- Misses        632      633       +1
kofalt commented 6 years ago

To be clear, would the engine call /prepare-complete, get a ticket, and then call /upload/engine with that ticket ID? In what scenario would someone call /accept-failed-output?

ambrussimon commented 6 years ago

@kofalt Correct. Formalizing just a bit more (current implementation):

On /accept-failed-output: my naive assumption was that there might be algs that have meta/files in their failure output that users still would like to store/further analyze. This shouldn't be the default though, hence the marking, but leaving the option.

Another TBD I forgot about was whether to require the job_ticket_id URL param at all during /engine/upload, since I can find the relevant ticket just from the job_id... But then we are ever closer to just storing the success/failure on the job itself.

kofalt commented 6 years ago

Is there a purpose for the level, id, and job_id fields if there is already job_ticket_id? I would remove all but the later.

Other than that, I think this sounds good. I will have to think about it again later when I do a proper code review & test extensively with the current engine.

Another TBD I forgot about was whether to require the job_ticket_id URL param at all during /engine/upload, since I can find the relevant ticket just from the job_id

I would not do that; it is not valid to assume that there is a 1:1 ratio of jobs to job-completion-tickets.

kofalt commented 6 years ago

Also, do you have any plans for disabling the old upload behavior in a later release? Probably just flipping a switch that rejects /engine/upload if there is no job_ticket_id, and removing any related code.

ambrussimon commented 6 years ago

@nagem @kofalt Fixed/added everything discussed + rebased. PR now ready for final review.

kofalt commented 6 years ago

Code looks pretty good, I made some comments above that ended up in collapsed sections, please check them. Some code moved around so it's harder to tell which comments still apply.

I'm going to run at test job load on this API branch and report back on my results.

nagem commented 6 years ago

One requirement that I realized was missed in the writeup on #968 is that jobs should not spawn until file results are accepted. The function signature for calling jobs shouldn't be too difficult to reproduce, since we are also applying the metadata here:

rules.create_jobs(db, container_before, container_after, container_type)

Note: this can be called after all files are accepted (doesn't need to be per file like it is in placer logic), and is only called on non-analysis jobs.

We would also need to modify rule evaluation to skip files that have the failed marking. For efficiency we can also skip it in the placer if the job is failed but it would need to exist in the rules because the create_jobs logic checks to see all jobs that could have been spawned by the old container and all jobs that can be spawned with the new container and creates the ones in the diff between new and old.

Sorry I didn't notice this until now, let me know if anything is unclear in my description above and we can sort it out. After that change and the one comment I added about ContainerReferencing 500ing, this is good to go in my opinion. @kofalt looks like he had a check he'd like to run before his approval, though.

kofalt commented 6 years ago

Correct. I've been out for a few days and been unable to run my tests, but it sounds like one more modification needs to be made. Once the final changes are ready, I'll hit this branch with the engine properly to make sure it's good to go.

ambrussimon commented 6 years ago

@kofalt @nagem Fixed. Please go ahead and hit it with the engine.

nagem commented 6 years ago

Final changes look good from my end 👍. Nice work on this ticket!

kofalt commented 6 years ago

Did a quick rebase to make sure branch is still good, I hope you don't mind. Feature checks out on current engine from my side of things.

Feature looks good to go pending a question above.

nagem commented 6 years ago

Access log is added, feel free to merge when CI is green @ambrussimon.

ambrussimon commented 6 years ago

Verified locally that no rebase is needed. Merging.