rucio / rucio

Rucio - Scientific Data Management
http://rucio.cern.ch
Apache License 2.0

Provide last failure reason for each file in a stuck rule #781

Open bbockelm opened 6 years ago

bbockelm commented 6 years ago

Right now, the status of a replication rule only provides a single high-level error summary. For example:

$ rucio rule-info --examine db3277cdd4324dcc87074cc8c84ee324
Status of the replication rule: TRANSFER_FAILED:TRANSFER [5] TRANSFER CHECKSUM MISMATCH USER_DEFINE and SRC checksums are different. 09e3c4ef != 8f14f45a
STUCK Requests:
  cms:/store/mc/RunIISummer17PrePremix/Neutrino_E-10_gun/GEN-SIM-DIGI-RAW/MCv2_correctPU_94X_mc2017_realistic_v9-v1/20001/2E63DAAA-930C-E811-867F-0242AC130002.root
    RSE:                  T3_US_NERSC_REAL
    Attempts:             40
    Last Retry:           2018-02-26 07:52:37
    Last error:           FAILED
    Last source:          T2_US_Nebraska_REAL
    Available sources:    T2_US_Nebraska_REAL
    Blacklisted sources:  
  cms:/store/mc/RunIISummer17PrePremix/Neutrino_E-10_gun/GEN-SIM-DIGI-RAW/MCv2_correctPU_94X_mc2017_realistic_v9-v1/00032/2ABA00E8-A610-E811-ADDB-0242AC130002.root
    RSE:                  T3_US_NERSC_REAL
    Attempts:             40
    Last Retry:           2018-02-26 13:38:41
    Last error:           FAILED
    Last source:          T2_US_Nebraska_REAL
    Available sources:    T2_US_Nebraska_REAL
    Blacklisted sources:  

(this particular case goes on for another 200 files or so...)

However, it's not practical to debug this rule because I have no clue which error corresponds to which file.

Modifications

Two suggestions:

  1. In the output of rucio rule-info --examine ...., record the last error message per stuck file.
  2. Also record the last FTS job ID where the file failed (will probably also need the FTS server URL).

bari12 commented 6 years ago

For 1., this needs a little bit of work, as the error is not propagated to the locks. Only the last error is stored, at the rule level.

  2. This information is in requests_history; it just needs to be added to the dictionary.
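
A rough sketch of what "added to the dictionary" could mean, assuming the per-file entries are built from the most recent requests_history row for each file. This is not the actual Rucio code: `RequestHistoryRow`, its field names (`err_msg`, `external_id`, `external_host`), the function name, and the output keys are all hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class RequestHistoryRow:
    """Hypothetical stand-in for one requests_history row (not the real Rucio model)."""
    scope: str
    name: str
    state: str                    # e.g. "FAILED"
    err_msg: Optional[str]        # assumed: per-request error text
    external_id: Optional[str]    # assumed: FTS job ID
    external_host: Optional[str]  # assumed: FTS server URL
    created_at: datetime


def last_failure_per_file(
    rows: Iterable[RequestHistoryRow],
) -> Dict[Tuple[str, str], dict]:
    """Build one report entry per file from its most recent history row."""
    latest: Dict[Tuple[str, str], RequestHistoryRow] = {}
    for row in rows:
        key = (row.scope, row.name)
        # Keep only the newest row seen so far for this file.
        if key not in latest or row.created_at > latest[key].created_at:
            latest[key] = row

    return {
        key: {
            "last_error": row.state,
            "last_error_message": row.err_msg,   # suggestion 1
            "last_fts_job_id": row.external_id,  # suggestion 2
            "last_fts_server": row.external_host,
            "last_time": row.created_at,
        }
        for key, row in latest.items()
    }
```

With entries like these, the `--examine` output could print an error message and an FTS job ID under each stuck file instead of only `FAILED`.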