Otto has been seeing the archiver on one of his systems jam up after about 20 hours of operation. WLCG may be having the same problem. Both Otto and WLCG are using the RMQ and HTTP archivers.
Based on the debug logs, it looks like the archiver reaches its limit on parallel archivings and then stalls because the archivings in progress never finish. Need to determine why that is.
Known things to do:
[ ] Put a cap on the amount of time a worker thread should spend waiting for a process and fail nicely if none can be had. (See code.)
[ ] Probably need more debug logging in the process pools to see why they're not reporting on unfinished archivings, or why the program isn't noticing that processes have died, if that's what's happening.
[ ] Should mark archivings in the database as underway and not do the TTL sweep on them when in that state. Let the archiver make a positive determination of how things went.
[ ] StreamingJSONProgram's _roundtrip() method should take a reasonably long timeout, provided by the instantiator, in case the called program never returns.
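
As a starting point for the first item, here is a minimal sketch of capping how long a worker waits for a process and failing cleanly when none becomes available. It is not the archiver's actual code: `BoundedSlots`, `PoolTimeout`, `archive_one` and the `wait_limit` parameter are hypothetical names, and a semaphore stands in for the real process pool.

```python
import threading


class PoolTimeout(Exception):
    """Raised when no process slot frees up within the allowed wait."""


class BoundedSlots:
    """Hypothetical stand-in for a process pool with a fixed number of slots."""

    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)

    def acquire(self, timeout):
        # Semaphore.acquire() returns False if the timeout expires first.
        if not self._slots.acquire(timeout=timeout):
            raise PoolTimeout("no process available after %.1f s" % timeout)

    def release(self):
        self._slots.release()


def archive_one(slots, work, wait_limit):
    """Run work() in a slot, or report failure instead of blocking forever."""
    try:
        slots.acquire(wait_limit)
    except PoolTimeout as ex:
        # Fail nicely: the caller gets a result it can log or retry on.
        return {"succeeded": False, "error": str(ex)}
    try:
        return {"succeeded": True, "result": work()}
    finally:
        # Always give the slot back, even if work() raises.
        slots.release()
```

With this shape, a worker that can't get a process returns a failed result the archiver can record or retry, rather than holding one of the parallel-archiving slots indefinitely.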