sot / jobwatch

Watch files, database tables, and log files to ensure valid cron processing

Expectations for eng_archive available data by type #47

Open jeanconn opened 3 years ago

jeanconn commented 3 years ago

With regard to the eng_archive processing and data, currently jobwatch is checking that the time on the dp_pcad32 is not more than 2 days old:

https://github.com/sot/jobwatch/blob/fa99427c810e2edc6b264c8395f369331c503632/jobwatch/skawatch.py#L188

If I look through the other TIME.h5 files right now, it looks like most have data up through the last comm pass before ingest (about 30 hours old). The ephem data go out into the future. The acisdeahk values are stale by about another 5 hours or so. Should we have any more specific checking on any of these types in the archive?

Related to this, @Gregg140 had asked me about what we expect for turn around on the acisdeahk telem; I don't quite understand the CXCDS archive steps by which it may take longer to be available.

taldcroft commented 3 years ago

The current check is on the date of the TIME.h5 file. If that file has been touched by processing then it means new records were added. The dp_pcad32 content type is the last one (or near the end) that gets done, so this is a solid indicator that processing completed.

Assuming that the dp_pcad32/TIME.h5 file is newer than 2 days, this means that the archive update process succeeded in ingesting all available new L0 data from the CXC archive, and that there was at least some new data.
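For illustration, the kind of file-date check described here can be sketched with the standard library alone. This is not the code in skawatch.py; the function names and the `max_age_days` parameter are hypothetical, and the 2-day limit is simply the value mentioned in this thread.

```python
import os
import time


def file_age_days(path, now=None):
    """Return how long ago the file was last modified, in days.

    `now` is a Unix timestamp; it defaults to the current time and is
    exposed only so the check can be exercised with a fixed clock.
    """
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 86400.0


def is_fresh(path, max_age_days=2.0, now=None):
    """True if the file was touched within the last `max_age_days` days.

    This only says that processing wrote to the file recently. It says
    nothing about how old the data inside the file are.
    """
    return file_age_days(path, now=now) <= max_age_days
```

A call such as `is_fresh(".../dp_pcad32/TIME.h5")` would then pass as long as processing touched the file in the last two days, whatever the latest telemetry time stored in it.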

What it doesn't tell you is how old the actual data in the files are. In other words, what are the latencies in CXC processing. I've generally taken the approach of assuming that DSops is staying as up-to-date as the system allows. I guess you are suggesting more detailed checks that DSops is doing their job.

About ACIS DEA housekeeping, there might be some connection with ACIS L1 processing, so that HK doesn't come down until an ACIS observation is complete and through processing. At least I do recall something like that, but I'm not sure it applies in this case. This is something that the DSops crew would be able to quickly answer.

jeanconn commented 3 years ago

Somewhat my mistake on this... I thought the TIME.h5 read was using the H5Watch (which I see is in the hourly watch) that reads the 'time' column and uses the last time.

jeanconn commented 3 years ago

I was not really suggesting more checks that "DSops is doing their job". For that, we could add a separate check on the availability of something else using just arc4gl. Really I was just noting that, to try to come out ahead of issues, I use jobwatch for both an overview (did the jobs work) and detailed assessment (are the outputs / representative files that are dependencies of other tasks up-to-date). It is useful to me to review expectations on "up-to-date" with the new jobwatch.

jeanconn commented 3 years ago

Regarding "I don't quite understand the CXCDS archive steps by which it may take longer to be available." with regard to acisdeahk, I had been wondering if they were tied to level 2 V&V (which didn't make much sense). Good to know that they are tied to ACIS science runs, so a stop science is needed in the raw telemetry before they are spit out.

taldcroft commented 3 years ago

Agreed that awareness is a good thing. In the semi-infinite sea of things we need to get done, I would put this as low-moderate priority because I think the existing check will catch the most common case of either AP or our processing not running.

jeanconn commented 3 years ago

Right. Mostly I was just thinking, as low-hanging fruit, about switching the checks here to read the last time from TIME.h5, adding checks on ephem and acisdeahk TIME, and throwing in some comments.
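A minimal sketch of what such data-time checks might look like, once the last time value has been read from each TIME.h5 file. Reading the file is left out. Everything named here is an assumption for illustration: the limits in `MAX_AGE_HOURS` are hypothetical placeholders rather than agreed expectations, the keys are the shorthand used in this thread, and `last_time` and `now` are assumed to be seconds on the same time scale.

```python
# Hypothetical limits on the age of the newest sample, in hours.
# A negative age means the newest sample lies in the future.
MAX_AGE_HOURS = {
    "dp_pcad32": 48.0,  # placeholder, same as the existing 2-day file check
    "acisdeahk": 48.0,  # placeholder; thread notes this lags by ~5 more hours
    "ephem": 0.0,       # ephemeris is expected to extend past "now"
}


def data_age_hours(last_time, now):
    """Age of the newest sample in hours (negative if in the future)."""
    return (now - last_time) / 3600.0


def check_content(content, last_time, now, limits=MAX_AGE_HOURS):
    """Return (ok, age_hours) for one content type.

    `ok` is True when the newest sample is no older than the limit
    configured for that content type.
    """
    age = data_age_hours(last_time, now)
    return age <= limits[content], age
```

Keeping the limits in one table makes the expectation for each type explicit, which is most of what the comments mentioned above would document.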