sul-dlss / libsys-airflow

Airflow DAGS for migrating and managing ILS data into FOLIO along with other LibSys workflows
Apache License 2.0

Airflow Task: Generate and Email Load report #267

Closed jermnelson closed 1 year ago

jermnelson commented 1 year ago

Given a VendorFile, add an airflow task that will gather data used to format an email report, create the report content, and send the email report.

Initially this task should be a Sensor task that queries FOLIO until the data below is available and then sends the report. At that time, we can update the file status as well (can be a separate ticket).

I've linked the API documentation below; you may find a better endpoint that provides richer data (e.g. indicating that the load is complete, has errors, etc.), but for now the one below is sufficient.

The report should provide, at minimum, the data shown in the email format below.

FOLIO data can be fetched at the /metadata-provider/jobSummary/{jobExecutionId} endpoint (https://s3.amazonaws.com/foliodocs/api/mod-source-record-manager/p/metadata-provider.html#metadata_provider_jobsummary__jobexecutionid__get), which produces output like:

{'jobExecutionId': 'd7460945-6f0c-4e74-86c9-34a8438d652e',
 'totalErrors': 0,
 'sourceRecordSummary': {'totalCreatedEntities': 37,
  'totalUpdatedEntities': 0,
  'totalDiscardedEntities': 0,
  'totalErrors': 0},
 'instanceSummary': {'totalCreatedEntities': 37,
  'totalUpdatedEntities': 0,
  'totalDiscardedEntities': 0,
  'totalErrors': 0}}

Note that more data may be available through other APIs, feel free to investigate, but so far only minimal data has been shown to be available from FOLIO.
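The polling described above could be sketched as a small helper that a sensor's poke() method calls. This is only a sketch: the base URL is a placeholder, and treating the presence of both record summaries as the readiness signal is an assumption, not the actual libsys-airflow logic.

```python
import json
import urllib.request

# Hypothetical Okapi/FOLIO gateway URL -- substitute the real one.
FOLIO_BASE = "https://okapi.example.edu"


def fetch_job_summary(job_execution_id: str, okapi_token: str) -> dict:
    """Fetch the load summary for one job execution from
    mod-source-record-manager's metadata-provider endpoint."""
    url = f"{FOLIO_BASE}/metadata-provider/jobSummary/{job_execution_id}"
    request = urllib.request.Request(url, headers={"x-okapi-token": okapi_token})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summary_is_ready(summary: dict) -> bool:
    """Readiness check a sensor's poke() could return: assume the report
    data is available once both record summaries are present."""
    return "sourceRecordSummary" in summary and "instanceSummary" in summary
```

A Sensor task would call fetch_job_summary on each poke interval and return summary_is_ready(summary), rescheduling itself until FOLIO has the data.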

Format of the email:

Subject: Vendor Name (Vendor code) - (file_name) - Report

Body:

FOLIO Catalog MARC Load started on Friday, May 12, 2023, 5:07 AM
  Filename: [INSERT FILENAME] - link to download the file.
  74 bib record(s) read from MARC file.
  74 SRS records created.
  0 SRS records updated.
  0 SRS records discarded.
  0 SRS record errors.
  74 Instance records created.
  0 Instance records updated.
  0 Instance records discarded.
  0 Instance record errors.

All emails will be sent to: folio-load-reports@lists.stanford.edu
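A sketch of turning the jobSummary payload into that body. The helper name is hypothetical, and the "records read" count is an assumption (the sum of created, updated, and discarded SRS records) because the endpoint does not report reads directly.

```python
REPORT_RECIPIENT = "folio-load-reports@lists.stanford.edu"


def format_report_body(summary: dict, filename: str, started: str) -> str:
    """Render the load-report email body from a jobSummary payload."""
    srs = summary["sourceRecordSummary"]
    instance = summary["instanceSummary"]
    # Assumption: records read = created + updated + discarded SRS records,
    # since the endpoint has no explicit "records read" field.
    records_read = (srs["totalCreatedEntities"] + srs["totalUpdatedEntities"]
                    + srs["totalDiscardedEntities"])
    lines = [
        f"FOLIO Catalog MARC Load started on {started}",
        f"  Filename: {filename}",
        f"  {records_read} bib record(s) read from MARC file.",
        f"  {srs['totalCreatedEntities']} SRS records created.",
        f"  {srs['totalUpdatedEntities']} SRS records updated.",
        f"  {srs['totalDiscardedEntities']} SRS records discarded.",
        f"  {srs['totalErrors']} SRS record errors.",
        f"  {instance['totalCreatedEntities']} Instance records created.",
        f"  {instance['totalUpdatedEntities']} Instance records updated.",
        f"  {instance['totalDiscardedEntities']} Instance records discarded.",
        f"  {instance['totalErrors']} Instance record errors.",
    ]
    return "\n".join(lines)
```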

jermnelson commented 1 year ago

@ahafele are there other reporting requirements that are needed?

ahafele commented 1 year ago

@jermnelson This looks good for a basic bib load. It will be slightly different for Order loads and Invoice loads but the basic structure can stay the same. Could we include the Interface name and list of HRIDs?

aaron-collier commented 1 year ago

@jermnelson and @ahafele based on our conversation this morning, it seems like we have access to the above information (outside of FOLIO, via its API), but not until we know the load has completed.

I'm curious about our approach here: since that data is very likely not to be available during the data import DAG run, should we have a reporting DAG that looks at files in the loaded state, attempts to fetch that data, and then sends the email report? This could run hourly (or on any other schedule, really), and I suspect we can define conditions on the data to determine whether we are ready to send a report.

If we take that approach, do we also want an email going out when either the data fetch is complete or the data processing DAG is complete, knowing that the data import job is likely still pending?

Thoughts?
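One way to sketch the selection condition for such a reporting DAG; the field names here are assumptions about the VendorFile model, not the actual schema. An hourly run would pass the surviving files to a fetch-and-email task.

```python
def files_ready_for_report(vendor_files: list[dict]) -> list[dict]:
    """Pick the VendorFile rows an hourly reporting DAG should act on:
    the load finished, a FOLIO job execution id is known, and no report
    has been sent yet. All three field names are hypothetical."""
    return [
        vendor_file
        for vendor_file in vendor_files
        if vendor_file.get("status") == "loaded"
        and vendor_file.get("job_execution_id")
        and not vendor_file.get("report_sent")
    ]
```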

jermnelson commented 1 year ago

Hi @aaron-collier, I think having a separate reporting DAG (hourly would be fine) that checks for files in a loaded state and then generates and emails the report when the returned value from the API meets our criteria should be acceptable. I would worry about email fatigue if we sent three emails for each job: one when the data fetch DAG completes, one when the data processing DAG completes, and one from the reporting DAG when the data import job completes.

ahafele commented 1 year ago

Hi @aaron-collier, thanks for this. An hourly (or other) schedule sounds good, but I agree with @jermnelson that we probably don't need all three emails. We can certainly skip the email when a data processing DAG completes. An email when a data fetch DAG completes would be useful mostly in the "fetch only" scenarios we discussed yesterday. If it is easier to make that notification across all interfaces, then we can just deal with a little extra email.

aaron-collier commented 1 year ago

@jermnelson / @ahafele - thanks, that helps. I'll update this ticket to be actionable, write the reporting DAG ticket, and a ticket to report on fetch-only interfaces as well.