We need a test to ensure that data harvested from a data provider makes it all the way through our ETL pipeline and into the web application. The data is harvested, transformed with traject, and loaded into the DLME web application. We already have a test in Airflow that checks that the number of records harvested matches the number of records in the Intermediate Representation (IR) after the transform. However, there are still cases where traject does not throw an error but Spotlight rejects something about a record in the IR. Sometimes this produces an error when we attempt to load the records into Spotlight, but sometimes no error surfaces and some of the records simply fail to load. In these cases, the only indication that something went wrong is that the record count in Spotlight doesn't match the record count in the IR. We need a way to compare the record count in the IR to the record count in Spotlight. This might be a browser test, a Solr query, or both.
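As a starting point for the Solr query option, here is a minimal sketch in Python of what the comparison could look like. It rests on several assumptions that are not confirmed by this issue: that the IR is newline-delimited JSON with one record per line, that the Solr index behind Spotlight is reachable over HTTP, and that a filter query can isolate the records for one data provider. The function names, the `dlme` collection name, and the filter field in the usage note are all hypothetical. The only Solr behavior relied on is the standard `select` handler, which returns `response.numFound` and no documents when `rows=0`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_ir_records(lines):
    """Count records in the IR, ASSUMING one JSON record per non-blank line."""
    return sum(1 for line in lines if line.strip())


def build_solr_count_url(solr_url, collection, filter_query):
    """Build a select URL that asks Solr for the hit count only (rows=0)."""
    params = urlencode({"q": "*:*", "fq": filter_query, "rows": 0, "wt": "json"})
    return f"{solr_url.rstrip('/')}/{collection}/select?{params}"


def parse_num_found(body):
    """Extract the total hit count from a Solr JSON select response."""
    return json.loads(body)["response"]["numFound"]


def fetch_solr_count(solr_url, collection, filter_query):
    """Query Solr over HTTP and return the number of matching documents."""
    url = build_solr_count_url(solr_url, collection, filter_query)
    with urlopen(url, timeout=30) as response:
        return parse_num_found(response.read().decode("utf-8"))


def check_counts(ir_count, solr_count):
    """Raise if the IR and Solr disagree, so a calling task would fail."""
    if ir_count != solr_count:
        raise ValueError(
            f"Record count mismatch: IR has {ir_count}, Solr has {solr_count}"
        )
```

A task that runs after the load step could open the IR file, pass it to `count_ir_records`, call `fetch_solr_count` with the real Solr URL and a filter such as `provider_field:"some-provider"` (a placeholder, not a real DLME field), and hand both numbers to `check_counts`. A Solr query like this only confirms that the records were indexed; a browser test would still be needed to confirm that they are visible in the web application.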