Monitor data access URLs

For this to work well, we need some persistence in the catalog-rebuilder-flask app: the application needs a database in which to store the datasets found to be missing.
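As a rough illustration of the persistence described above, here is a minimal sketch using SQLite from the Python standard library. The table name `missing_datasets`, its columns, and the helper functions are hypothetical; the issue does not prescribe a database or a schema, and the Flask app may well use something else.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(path=":memory:"):
    """Open the database and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS missing_datasets (
               metadata_identifier TEXT PRIMARY KEY,
               opendap_url         TEXT NOT NULL,
               last_checked        TEXT NOT NULL
           )"""
    )
    return conn


def record_missing(conn, metadata_identifier, opendap_url):
    """Insert or refresh one missing-dataset row."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO missing_datasets VALUES (?, ?, ?)",
            (metadata_identifier, opendap_url, now),
        )


def list_missing(conn):
    """Return (metadata_identifier, opendap_url) pairs, sorted by identifier."""
    cur = conn.execute(
        "SELECT metadata_identifier, opendap_url "
        "FROM missing_datasets ORDER BY metadata_identifier"
    )
    return cur.fetchall()
```

Keying the table on `metadata_identifier` means a repeated check of the same record updates the existing row instead of adding a duplicate.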

Then catalog-rebuilder.py will implement a Celery task that uses pysolr to connect to the Solr instance, loops through all documents that have an OPeNDAP link, and checks whether the netCDF file can be opened with the netCDF4 library. If opening raises an exception with Errno -90 (file not found), we store the result in the database in the Flask app. Optionally, this job can also set the MMD record to inactive when the netCDF file is not found.
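The decision logic of such a task could be sketched as below. This is only an outline under stated assumptions: `find_missing` and its `opener` parameter are hypothetical names, and the function is deliberately kept free of Celery, pysolr, and netCDF4 so it can be tested on its own. In the real task, `opener` would presumably be `netCDF4.Dataset`, `docs` would come from the Solr query, and the function would be called from inside the Celery task.

```python
NC_ENOTFOUND = -90  # netCDF error code for "file not found"


def find_missing(docs, opener):
    """Return {metadata_identifier: url} for documents whose file is not found.

    docs   -- iterable of dicts with 'metadata_identifier' and
              'data_access_url_opendap' keys (as returned by Solr)
    opener -- callable that opens a URL and raises OSError on failure
              (assumption: netCDF4.Dataset in the real task)
    """
    missing = {}
    for doc in docs:
        url = doc["data_access_url_opendap"]
        # Assumption: the field may be multi-valued, i.e. arrive as a list
        if isinstance(url, list):
            url = url[0]
        try:
            handle = opener(url)
        except OSError as err:
            # Only "not found" counts as missing; other errors are skipped
            if err.errno == NC_ENOTFOUND:
                missing[doc["metadata_identifier"]] = url
            continue
        # Release the handle if the opener returned something closable
        close = getattr(handle, "close", None)
        if callable(close):
            close()
    return missing
```

Passing the opener in as an argument keeps the check independent of the network, so the errno handling can be exercised with a fake opener before it is wired into the task.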

Example pysolr query and loop:

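from netCDF4 import Dataset
from pysolr import Solr

solr = Solr('<solr_url>')

# Match every document that has an OPeNDAP data access URL
query = 'data_access_url_opendap:[* TO *]'
batch_size = 100

# Request zero rows first, only to get the total number of matching documents
response = solr.search(query, start=0, rows=0)
total_results = response.hits

# Map metadata_identifier -> OPeNDAP URL for netCDF files that were not found
missing = dict()

# Iterate over the results in pages of batch_size documents each
for start in range(0, total_results, batch_size):
    response = solr.search(query, start=start, rows=batch_size,
                           fl='metadata_identifier,data_access_url_opendap')
    for doc in response.docs:
        url = doc['data_access_url_opendap']
        # Assumption: the field may be multi-valued and come back as a list
        if isinstance(url, list):
            url = url[0]
        try:
            # Opening is enough to prove the file exists; close it right away
            with Dataset(url):
                pass
        except OSError as err:
            # Errno -90 is the netCDF "file not found" error
            if err.errno == -90:
                missing[doc['metadata_identifier']] = url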

This functionality will also benefit us for the ADC/NBS index. I might therefore create this Celery task in a separate repository, so that catalog_rebuilder can use it by importing it.
