metno / catalog-rebuilder

Catalog rebuilder of CSW and Solr for the S-ENDA project
Apache License 2.0

Monitor data access urls #29

Open mortenwh opened 2 months ago

mortenwh commented 2 months ago

Apparently, some datasets are no longer available via their data_access links. We could use the catalog-rebuilder to monitor the status of these URLs.

@magnarem - can you add details, if necessary?

magnarem commented 2 months ago

For this to work well, we would need some persistence in the catalog-rebuilder-flask app: the application would need a database to store the results for missing datasets.
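A minimal sketch of what that persistence could look like, assuming SQLite through Python's standard `sqlite3` module. The database choice, the table name `missing_datasets`, its columns, and the helper names are all hypothetical; nothing here is decided yet.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(path=":memory:"):
    """Open the database and create the (hypothetical) results table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS missing_datasets ("
        " metadata_identifier TEXT PRIMARY KEY,"
        " opendap_url TEXT NOT NULL,"
        " last_checked TEXT NOT NULL)"
    )
    return conn


def record_missing(conn, metadata_identifier, opendap_url):
    """Insert or refresh the entry for a dataset whose file was not found."""
    checked = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO missing_datasets VALUES (?, ?, ?)",
        (metadata_identifier, opendap_url, checked),
    )
    conn.commit()


def list_missing(conn):
    """Return (metadata_identifier, opendap_url) pairs, sorted by identifier."""
    rows = conn.execute(
        "SELECT metadata_identifier, opendap_url FROM missing_datasets"
        " ORDER BY metadata_identifier"
    )
    return rows.fetchall()
```

Keying the table on `metadata_identifier` means a repeated check updates the existing row instead of adding duplicates, so the table always reflects the latest run.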

Then catalog-rebuilder.py would implement a Celery task that uses pysolr to connect to the Solr instance, loops through all documents that have an OPeNDAP link, and checks whether the netCDF file can be opened with the netCDF library. If opening raises an exception with Errno -90 (Not found), we store the result in the Flask app's database. Optionally, this job can also set the MMD to inactive for the records whose netCDF file is not found.

Example pysolr query and loop:

from netCDF4 import Dataset
from pysolr import Solr

solr = Solr('<solr_url>')

# Match every document that has an OPeNDAP data access URL
query = 'data_access_url_opendap:[* TO *]'
batch_size = 100

# Get the total number of matching documents without fetching any of them
response = solr.search(q=query, start=0, rows=0)
total_results = response.hits

# Dict of missing netCDF files: metadata_identifier -> OPeNDAP URL
missing = dict()
# Iterate over the results in pages of batch_size documents each
for start in range(0, total_results, batch_size):
    response = solr.search(q=query, start=start, rows=batch_size,
                           fl='metadata_identifier,data_access_url_opendap')
    for doc in response.docs:
        # Assumes the field holds a single URL string
        url = doc['data_access_url_opendap']
        try:
            # netCDF4 raises OSError (e.g. Errno -90) if the file cannot be opened
            Dataset(url).close()
        except OSError:
            missing[doc['metadata_identifier']] = url
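The loop treats every failure to open a file as "missing". A rough sketch of how the check could single out the Errno -90 case follows, so that other errors (for example a temporarily unreachable server) are not recorded as missing files. The function name `check_opendap`, the constant `NC_ENOTFOUND`, and the three status strings are hypothetical. The sketch also assumes that the opener raises an `OSError` whose `errno` attribute is -90 in the not-found case, which should be verified against the installed netCDF4 version.

```python
NC_ENOTFOUND = -90  # errno reported in the "Not found" case described above


def check_opendap(url, opener):
    """Return 'ok', 'missing', or 'error' for one OPeNDAP URL.

    `opener` is any callable that opens the URL and returns an object
    with a close() method, e.g. netCDF4.Dataset.
    """
    try:
        opener(url).close()
    except OSError as err:
        # Only errno -90 means the file is gone; anything else may be transient
        return "missing" if err.errno == NC_ENOTFOUND else "error"
    return "ok"
```

With netCDF4 installed, the call would be `check_opendap(url, Dataset)`. Passing the opener as an argument keeps the check testable without network access.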

This functionality would also benefit us for the ADC/NBS index, so I might create this Celery task in another repository. The catalog_rebuilder can then use the task by importing it.