Open mortenwh opened 2 months ago
For this to work well, we would need some persistence in the catalog-rebuilder-flask app. The application would need a database to store the results for missing datasets.
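A minimal sketch of what that persistence could look like, using the stdlib `sqlite3` purely for illustration (the table name, columns, and helper functions are hypothetical; the real Flask app might use its own database layer instead):

```python
import sqlite3


def init_db(path):
    """Create the (hypothetical) missing_datasets table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS missing_datasets (
               metadata_identifier TEXT PRIMARY KEY,
               opendap_url TEXT,
               last_checked TEXT
           )"""
    )
    conn.commit()
    return conn


def record_missing(conn, identifier, url):
    """Store or refresh one missing-dataset result."""
    conn.execute(
        "INSERT OR REPLACE INTO missing_datasets VALUES (?, ?, datetime('now'))",
        (identifier, url),
    )
    conn.commit()
```

Using the identifier as primary key keeps the job idempotent: re-running the check simply refreshes `last_checked` instead of duplicating rows.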
Then, in catalog-rebuilder.py, we will implement a Celery task that uses pysolr to connect to the Solr instance, loops through all documents that have an OPeNDAP link, and checks whether the netCDF file can be opened with the netCDF library. If opening it raises an exception with Errno -90 (Not found), we store the result in the database of the Flask app.
Optionally, this job could also set the MMD record to inactive for records where the netCDF file is not found.
Example pysolr query and loop:

```python
from pysolr import Solr
from netCDF4 import Dataset

solr = Solr('<solr_url>')

batch_size = 100

# Get the total number of matching documents (rows=0 fetches no docs)
response = solr.search(q='data_access_url_opendap:[* TO *]', start=0, rows=0)
total_results = response.hits

# Dict of missing netCDF files, keyed by metadata identifier
missing = dict()

# Iterate over the results in pages of batch_size documents each
for page in range(0, total_results, batch_size):
    response = solr.search(
        q='data_access_url_opendap:[* TO *]',
        start=page,
        rows=batch_size,
        fl='metadata_identifier,data_access_url_opendap',
    )
    for doc in response.docs:
        url = doc['data_access_url_opendap']
        try:
            nc = Dataset(url)
            nc.close()
        except OSError:
            missing[doc['metadata_identifier']] = url
```
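Since only Errno -90 ("NetCDF: file not found") should mark a dataset as missing, the task probably should not treat every exception the same way: a temporarily unreachable server is not a missing file. A small helper for that distinction, assuming netCDF4 surfaces the error as an `OSError` with `errno` set to -90 (as the error message suggests):

```python
# Error code netCDF reports when an OPeNDAP URL does not resolve to a
# dataset ("NetCDF: file not found"); assumed from the Errno -90 message.
NC_ENOTFOUND = -90


def is_missing_dataset(exc):
    """True only for the 'NetCDF: file not found' error.

    Other OSErrors (timeouts, DNS failures, server errors) are treated
    as transient and should not flag the dataset as missing.
    """
    return isinstance(exc, OSError) and exc.errno == NC_ENOTFOUND
```

In the loop above, the `except OSError` branch would then only record the document in `missing` when `is_missing_dataset(e)` is true, and could log other failures for a retry instead.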
This functionality will also benefit us for the ADC/NBS index. So I might create this Celery task in another repository, and then the catalog_rebuilder can use the task by importing it.
Apparently, some datasets are no longer available via their `data_access` links. We could use the catalog-rebuilder to monitor the status of URLs. @magnarem - can you add details, if necessary?
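For monitoring plain HTTP `data_access` links (as opposed to OPeNDAP endpoints that need the netCDF library), a lightweight availability check could be enough. A sketch using only the stdlib; the function name and timeout are illustrative:

```python
import urllib.error
import urllib.request


def url_available(url, timeout=10):
    """Return True if the URL answers a HEAD request with HTTP < 400."""
    req = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers connection refused, DNS failure, timeout, and HTTP errors
        return False
```

A HEAD request avoids downloading the data itself, though some servers mishandle HEAD, so a ranged GET could be a fallback if false negatives show up.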