Open jbusecke opened 2 months ago
Just to show how this breaks:
If I add this code to get_files:
# Add **matching** facets to the search. It fails 😝
iid = dataset_ids[0].split('|')[0]
facets = facets_from_iid(iid)
facets['version'] = facets['version'].replace('v','')
# removed facets: 'version', 'mip_era', 'activity_id', 'institution_id', 'source_id', 'variant_label', 'experiment_id', 'table_id', 'grid_label'
for k in ['variable_id']:
    params[k] = facets[k]
params.update(facets)
The assertion breaks, so ideally one should not add any facets at all to the request parameters for the 'file query'.
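For readers who have not seen these ids before, here is a minimal sketch of the parsing step used above. It assumes the dataset id has the form "iid|data_node" (as the split('|') above implies) and that the iid follows the CMIP6 facet order. parse_cmip6_iid and split_dataset_id are hypothetical stand-ins, not the actual facets_from_iid implementation, and the example id and data node are made up for illustration.

```python
from typing import Dict, Tuple

# Assumed facet order of a CMIP6 instance id (iid).
CMIP6_FACETS = [
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "variant_label", "table_id", "variable_id", "grid_label", "version",
]


def split_dataset_id(dataset_id: str) -> Tuple[str, str]:
    """Split 'iid|data_node' into its two parts (data_node is '' if absent)."""
    iid, _, data_node = dataset_id.partition("|")
    return iid, data_node


def parse_cmip6_iid(iid: str) -> Dict[str, str]:
    """Hypothetical stand-in for facets_from_iid: map iid parts to facet names."""
    parts = iid.split(".")
    if len(parts) != len(CMIP6_FACETS):
        raise ValueError(f"expected {len(CMIP6_FACETS)} facets, got {len(parts)}: {iid}")
    return dict(zip(CMIP6_FACETS, parts))


# Illustrative id only; the data node is a placeholder.
example_id = "CMIP6.CMIP.NCAR.CESM2.historical.r1i1p1f1.Amon.tas.gn.v20190308|esgf-data.example.org"
iid, data_node = split_dataset_id(example_id)
facets = parse_cmip6_iid(iid)
# Strip the leading 'v' from the version, as in the snippet above.
facets["version"] = facets["version"].replace("v", "")
```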
I have now experienced several times that I was able to parse iids, but the URL search would then not return anything. I think I finally understand why. OK, first let's establish two iid lists that work/don't work with get_urls_from_esgf:
This confirms that we found NO info on any of the first set of iids, and info for all of the second set.
Now let's test this with intake-esgf:
intake-esgf finds info for ALL of the iids in either set!
So what the heck am I doing wrong here? Digging into the code of intake-esgf more I am getting a suspicion:
The general pattern of intake-esgf is to do two sorts of queries to the ESGF REST API
id field which is formatted as "
So this represents some sort of 'nested' query. If we try that approach with vanilla requests, we see that it works!
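A rough sketch of what such a 'nested' query could look like with plain requests: first a Dataset query with facets, then one File query per returned dataset id, with no facets attached. This is my reading of the pattern, not intake-esgf's actual code. The endpoint URL, the dataset_id parameter, and the response["response"]["docs"] layout are assumptions about the ESGF search API, and all function names are hypothetical.

```python
from typing import Callable, Dict, List

# Assumption: an ESGF search endpoint; this URL is illustrative and may not be current.
SEARCH_URL = "https://esgf-node.llnl.gov/esg-search/search"

Fetch = Callable[[Dict[str, str]], dict]


def dataset_query_params(facets: Dict[str, str]) -> Dict[str, str]:
    """Step 1: facets are only used for the Dataset query."""
    return {"type": "Dataset", "format": "application/solr+json", **facets}


def file_query_params(dataset_id: str) -> Dict[str, str]:
    """Step 2: the File query carries only the dataset id, no facets."""
    return {"type": "File", "format": "application/solr+json", "dataset_id": dataset_id}


def nested_file_search(facets: Dict[str, str], fetch: Fetch) -> Dict[str, List[dict]]:
    """Return the file records for every dataset that matches the facets."""
    datasets = fetch(dataset_query_params(facets))["response"]["docs"]
    files: Dict[str, List[dict]] = {}
    for doc in datasets:
        file_docs = fetch(file_query_params(doc["id"]))["response"]["docs"]
        files[doc["id"]] = file_docs
    return files


def fetch_with_requests(params: Dict[str, str]) -> dict:
    """Plain HTTP fetch; imported lazily so the helpers above work without requests."""
    import requests

    resp = requests.get(SEARCH_URL, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()
```

Usage would be something like nested_file_search({"variable_id": "tas", "source_id": "CESM2"}, fetch_with_requests). Passing the fetch function in makes it easy to swap in an async or mocked client.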
This is honestly pretty damn frustrating since nothing about this is mentioned in the API docs as far as I can tell. In fact they state that the 'type' input defines which kind of 'record' (File or Dataset) you will get back, and then show examples of faceted search here and say this:
All of this led me to believe that when I specify the identical set of facets and switch the 'type', I would get the matching set of files and iids depending on the value I provide. I guess I was wrong 😩.
The most disturbing thing is that some entries clearly work as I thought (otherwise I would have never gotten any results)...
Well at least I have a clue how to progress on this for now. Big thanks to @nocollier for all the work on intake-esgf. I would be curious where you learned that these 'nested' requests are needed to get all the data (I might just have missed something important).
I am fairly confident that with this knowledge I would be able to refactor large parts of pangeo-forge-esgf.
It might, however, be more practical to add a dependency on intake-esgf, even though the async requests might still be a bit faster.