ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0

Dataset Count is low #147

Closed mwengren closed 7 years ago

mwengren commented 7 years ago

Dataset count seems low today. Are there any known issues on the CKAN front that might be the cause? The Catalog currently shows ~4200 datasets vs. ~7000 in the Registry.

Seems to be across the board (not necessarily one particular provider). The obvious culprit would be Solr, which seems to have periodic issues. Any info in log files?

Have you ever tried looking at the Solr webapp behind the firewall for troubleshooting (data.ioos.us:8080/solr/#/~cores/ckan-schema-2.0, or something like that)?

mwengren commented 7 years ago

Hey @benjwadams @lukecampbell, any idea what might be happening here? A few days of rest/reharvesting didn't seem to help CKAN: the current dataset count discrepancy is 7046 (Registry) vs. 4259 (Catalog).

We have a DMAC review with NANOOS next Monday, and I'd like to have a full picture of NANOOS' datasets prior to that if possible. Right now the Catalog shows only 36, and it should be more like 70.
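For anyone wanting to track these numbers per organization, here is a rough sketch of how the counts could be pulled from CKAN's action API (`package_search` with `rows=0` returns only the total). The base URL `https://catalog.example.org` and the organization slug `nanoos` are placeholders, not confirmed values for this deployment, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(base_url, organization):
    """Build a package_search URL that asks only for the dataset total."""
    params = urlencode({"fq": "organization:" + organization, "rows": 0})
    return base_url.rstrip("/") + "/api/3/action/package_search?" + params


def parse_count(payload):
    """Extract result.count from a CKAN action API JSON response."""
    body = json.loads(payload)
    if not body.get("success"):
        raise RuntimeError("CKAN reported a failed request")
    return body["result"]["count"]


def fetch_count(base_url, organization):
    """Query a live CKAN instance (needs network access)."""
    with urlopen(count_url(base_url, organization)) as resp:
        return parse_count(resp.read().decode("utf-8"))


def shortfall(expected, actual):
    """How many datasets the Catalog is missing relative to the Registry."""
    return expected - actual


# Overall numbers quoted in this thread: Registry vs. Catalog.
print(shortfall(7046, 4259))
# Placeholder host and slug; substitute the real catalog and organization.
print(count_url("https://catalog.example.org", "nanoos"))
```

Calling `fetch_count` for each organization and comparing against the Registry totals would show whether the drop really is across the board or concentrated in a few providers.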

lukecampbell commented 7 years ago

I suspect that this issue is related to something I posted earlier: https://github.com/ioos/catalog-harvesting/pull/46

Filenames are only unique within their first 43 characters. So if the first 43 characters of two dataset names are identical, the second download overwrites the first one's file.

For example, in the NANOOS WAF:

http://data.nanoos.org/metadata/ioos/52nsos/

You'll spot SEVERAL links that are not uniquely named. Those links won't get downloaded properly.

It would be better to show them as errors in the registry to indicate what happened. I'll take a look this afternoon to see if I can find some sort of fix. I was toying around with the idea of naming every downloaded file with a UUID of some sort.
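To make the collision concrete, here is a minimal sketch of the problem and two possible naming schemes. It is not the catalog-harvesting code: the function names and the record names are invented for illustration, and only the 43-character limit comes from the description above. A hash of the full source URL is shown alongside the UUID idea because it gives the same filename on every harvest.

```python
import hashlib
import uuid

PREFIX_LEN = 43  # truncation length described above


def truncated_name(dataset_name):
    """Hypothetical model of the old behavior: keep the first 43 characters."""
    return dataset_name[:PREFIX_LEN] + ".xml"


def hashed_name(source_url):
    """Stable filename derived from the full source URL (SHA-1 hex digest)."""
    return hashlib.sha1(source_url.encode("utf-8")).hexdigest() + ".xml"


def uuid_name():
    """Random UUID filename: unique, but different on every harvest."""
    return uuid.uuid4().hex + ".xml"


def count_written_files(names, namer):
    """Number of distinct files left on disk after writing every record."""
    return len({namer(name) for name in names})


# Made-up record names that share a long common prefix (not real WAF entries).
records = [
    "urn_ioos_station_example_provider_long_prefix_sensor_%02d.xml" % i
    for i in range(5)
]

print(count_written_files(records, truncated_name))  # all five collapse together
print(count_written_files(records, hashed_name))     # one file per record
```

With truncation, the five records end up in a single file, which matches the symptom of an organization showing far fewer datasets than its WAF lists.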

mwengren commented 7 years ago

Is this something that was added to CKAN recently via merge or otherwise? At one point all of the NANOOS datasets were harvested properly by the Catalog, so something must have changed if this is the real cause.

From a quick glance, it seems that dataset counts are down across the board (NDBC has only 441, CO-OPS 20); both should have far more, 1000+. My suspicion is Solr, based on other recent issues, but if this filename limit is new to CKAN, or was somehow disabled before when all these organizations harvested properly, I suppose that could be it.

lukecampbell commented 7 years ago

NANOOS now has 80+ datasets on CKAN after the change. The other organizations should follow suit. I'll kick off a reharvest now.

lukecampbell commented 7 years ago

@mwengren is this still an issue?

mwengren commented 7 years ago

No, but I think the opposite may be a problem now; see the new comments on ioos/catalog-ckan#148. I think there are Solr index issues with the reharvest of all the hashed XML source records in the Registry WAFs. Let's move the discussion to that issue; closing this one.