ioos / catalog

IOOS Catalog general repo for documentation and issues
https://ioos.github.io/catalog/
MIT License

GCOOS WAF & Harvesting #75

Closed: mwengren closed this issue 4 years ago

mwengren commented 4 years ago

@fgayanilo We're having some harvesting problems with GCOOS' WAF(s).

I think the primary issue is with this one: http://gcoos5.geos.tamu.edu:6060/erddap/metadata/iso19115/xml/, which the Harvest Registry now says has ~6,000 records.

Second, I can't connect to this one from my location, which means the harvesting scripts likely can't either: http://gcoos4.tamu.edu:8080/erddap/metadata/iso19115/xml/.

Can you see about setting up port forwarding to a web server, and also see what's up with the 6000 records?

GCOOS now has the most records in the Catalog by far: https://data.ioos.us/dataset?_gcmd_keywords_limit=0&organization=gcoos.
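For reference, one way to verify what a WAF actually exposes is to fetch the directory listing and count the `.xml` links it advertises. A rough stdlib-only sketch (the `XMLLinkCounter` class and the sample listing are illustrative, not part of the Registry's harvesting code):

```python
from html.parser import HTMLParser

class XMLLinkCounter(HTMLParser):
    """Count <a href="...xml"> links in a WAF directory listing."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.lower().endswith(".xml"):
                self.count += 1

def count_xml_links(listing_html: str) -> int:
    """Return the number of XML record links in a WAF index page."""
    parser = XMLLinkCounter()
    parser.feed(listing_html)
    return parser.count

if __name__ == "__main__":
    # In practice the listing would be fetched from the WAF itself, e.g. with
    # urllib.request.urlopen("http://gcoos5.geos.tamu.edu:6060/erddap/metadata/iso19115/xml/")
    sample = '<a href="a.xml">a</a> <a href="b.XML">b</a> <a href="index.html">idx</a>'
    print(count_xml_links(sample))  # 2
```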

fgayanilo commented 4 years ago

There was a security breach on gcoos5, and we had to re-initialize the server and rebuild it. We have a complete backup of the data and should be back online very soon (working on it as I write this). Sorry for the inconvenience.

leilabbb commented 4 years ago

The ERDDAP server for GCOOS is still down; we are working on getting it back up and running.

mwengren commented 4 years ago

@fgayanilo @leilabbb We're still seeing ~6000 records in the IOOS Harvest Registry for this GCOOS WAF.

Is there an issue with the code that's generating these files? I think this is going to present a problem for the IOOS Harvest Registry (https://registry.ioos.us/) to read all of these files. I can see that there's a harvest job running at present that isn't able to complete.

Can you re-investigate those processes? Are there actually 6000 XML files that are supposed to be getting harvested by the IOOS Registry/Catalog?

GCOOS' current dataset count in the Catalog is ~800; those records are probably coming from the other two sources:

fgayanilo commented 4 years ago

@leilabbb is still working on getting our primary ERDDAP server for oceanographic and atmospheric data up and running. It should have about 5K datasets in it. The ISO WAF contains 475 records, and the gcoos4 ERDDAP (for biological records) has 304.

mwengren commented 4 years ago

Ok, please keep us updated and keep in mind whether this WAF URL registered in the Harvest Registry is going to be the right approach to use:

http://gcoos5.geos.tamu.edu:6060/erddap/metadata/iso19115/xml/

I can't even get a directory listing from it when I try to browse there, so it's unlikely the Registry will be able to harvest those records. You may need to use a secondary, more powerful web server to serve those.
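As a quick check for this kind of problem, a reachability probe with a short timeout shows whether a WAF is likely to be harvestable at all. A minimal sketch (the helper name and timeout value are arbitrary choices, not part of the Registry):

```python
import urllib.error
import urllib.request

def waf_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the WAF answers a HEAD request with a 2xx within `timeout` seconds."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        # Covers connection refused, DNS failure, HTTP errors, and timeouts
        return False
```

A harvester could call this before queueing a job, e.g. `waf_reachable("http://gcoos5.geos.tamu.edu:6060/erddap/metadata/iso19115/xml/")`, and skip or flag sources that do not respond.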

fgayanilo commented 4 years ago

@mwengren the gcoos5 alternate server (http://erddap.gcoos.org:8080/erddap) is up and ready to serve, but it looks like registry.ioos.us is returning a 504 error.

mwengren commented 4 years ago

Ok, pinging @benjwadams on the Registry status.

benjwadams commented 4 years ago

The server was having issues and had to be restarted. It looks like we're now harvesting from the contents of the ERDDAP server. Closing.

mwengren commented 4 years ago

I want to keep this open for monitoring purposes. We're at 26,000 datasets in the Catalog now and counting. It looks like the sheer number of datasets in this ERDDAP is going to stretch the Catalog's infrastructure a bit. @fgayanilo what will be the total count of datasets you plan to serve in this ERDDAP instance? We're well above the average RA already.

I didn't look through them all of course, but I take it there's no feasible way to aggregate some of these? A lot look to be historical, at least on the first page.

fgayanilo commented 4 years ago

That number is big and wrong, @mwengren. There should only be about 7K files. We were waiting for the registry to come back so we could update. I just checked, and it seems to be working again. I changed the URL to the correct ERDDAP instance and initiated a reharvest (see above: erddap.gcoos.org/erddap). Our old ERDDAP returned to service, but it includes other junk that should not be there. We are still in the process of syncing the instances, but the new server should be correct.

mwengren commented 4 years ago

Ok, we may have to do a manual clear of that GCOOS WAF. Catalog harvesting seems to be 'stuck' at the moment. @fgayanilo and @benjwadams, can you work directly on harvesting the correct GCOOS WAF, clearing hung harvesting jobs, and manually clearing the GCOOS harvest if necessary?

mwengren commented 4 years ago

Data from the current GCOOS WAF CKAN harvest job (still running) is below. CKAN still shows ~13,000 datasets, so it may not be clearing out all the former records properly.

https://data.ioos.us/harvest/gcoos-waf

Id:       c6ca030e-e120-41c3-abfa-3fe84c1a8537
Created:  March 23, 2020, 2:01 AM (UTC-04:00)
Started:  March 23, 2020, 2:01 AM (UTC-04:00)
Finished:
Status:   Running
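The dataset count CKAN reports can also be checked directly through its standard `package_search` action, passing `rows=0` to fetch only the count. A sketch under the assumption that the GCOOS organization id in CKAN is `gcoos`, as the dataset URL earlier in the thread suggests:

```python
import json
import urllib.request

def extract_count(payload: dict) -> int:
    """Pull the total-match count out of a CKAN package_search response."""
    if not payload.get("success"):
        raise RuntimeError("CKAN action call failed")
    return payload["result"]["count"]

def gcoos_dataset_count(base_url: str = "https://data.ioos.us") -> int:
    """Ask the Catalog's CKAN how many datasets the gcoos organization has."""
    # rows=0 requests only the count, not the dataset records themselves
    url = f"{base_url}/api/3/action/package_search?fq=organization:gcoos&rows=0"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_count(json.load(resp))
```

Comparing this number against the expected ~7K from the WAF would show whether stale records are being cleared.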

fgayanilo commented 4 years ago

@benjwadams let me know what I can do from my end.

benjwadams commented 4 years ago

@fgayanilo This appears to have begun harvesting again. I'm not entirely sure what occurred, but it looks like the metadata is getting to the catalog now.

fgayanilo commented 4 years ago

@benjwadams that's great!

benjwadams commented 4 years ago

Going to close this out since the harvest appears to be working again.