ioos / catalog

IOOS Catalog general repo for documentation and issues
https://ioos.github.io/catalog/
MIT License

CSW endpoint seems down. Why? #24

Closed rsignell-usgs closed 7 years ago

rsignell-usgs commented 7 years ago

@mwengren or @lukecampbell , this morning neither @ocefpaf nor I can reach the IOOS CSW endpoint at https://data.ioos.us/csw

Can we please not only restore it but figure out why it went down?

I'm supposed to demo the IOOS CSW query at 10AM ET.

[screenshot attached: 2017-01-25_6-17-10]

mwengren commented 7 years ago

We're having some database issues that are affecting uptime currently. @lukecampbell has been working on it since yesterday. I'll let him update as of this morning but hopefully we can get it back online by 10.

mwengren commented 7 years ago

@rsignell-usgs CKAN/pycsw are restored, but there may be a need to do further troubleshooting later today until we can resolve the underlying issue. We'll hold off until your demo is complete though.

lukecampbell commented 7 years ago

I'm gonna use this ticket to track all discoveries and issues related to the db bloat.

The short story is that the DB is horribly bloated. CKAN leverages an Object Relational Mapper (ORM) that doesn't take advantage of optimized queries, so each time we harvest and create a revision record of the original metadata documents, the database queries get a little slower.

After enough rounds we reach where we are now: it can't scale very well past this point. CKAN even admits in the code that certain actions are not performed efficiently. What I'm struggling with now is how to clean up the database to make the catalog run again, and how to ensure it stays clean.
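A toy sketch of the growth pattern described above, using a hypothetical one-table schema (not CKAN's actual tables): every harvest round appends one revision row per dataset, so the revision table grows linearly even though only the latest record per dataset is ever needed.

```python
# Hypothetical schema for illustration only -- not CKAN's real tables.
# Each harvest round appends a revision row per dataset, so the table
# grows linearly and scans over it get slower every round.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package_revision (package_id TEXT, round INTEGER)")

datasets = ["ds-a", "ds-b", "ds-c"]
for harvest_round in range(100):  # 100 harvest rounds
    conn.executemany(
        "INSERT INTO package_revision VALUES (?, ?)",
        [(d, harvest_round) for d in datasets],
    )

# 3 datasets x 100 rounds = 300 revision rows, versus the 3 "current"
# records a catalog actually serves.
rows = conn.execute("SELECT COUNT(*) FROM package_revision").fetchone()[0]
print(rows)  # 300
```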

The obvious option was to purge the older revisions and deleted datasets. These are the datasets that providers remove from their WAFs and that ultimately get removed in CKAN, but they are only marked as removed, not actually deleted until the db is purged.
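The mark-deleted vs. purge distinction can be illustrated with a toy schema (again hypothetical, not CKAN's real tables): removing a dataset only flips a state flag, and the row persists until a purge issues a real DELETE.

```python
# Toy schema for illustration -- not CKAN's actual tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (id TEXT, state TEXT)")
conn.executemany("INSERT INTO package VALUES (?, ?)",
                 [("a", "active"), ("b", "active")])

# Provider drops dataset "b" from their WAF -> CKAN marks it deleted,
# but the row is still in the table.
conn.execute("UPDATE package SET state = 'deleted' WHERE id = 'b'")
remaining = conn.execute("SELECT COUNT(*) FROM package").fetchone()[0]  # still 2

# A purge actually removes rows marked deleted.
conn.execute("DELETE FROM package WHERE state = 'deleted'")
after_purge = conn.execute("SELECT COUNT(*) FROM package").fetchone()[0]  # now 1
```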

Unfortunately, purging the db last night brought the entire machine down. It quickly ran out of resources and became unreachable.

@benjwadams was looking into talking with the CKAN folks on IRC to see if anyone had suggestions, but I haven't heard anything.

At this point I'm thinking about restoring the database on another instance, wiping the packages with the nuclear option, re-harvesting everything, and then moving that db to production.

lukecampbell commented 7 years ago

[screenshot attached: 2017-01-25 10:00 AM]

mwengren commented 7 years ago

@lukecampbell I don't think we have any need for retaining the old revisions of the various objects. Is doing a clear and reharvest of each harvest source more efficient than the purge approach you were trying initially? Or is it the same operation on the db?

Ideally, I think we'd configure CKAN to not write to the _revision tables and just keep the latest and greatest on each harvest, but that may be complicated; not sure.

lukecampbell commented 7 years ago

As best I can tell, the clear harvest doesn't delete; only purge actually deletes records.

lukecampbell commented 7 years ago

I got some suggestions on IRC. We're not the first to deal with this:

https://github.com/ckan/ckan/wiki/Performance-tips-for-large-imports#tags-and-extras

ebridger commented 7 years ago

@mwengren, @lukecampbell, @lance-axiom. I've decided to move our discussion of registering the new realtime-only NERACOOS WAF here. The new WAF is registered and made it to the catalog, and I'm seeing it in the CKAN API JSON results. I'll run new Sensor Map tests tomorrow.

lance-axiom commented 7 years ago

@ebridger I have not removed the non-realtime SOS endpoints from the Sensor Map yet. What URL should I use to remove those endpoints? I have deleted the old stations created from the last time you unblocked our IP address. Once you unblock us again they will be added again.

ebridger commented 7 years ago

@lance-axiom O.K. I set this up based on your original suggestion. Here's my Python regex for the URLs to exclude, after a check for www.neracoos.org:

if re.search(r"UMO\/DSG\/", this_url, flags=re.IGNORECASE): continue

Here's a URL to include:

http://www.neracoos.org/thredds/sos/UMO/Realtime/SOS/B01/Met/Realtime.ncml?service=SOS&version=1.0.0&request=GetCapabilities
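Packaged as a small function, the exclusion check above might look like this (the function name is illustrative; the pattern is the one from the comment):

```python
import re

def keep_endpoint(url: str) -> bool:
    """Return False for the old aggregated UMO/DSG/ endpoints,
    True for everything else (mirrors the exclusion check above)."""
    if re.search(r"UMO\/DSG\/", url, flags=re.IGNORECASE):
        return False
    return True

# The old aggregated endpoint is filtered out...
dsg_url = ("http://www.neracoos.org/thredds/sos/UMO/DSG/SOS/N01/"
           "SUNA100m/HistoricRealtime/Agg.ncml"
           "?service=SOS&version=1.0.0&request=GetCapabilities")
# ...while the realtime endpoint is kept.
realtime_url = ("http://www.neracoos.org/thredds/sos/UMO/Realtime/SOS/"
                "B01/Met/Realtime.ncml"
                "?service=SOS&version=1.0.0&request=GetCapabilities")
print(keep_endpoint(dsg_url), keep_endpoint(realtime_url))  # False True
```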

Let me know when you're ready and I'll start the test (tomorrow. ;-))

rsignell-usgs commented 7 years ago

@lukecampbell , good sleuthing! Did the geodata.gov guys confirm that their approach worked? Fingers crossed...

mwengren commented 7 years ago

@ebridger I only looked at a couple but it looks like ncSOS is not happy about something with your new endpoints: http://www.neracoos.org/thredds/sos/UMO/DSG/SOS/N01/SUNA100m/HistoricRealtime/Agg.ncml?service=SOS&version=1.0.0&request=GetCapabilities.

I don't think this will work from the Sensor Map side to use these new endpoints. Can you investigate?

Hmm, actually now those same endpoints look OK. Maybe you are making changes? I did see an SOS error message previously.

ebridger commented 7 years ago

@mwengren. A couple of points. 1) The URL above is from the "old" UMaine WAF, i.e. HistoricRealtime/Agg. These are the large aggregated datasets which Lance is planning to filter out of the catalog for the Sensor Map harvest. 2) Not sure if it's a general THREDDS or ncSOS issue, but these SOS GetCapabilities requests often fail the first time they're requested after a THREDDS restart, then start working. We restart THREDDS daily.

lance-axiom commented 7 years ago

@ebridger the sensor map is ready to test the NERACOOS SOS endpoints. The only endpoints that are going to be used are the ones that start with "http://www.neracoos.org/thredds/sos/UMO/Realtime/SOS". The other ones have been filtered out. Tell me when you have unblocked our IP address and I can kick the process off.

lukecampbell commented 7 years ago

@rsignell-usgs can you confirm that this is functional again?

rsignell-usgs commented 7 years ago

I ran this notebook https://gist.github.com/anonymous/6a087c3511b145a490e7bea8bc461876 and the site is working, but service URLs seem to be missing for some datasets that should have them.

See cell [12] for example.

lukecampbell commented 7 years ago

Thanks for the notebook for reproducibility! I'll take a look today hopefully.

lukecampbell commented 7 years ago

I'm gonna close this and open a new issue for the missing service URLs.