ioos / registry

Getting data services registered in the IOOS Service Registry
http://ioos.github.io/registry/

Fix lots of bad unidata ucar links #55

Open rsignell-usgs opened 9 years ago

rsignell-usgs commented 9 years ago

I wrote a little script to query the NGDC CSW for all the OPeNDAP endpoints and found 2785 links but 1100 of them either timed out after 2 seconds or gave 404 errors.

The bad ones are here: https://github.com/rsignell-usgs/system-test/blob/master/Theme_1_Baseline/bad.csv
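A minimal sketch of this kind of link check (hypothetical code, not the actual script; function names are illustrative): open each endpoint with a short timeout and record whether it answers, returns an HTTP error like 404, or times out.

```python
import socket
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def classify_error(exc):
    """Map a urlopen failure to a short status label."""
    if isinstance(exc, HTTPError):
        return str(exc.code)                  # e.g. "404"
    if isinstance(exc, (URLError, socket.timeout)):
        return "timeout"
    return "error"

def check_endpoint(url, timeout=2):
    """Return 'ok' if the URL answers within `timeout` seconds,
    otherwise a label describing the failure."""
    try:
        with urlopen(url, timeout=timeout):
            return "ok"
    except Exception as exc:
        return classify_error(exc)
```

Endpoints labeled anything other than `ok` would end up in a file like bad.csv.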

About 400 of these bad links are 404 errors from Unidata's motherlode server (thredds.ucar.edu/thredds).

I think this is because Kyle ran his crawler to create the ISO records before Unidata updated the TDS to the latest version, which changed the URLs to datasets containing GRIB files.

@kwilcox , can you please run your crawler again to update the WAF where these ISO records are harvested from?

@amilan17, when records disappear from a harvested WAF (like hopefully these bad ones will soon), they disappear from the catalog as well, right?

kwilcox commented 9 years ago

@rsignell-usgs http://thredds.ucar.edu/thredds is harvested daily (it wasn't a one time thing). Judging by the modified dates in the WAF (http://thredds.axiomalaska.com/iso/unidata/) it doesn't appear to be a harvesting problem. I checked a few of the DAP endpoints in the ISO files and they are correct.

Could it be that EMMA has not harvested the WAF in a while? Did we ever put together a timeline of when things happen in the registry system?

rsignell-usgs commented 9 years ago

@kwilcox, is this the WAF that the Unidata ISO records are being harvested from? https://www.ngdc.noaa.gov/docucomp/collectionSource/show/7981042?layout=fluid

amilan17 commented 9 years ago

Not exactly; you need to request a cleanup to remove the older, out-of-date records. And THEN all copies of records downstream will follow suit.

amilan17 commented 9 years ago

EMMA is harvesting nightly and this is the list of broken/slow links for your review: http://www.ngdc.noaa.gov/docucomp/page?xml=NOAA/IOOS/Unidata/iso/reports/BadOLRReport.xml&view=badOlrReport&title=NOAA/IOOS/Unidata%20Broken%20URLs

The information about the registry steps is on the GitHub site: https://github.com/ioos/registry

amilan17 commented 9 years ago

This is the WAF that Unidata records are harvested from: https://www.ngdc.noaa.gov/docucomp/collectionSource/list?recordSetId=8824731&componentId=&serviceType=&serviceStatus=APPROVED&serviceUrl=&search=List+Collection+Sources

rsignell-usgs commented 9 years ago

@amilan17

  1. I believe all those broken link datasets should be removed.
  2. I thought we agreed that for harvest from WAF, only datasets harvested from the latest WAF would be present in the database. This allows regions to control what datasets they want, introducing new datasets as they arrive, and retiring datasets that are no longer appropriate.

@kwilcox, do you agree with 1 and 2?

kwilcox commented 9 years ago

:+1:

rsignell-usgs commented 9 years ago

Didn't you mean :+1: :+1:?

rsignell-usgs commented 9 years ago

@amilan17, okay, please clean all that crap out!

amilan17 commented 9 years ago
  1. If they are removed from the source WAF then they will be removed from the Unidata WAF when I set up the clean out - this will probably become visible on Mon or Tues.
  2. ideally yes. However, our current implementation does not have the ability to tailor this behavior based on the collection source type. For instance, a WAF in EMMA can be populated by metadata from multiple source types - THREDDS, WAF and an SOS. We used to routinely clean out every WAF in EMMA, but it often resulted in missing records for a variety of reasons. So we decided to run clean up upon request.


ebridger commented 9 years ago

What does slow mean?

Some of the links in the NERACOOS bad list do in fact work. We did just have a buoy redeployment which requires some THREDDS reconfiguring, so perhaps they were related to that.

Eric


rsignell-usgs commented 9 years ago

@amilan17 , can you please clean up this WAF directory also:

https://www.ngdc.noaa.gov/docucomp/collectionSource/list?recordSetId=8824731&componentId=&serviceType=&serviceStatus=APPROVED&serviceUrl=&search=List+Collection+Sources

In this notebook http://nbviewer.ipython.org/3a340219cc62f5919059 Cell [59] still shows over 500 bad URLs from ucar, and @kwilcox updates the WAF every night, so clearing this out should remove these hundreds of errors!

amilan17 commented 9 years ago

Ok. Please check this WAF tomorrow: http://www.ngdc.noaa.gov/metadata/published/NOAA/IOOS/Unidata/iso/

rsignell-usgs commented 9 years ago

@amilan17, looking good on the WAF. Love going from 450 to below 100 bad in one easy step. We have to wait one more day for the entries to be removed from the geoportal, right?

[screenshot: plot of ISO record counts over time]

amilan17 commented 9 years ago

@rsignell-usgs Unidata metadata is now updated in Geoportal. Just to clarify - the black line in the plot above does not represent "bad" metadata, just an overall count of valid ISO. However, by removing 300 (or so) records from the Unidata WAF - the count of records with bad URLs has gone down.

amilan17 commented 9 years ago

This is the list of remaining broken URLs. http://www.ngdc.noaa.gov/docucomp/page?xml=NOAA/IOOS/Unidata/iso/reports/BadOLRReport.xml&view=badOlrReport&title=NOAA/IOOS/Unidata%20Broken%20URLs

rsignell-usgs commented 9 years ago

@amilan17, I see them removed from Geoportal now. Awesome. Did you manually update Geoportal? If you did, would it otherwise have updated tonight? Just trying to understand the "normal" schedule.

And one more question. I understand that you say the graph above shows valid ISO records, but why did it increase linearly with time and then plateau? Kyle said that his ISO WAF only contained 78 records.

rsignell-usgs commented 9 years ago

@amilan17, also my test shows only 3 broken URLs, not 81. How are you testing the URLs? The OPeNDAP Data URL needs a ".html" on the end to be opened in a browser.
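A sketch of what a checker would need to do here (hypothetical helper name): add the ".html" suffix to the Data URL before fetching it, since that requests the server's browsable form.

```python
def browser_form(dap_url):
    """An OPeNDAP Data URL is not directly fetchable in a browser;
    appending '.html' requests the server's HTML access form instead."""
    return dap_url if dap_url.endswith(".html") else dap_url + ".html"

# Already-suffixed URLs are left alone, so the helper is safe to
# apply unconditionally before checking a link.
```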

rsignell-usgs commented 9 years ago

@amilan17, the fact that there are still 3 bad URLs might be a clue to why the number of ISO records was growing. Perhaps there are 3 URLs on the motherlode server that change each day, producing differently named ISO records; since the previous contents of the WAF are not cleared out, this results in a slow accumulation of valid ISO records with invalid DAP URLs.

If this is the case, could a "clear" or "notclear" setting be associated with each WAF, and set to "clear" for this WAF?
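Under that hypothesis, the stale records would be exactly the set difference between the WAF listing and the latest crawl. A minimal sketch (hypothetical filenames):

```python
def stale_records(waf_listing, latest_crawl):
    """ISO files present in the WAF but absent from the latest THREDDS
    crawl; without a clear-out these accumulate as valid ISO records
    whose DAP URLs no longer resolve."""
    return sorted(set(waf_listing) - set(latest_crawl))

# Hypothetical example: one GRIB dataset was renamed between crawls.
waf = ["a.iso.xml", "grib_0925.iso.xml", "grib_0926.iso.xml"]
crawl = ["a.iso.xml", "grib_0926.iso.xml"]
print(stale_records(waf, crawl))  # → ['grib_0925.iso.xml']
```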

amilan17 commented 9 years ago

@rsignell-usgs Yes, I performed a manual update in Geoportal. I've tried to set up syncing in Geoportal as early as possible during the day, but sometimes I set up syncing later in the day and it will continue to sync at that same time every day. Unfortunately, I can't configure or even determine what time that is from the admin interface (at least in our version of the ESRI Geoportal).

Interesting question regarding the count of records increasing. During the first harvest there were 90 records and now there are 72 records. I think that, because the default behavior is NOT to clean out records nightly, anytime records are removed, renamed, or added, a copy of the older records remains in EMMA. I can set the harvest setting for Unidata to always clear out all records before harvest and see if that results in more stable counts. I'm currently comfortable doing this for the Unidata WAF in EMMA, because it is ONLY fed by another WAF, not a THREDDS server. Let's just hope that we don't have the opposite unintended effect of all records disappearing!

amilan17 commented 9 years ago

And you are correct that our checker is expecting a '.html' at the end of the URLs. When I run LinkChecker in Firefox, the URLs expecting a .html are flagged yellow and the truly broken URLs are flagged red. I wish our checker were more sophisticated!

rsignell-usgs commented 9 years ago

Excellent. Let's clean out the records nightly for this WAF. Of course, the WAF you are harvesting in this case is itself created daily by crawling a THREDDS server, but at least if that process fails, you can still harvest the existing ISO records in the WAF.