ioos / catalog

IOOS Catalog general repo for documentation and issues
https://ioos.github.io/catalog/
MIT License

Cannot find ERDDAP:tabledap endpoints when querying the catalog #62

Closed ocefpaf closed 6 years ago

ocefpaf commented 6 years ago

@mwengren as OHW18 approaches, @emiliom, @rsignell-usgs, and I are putting together some notebooks to demonstrate how the IOOS catalog works. However, we are concerned that we cannot find ERDDAP:tabledap server endpoints with a CSW search.

In the example below we search for sea_surface_temperature using the SECOORA ERDDAP server just to ensure that data exists, then we try to find it using the catalog:

http://nbviewer.jupyter.org/gist/ocefpaf/c5e8e5ff79ce1c419575185d756479bb

Is it possible to find the tabledap endpoint in that fashion? We did find griddap BTW.

PS: on an unrelated issue we continue to find the ucsd_cdip dataset no matter what bounding box we use.

mwengren commented 6 years ago

@brianmckenna @benjwadams Can you take a look at the pycsw database and see how the ERDDAP-tabledap resources are represented and might be queried?

@ocefpaf At one point I needed to query for service types to run compliance checks against, and I wrote this utility: https://github.com/mwengren/catalog-query. It uses the CKAN API to filter by a variety of criteria you can specify in the command params. Anyway, the point of mentioning it is that the ERDDAP-tabledap endpoints are clearly represented in both the CKAN website and the API. Here's an example set of attributes for a 'Resource' of type ERDDAP-tabledap that my script dumps out:

{
    "cache_last_updated": null,
    "cache_url": null,
    "created": "2018-05-03T08:27:17.834363",
    "description": "ERDDAP's tabledap service (a flavor of OPeNDAP) for tabular (sequence) data. Add different extensions (e.g., .html, .graph, .das, .dds) to the base URL for different purposes.",
    "format": "ERDDAP-TableDAP",
    "hash": "",
    "id": "7fca80ef-a93c-43c3-9e48-bd16e62ac6c2",
    "last_modified": null,
    "mimetype": null,
    "mimetype_inner": null,
    "name": "ERDDAP-tabledap",
    "package_id": "6887f9d2-c8b6-4335-96f3-23b7340079bf",
    "position": 4,
    "resource_locator_function": "download",
    "resource_locator_protocol": "ERDDAP:tabledap",
    "resource_type": null,
    "revision_id": "2d1e4348-f21e-4d78-a99d-ef8bb0315768",
    "size": null,
    "state": "active",
    "url": "http://erddap.secoora.org/erddap/tabledap/gov_usgs_waterdata_023060013",
    "url_type": null
}

So I filtered them out using these criteria in the command line to look for ERDDAP-tabledap in the 'name' attribute:

catalog-query -c https://data.ioos.us/api/3 -a resource_cc_check -q=name:SECOORA,resource_name:ERDDAP-tabledap

Don't run that as it'll swamp the SECOORA ERDDAP with compliance checker requests, but I just mention it to show at least at the CKAN database level the tabledap endpoints are identifiable. @benjwadams or @brianmckenna will have to look at the CS-W database to see what happens at that stage.
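
For readers who prefer Python to the command line, the same idea can be sketched offline. This is only an illustration, not how catalog-query works internally: it assumes a package dict shaped like a CKAN `package_show` result (a `resources` list of dicts), the helper name `resources_by_protocol` is hypothetical, and the second resource is made up for contrast. It makes no requests to any server.

```python
def resources_by_protocol(package, protocol="ERDDAP:tabledap"):
    """Return the URLs of a CKAN package's resources that advertise `protocol`."""
    return [
        res["url"]
        for res in package.get("resources", [])
        if res.get("resource_locator_protocol") == protocol
    ]


# Trimmed-down package dict. The first resource reuses attributes from the
# dump above; the second one is a made-up entry for contrast.
package = {
    "id": "6887f9d2-c8b6-4335-96f3-23b7340079bf",
    "resources": [
        {
            "name": "ERDDAP-tabledap",
            "format": "ERDDAP-TableDAP",
            "resource_locator_protocol": "ERDDAP:tabledap",
            "url": "http://erddap.secoora.org/erddap/tabledap/gov_usgs_waterdata_023060013",
        },
        {
            "name": "hypothetical-other-resource",
            "resource_locator_protocol": "WWW:LINK",
            "url": "http://example.org/other",
        },
    ],
}

print(resources_by_protocol(package))
```

Filtering on `resource_locator_protocol` rather than on `name` is a design choice here: the protocol value is the one that mirrors the `gmd:protocol` label in the source metadata.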

It may be that pycsw is not retaining the ERDDAP resources in its database when it's copied from the CKAN database... or another possibility is that our pycsw db update process is broken. I think SECOORA added the ERDDAP datasets to the Catalog pretty recently.

ocefpaf commented 6 years ago

Don't run that as it'll swamp the SECOORA ERDDAP with compliance checker requests, but I just mention it to show at least at the CKAN database level the tabledap endpoints are identifiable.

Good to know. Thanks!

It may be that pycsw is not retaining the ERDDAP resources in its database when it's copied from the CKAN database... or another possibility is that our pycsw db update process is broken. I think SECOORA added the ERDDAP datasets to the Catalog pretty recently.

It is not only SECOORA BTW, we cannot find ERDDAP:tabledap in other regions that are known to have ERDDAP servers set up as well. (But the reason for not finding them may be different.)

rsignell-usgs commented 6 years ago

We really need to get this figured out. We've been promoting ERDDAP for sensor data and yet we can't find any ERDDAP sensor data via CSW!

Here's a really simple notebook that illustrates the problem: https://gist.github.com/rsignell-usgs/487d51feea270a50b5d0580c3e4de5b5

As you can see in cell [6], the ERDDAP-acquired metadata doesn't have the service endpoints set correctly as the scheme is strangely set to order or .html instead of ERDDAP:TableDap:

title:PRJC1 Long Beach Pier J, CA - 9410665
identifier:noaa_nos_co_ops_prjc1
modified:2018-07-25
[{'scheme': 'order',
  'url': 'http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1.html'},
 {'scheme': 'order',
  'url': 'http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1.graph'},
 {'scheme': '.html',
  'url': 'http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1'},
 {'scheme': '.html',
  'url': 'http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1'}]

The responses should look more like the ones for this THREDDS-acquired metadata:

title:NECOFS (FVCOM) - Hampton - Latest Forecast
identifier:hampton_nocache
modified:2018-05-02
[{'scheme': 'WWW:LINK',
  'url': 'http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc.html'},
 {'scheme': 'WWW:LINK',
  'url': 'http://www.ncdc.noaa.gov/oa/wct/wct-jnlp-beta.php?singlefile=http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc'},
 {'scheme': 'OPeNDAP:OPeNDAP',
  'url': 'http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc'},
 {'scheme': 'OGC:WMS',
  'url': 'http://www.smast.umassd.edu:8080/ncWMS2/wms?service=WMS&version=1.3.0&request=GetCapabilities'},
 {'scheme': 'file',
  'url': 'http://www.smast.umassd.edu:8080/thredds/fileServer/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc'}]
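
The practical consequence is that a client-side filter on the scheme finds nothing for the ERDDAP record. A minimal sketch, assuming the references arrive as a list of `{'scheme': ..., 'url': ...}` dicts exactly as printed above; the helper name `find_by_scheme` is hypothetical and the THREDDS list is abbreviated:

```python
def find_by_scheme(references, scheme):
    """Return the URLs whose 'scheme' equals `scheme`, ignoring case and padding."""
    wanted = scheme.lower()
    return [
        ref["url"]
        for ref in references
        if (ref.get("scheme") or "").strip().lower() == wanted
    ]


# References copied from the ERDDAP-acquired record above.
erddap_refs = [
    {"scheme": "order",
     "url": "http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1.html"},
    {"scheme": "order",
     "url": "http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1.graph"},
    {"scheme": ".html",
     "url": "http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1"},
    {"scheme": ".html",
     "url": "http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1"},
]

# Two of the references from the THREDDS-acquired record above.
thredds_refs = [
    {"scheme": "OPeNDAP:OPeNDAP",
     "url": "http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc"},
    {"scheme": "OGC:WMS",
     "url": "http://www.smast.umassd.edu:8080/ncWMS2/wms?service=WMS&version=1.3.0&request=GetCapabilities"},
]

print(find_by_scheme(erddap_refs, "ERDDAP:tabledap"))   # [] because no reference carries that scheme
print(find_by_scheme(thredds_refs, "OPeNDAP:OPeNDAP"))  # one URL
```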

benjwadams commented 6 years ago

I'm actively looking at this to see whether or not the issue is arising at the CKAN level.

rsignell-usgs commented 6 years ago

@benjwadams , excellent, I'm glad you are on the job!

Hopefully the fix is straightforward, and will enable us to generate some nice catalog/erddap notebook examples for Ocean Hack Week!

mwengren commented 6 years ago

From what I can tell, CKAN is parsing out the ERDDAP-tabledap resources properly. My catalog-query script can parse them via the CKAN API, and you can filter them easily on the website:

https://data.ioos.us/organization/09cf7d59-3604-44f7-9c2c-5909d9705e40?res_format=ERDDAP-TableDAP

And in a source record example from SECOORA there's clearly a section with the ERDDAP:tabledap label in the metadata:

<gmd:protocol>
        <gco:CharacterString>ERDDAP:tabledap</gco:CharacterString>
</gmd:protocol>

Don't know if we can say the same about pycsw, based on Rich's results. How does pycsw generate those scheme tags? I'd check the pycsw 'records' table as well @benjwadams. My 2c is that @lukecampbell added some CKAN code at some point to parse the ERDDAP:tabledap and ERDDAP:griddap labels and maybe for whatever reason pycsw isn't doing the same.

If we can't figure it out, maybe we need to ping Tom or Angelos.
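
To check a source record by hand, the protocol label can be read with Python's standard library alone. The sketch below is an assumption-laden illustration: the sample fragment is assembled for the example (the protocol element is the one shown above, the URL is borrowed from the CKAN dump earlier in the thread, and the namespace declarations are added so the fragment parses on its own), and the function name `online_resources` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard ISO 19139 namespace URIs for the gmd and gco prefixes.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Fragment assembled for illustration, not a verbatim source record.
SAMPLE = """<gmd:CI_OnlineResource xmlns:gmd="http://www.isotc211.org/2005/gmd"
                       xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:linkage>
    <gmd:URL>http://erddap.secoora.org/erddap/tabledap/gov_usgs_waterdata_023060013</gmd:URL>
  </gmd:linkage>
  <gmd:protocol>
    <gco:CharacterString>ERDDAP:tabledap</gco:CharacterString>
  </gmd:protocol>
</gmd:CI_OnlineResource>"""


def online_resources(xml_text):
    """Return (protocol, url) pairs for every gmd:CI_OnlineResource element."""
    root = ET.fromstring(xml_text)
    pairs = []
    # Element.iter() also visits the root, so a bare fragment works too.
    for res in root.iter("{http://www.isotc211.org/2005/gmd}CI_OnlineResource"):
        protocol = res.findtext("gmd:protocol/gco:CharacterString", default="", namespaces=NS)
        url = res.findtext("gmd:linkage/gmd:URL", default="", namespaces=NS)
        pairs.append((protocol.strip(), url.strip()))
    return pairs


print(online_resources(SAMPLE))
```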

benjwadams commented 6 years ago

@rsignell-usgs , @mwengren

This is related to how the PyCSW Python application parses out the links/references. In the default PyCSW mapping, PyCSW stores everything for each CSW record in a flat table. The info for the links is stored in the 'links' column (text type) of the 'records' table in my setup.

PyCSW will fetch the links here:

https://github.com/geopython/pycsw/blob/d0089732c5313f0aa376c57c34f47a9b7011d7d1/pycsw/ogc/csw/csw2.py#L1504-L1514

Roughly, this splits the references on the '^' character, and then further splits each reference on ',' characters. The scheme is taken as the third comma-separated field, and the URL is taken as the last field. The problem arises with ERDDAP links because their link descriptions contain embedded commas. Here is an example entry from one of

Data Subset Form,ERDDAP's version of the OPeNDAP .html web page for this dataset. Specify a subset of the dataset and download the data via OPeNDAP or in many different file types.,order,http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1.html
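
The splitting described above can be reproduced in a few lines of Python. This is a simplified sketch of the behaviour, not the PyCSW source: the function name `naive_links` is hypothetical, and the two link strings are the THREDDS OPeNDAP entry and the ERDDAP-tabledap entry as stored in the 'links' column for the two records discussed in this thread.

```python
def naive_links(links):
    """Split a 'links' value on '^', then on ',', taking field 3 as the scheme."""
    parsed = []
    for link in links.split("^"):
        fields = link.split(",")
        # Third field (index 2) is assumed to be the scheme, last field the URL.
        parsed.append({"scheme": fields[2], "url": fields[-1]})
    return parsed


THREDDS_LINK = (
    "OPeNDAP,THREDDS OPeNDAP,OPeNDAP:OPeNDAP,"
    "http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/"
    "NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc"
)
ERDDAP_LINK = (
    "ERDDAP-tabledap,ERDDAP's tabledap service (a flavor of OPeNDAP) for tabular "
    "(sequence) data. Add different extensions (e.g., .html, .graph, .das, .dds) "
    "to the base URL for different purposes.,ERDDAP:tabledap,"
    "http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1"
)

for entry in naive_links(THREDDS_LINK + "^" + ERDDAP_LINK):
    # The THREDDS scheme is 'OPeNDAP:OPeNDAP'; the ERDDAP one comes out as ' .html'.
    print(repr(entry["scheme"]), entry["url"])
```

The URL survives because it is taken from the end of the list, while the scheme is counted from the front and therefore lands inside the description as soon as the description contains a comma.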

I didn't particularly want to run through the debugger on the production server, so instead I queried the database with a SQL query that is fairly equivalent to what the aforementioned Python code does for the ERDDAP endpoint listed:

pycsw=# WITH scheme_table AS (SELECT regexp_split_to_table(links, E'\\^') scheme_section from records where title ilike 'PRJC1 Long Beach Pier J, CA - 9410665'),
    split_scheme AS (SELECT scheme_section, regexp_split_to_array(scheme_section, ',') split_arr from scheme_table)
    select scheme_section, split_arr[3] scheme, split_arr[array_length(split_arr, 1)] from split_scheme;
-[ RECORD 1 ]--+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
scheme_section | Data Subset Form,ERDDAP's version of the OPeNDAP .html web page for this dataset. Specify a subset of the dataset and download the data via OPeNDAP or in many different file types.,order,http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1.html
scheme         | order
split_arr      | http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1.html
-[ RECORD 2 ]--+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
scheme_section | Make-A-Graph Form,ERDDAP's Make-A-Graph .html web page for this dataset. Create an image with a map or graph of a subset of the data.,order,http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1.graph
scheme         | order
split_arr      | http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1.graph
-[ RECORD 3 ]--+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
scheme_section | ERDDAP-tabledap,ERDDAP's tabledap service (a flavor of OPeNDAP) for tabular (sequence) data. Add different extensions (e.g., .html, .graph, .das, .dds) to the base URL for different purposes.,ERDDAP:tabledap,http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1
scheme         |  .html
split_arr      | http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1
-[ RECORD 4 ]--+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
scheme_section | OPeNDAP,An OPeNDAP service for tabular (sequence) data. Add different extensions (e.g., .html, .das, .dds) to the base URL for different purposes.,OPeNDAP:OPeNDAP,http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1
scheme         |  .html
split_arr      | http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1

On the other hand, THREDDS related endpoints don't have commas in the description, so the scheme is correct:

pycsw=# WITH scheme_table AS (SELECT regexp_split_to_table(links, E'\\^') scheme_section from records where title = 'NECOFS (FVCOM) - Hampton - Latest Forecast'),
    split_scheme AS (SELECT scheme_section, regexp_split_to_array(scheme_section, ',') split_arr from scheme_table)
    select scheme_section, split_arr[3] scheme, split_arr[array_length(split_arr, 1)] from split_scheme;
-[ RECORD 1 ]--+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
scheme_section | File Information,This URL provides a standard OPeNDAP html interface for selecting data from this dataset. Change the extension to .info for a description of the dataset.,WWW:LINK,http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc.html
scheme         | WWW:LINK
split_arr      | http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc.html
-[ RECORD 2 ]--+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
scheme_section | Viewer Information,This URL provides an NCDC climate and weather toolkit view of an OPeNDAP resource.,WWW:LINK,http://www.ncdc.noaa.gov/oa/wct/wct-jnlp-beta.php?singlefile=http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc
scheme         | WWW:LINK
split_arr      | http://www.ncdc.noaa.gov/oa/wct/wct-jnlp-beta.php?singlefile=http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc
-[ RECORD 3 ]--+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
scheme_section | OPeNDAP,THREDDS OPeNDAP,OPeNDAP:OPeNDAP,http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc
scheme         | OPeNDAP:OPeNDAP
split_arr      | http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc
-[ RECORD 4 ]--+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
scheme_section | OGC-WMS,Open Geospatial Consortium Web Map Service (WMS),OGC:WMS,http://www.smast.umassd.edu:8080/ncWMS2/wms?service=WMS&version=1.3.0&request=GetCapabilities
scheme         | OGC:WMS
split_arr      | http://www.smast.umassd.edu:8080/ncWMS2/wms?service=WMS&version=1.3.0&request=GetCapabilities
-[ RECORD 5 ]--+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
scheme_section | THREDDS_HTTP_Service,THREDDS HTTP Service,file,http://www.smast.umassd.edu:8080/thredds/fileServer/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc
scheme         | file
split_arr      | http://www.smast.umassd.edu:8080/thredds/fileServer/FVCOM/NECOFS/Forecasts/NECOFS_FVCOM_OCEAN_HAMPTON_FORECAST.nc

This is a classic drawback of splitting strings on a delimiter that can also appear in the data. I think I may have some workarounds that won't break existing code in PyCSW, so I'll create a bug report for the PyCSW folks shortly.
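
One possible workaround, sketched here purely as an illustration (it is not necessarily what will be proposed to PyCSW), is to count the scheme and URL from the right-hand end of each link. It assumes every link has the four fields name, description, scheme, and URL in that order, and that only the description may contain commas; the function name `parse_link` is hypothetical.

```python
def parse_link(link):
    """Parse 'name,description,scheme,url' where only the description may hold commas."""
    name, rest = link.split(",", 1)                 # name is everything before the first comma
    description, scheme, url = rest.rsplit(",", 2)  # scheme and url are the last two fields
    return {"name": name, "description": description, "scheme": scheme, "url": url}


ERDDAP_LINK = (
    "ERDDAP-tabledap,ERDDAP's tabledap service (a flavor of OPeNDAP) for tabular "
    "(sequence) data. Add different extensions (e.g., .html, .graph, .das, .dds) "
    "to the base URL for different purposes.,ERDDAP:tabledap,"
    "http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1"
)

parsed = parse_link(ERDDAP_LINK)
print(parsed["scheme"])  # ERDDAP:tabledap
print(parsed["url"])
```

This keeps the stored format unchanged, but it still breaks if a name or URL contains a comma, which is why escaping the delimiters when the links are written is the more robust option.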

mwengren commented 6 years ago

Thanks @benjwadams for tracking it down! Seems this problem is bigger than us, as I'd feared. Bug report sounds good, although what version of pycsw are we running? We may be a bit behind in the 1.x series. Since I have their handles, I'll add @tomkralidis and @kalxas to this issue so they are aware anyway.

It's also potentially solvable via ERDDAP, although that's probably the less ideal fix since the issue could easily repeat itself depending on the source ISO metadata. If the description field for those ERDDAP:tabledap links were changed to remove the commas, though, we'd be OK:

<gmd:CI_OnlineResource>
              <gmd:linkage>
                <gmd:URL>http://erddap.secoora.org/erddap/tabledap/noaa_nos_co_ops_8667259</gmd:URL>
              </gmd:linkage>
              <gmd:protocol>
                <gco:CharacterString>ERDDAP:tabledap</gco:CharacterString>
              </gmd:protocol>
              <gmd:name>
                <gco:CharacterString>ERDDAP-tabledap</gco:CharacterString>
              </gmd:name>
              <gmd:description>
                <gco:CharacterString>ERDDAP's tabledap service (a flavor of OPeNDAP) for tabular (sequence) data. Add different extensions (e.g., .html, .graph, .das, .dds) to the base URL for different purposes.</gco:CharacterString>
              </gmd:description>
              <gmd:function>
                <gmd:CI_OnLineFunctionCode codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_OnLineFunctionCode" codeListValue="download">download</gmd:CI_OnLineFunctionCode>
              </gmd:function>
</gmd:CI_OnlineResource>

That would require @BobSimons to patch in the next ERDDAP release and of course all our providers to update. Not going to happen soon.

Maybe there's a short-term hack we can do to fix things in the meantime....

rsignell-usgs commented 6 years ago

  <gmd:description>
        <gco:CharacterString>ERDDAP's tabledap service (a flavor of OPeNDAP) for tabular (sequence) data. Add different extensions (e.g., .html, .graph, .das, .dds) to the base URL for different purposes.</gco:CharacterString>
  </gmd:description>

@BobSimons (or @benjwadams) where is this tabledap description coming from?
(I didn't find it in the ERDDAP source code at https://github.com/BobSimons/erddap)

BobSimons commented 6 years ago

It's in EDDTable, but broken into chunks. Search for: ERDDAP's tabledap service (a flavor of OPeNDAP)

-- Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The opinions in this message are mine personally and do not necessarily reflect any position of the U.S. Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

rsignell-usgs commented 6 years ago

@BobSimons I searched for "flavor" and it didn't come up:

https://github.com/BobSimons/erddap/search?q=flavor&unscoped_q=flavor

[screenshot: 2018-07-27_15-51-07]

mwengren commented 6 years ago

Those metadata (like the one I linked to in a gist above) are taken directly from the ERDDAP WAFs by Catalog, so as @BobSimons says the ERDDAP codebase is what is generating those description sentences. Where though, I'm not sure...

I had meant to talk to you at some point Bob about how we could harmonize if necessary ncISO's approach to generating ISO XML with ERDDAP's but haven't had the time to get into that. Maybe this issue coming up is a good lead in to that.

BobSimons commented 6 years ago

Harmonizing with ncISO is on my list of things to do.

benjwadams commented 6 years ago

Update: I have a hotfix coming along which adds some character escaping rules. I am hopeful that it will fix some of the scheme issues.
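
For context, here is one generic way character escaping can make a delimited format safe. It is an illustration only and does not claim to match the rules in the hotfix: it assumes a backslash is used as the escape character for ',' and '^', and the names `escape`, `serialize_links`, and `parse_links` are hypothetical.

```python
import re


def escape(field):
    """Prefix backslashes, commas, and carets with a backslash."""
    return re.sub(r"([\\,^])", r"\\\1", field)


def serialize_links(links):
    """Join fields with ',' and links with '^', escaping each field first."""
    return "^".join(",".join(escape(f) for f in fields) for fields in links)


def parse_links(text):
    """Single-pass parser that splits only on unescaped ',' and '^'."""
    links, fields, current = [], [], []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            current.append(next(chars, ""))  # take the escaped character literally
        elif ch == ",":
            fields.append("".join(current))
            current = []
        elif ch == "^":
            fields.append("".join(current))
            links.append(fields)
            fields, current = [], []
        else:
            current.append(ch)
    fields.append("".join(current))
    links.append(fields)
    return links


links = [
    ["ERDDAP-tabledap", "Add different extensions (e.g., .html, .graph, .das, .dds)",
     "ERDDAP:tabledap", "http://erddap.cencoos.org/erddap/tabledap/noaa_nos_co_ops_prjc1"],
    ["OPeNDAP", "THREDDS OPeNDAP", "OPeNDAP:OPeNDAP", "http://example.org/dodsC/file.nc"],
]
stored = serialize_links(links)
print(parse_links(stored)[0][2])  # ERDDAP:tabledap
```

A single-pass scanner is used instead of a regular-expression split because a lookbehind for a preceding backslash misfires when the backslash itself is escaped.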

benjwadams commented 6 years ago

@rsignell-usgs, @ocefpaf, I pushed up a hotfix to production, based off of https://github.com/benjwadams/pycsw/commit/24f96822b6b6448781d8d8b6f5b1789d5f463bb7. I can't find the PRJC1 dataset in the catalog anymore for some reason, but please try querying against another dataset in the catalog that has ERDDAP endpoints. The scheme should now be fixed; please test and report your results.

rsignell-usgs commented 6 years ago

Yes, I see them! https://gist.github.com/rsignell-usgs/487d51feea270a50b5d0580c3e4de5b5

Do the other endpoints look okay?

ocefpaf commented 6 years ago

@rsignell-usgs was faster than me but here is the notebook I used before:

http://nbviewer.jupyter.org/gist/ocefpaf/d8e12f30a7f1e62471ec7f5f6617c5f4

In cell [11] we can see one ERDDAP:tabledap for a glider. However, we still need to figure out why we cannot find datasets from http://erddap.secoora.org/erddap

This is probably unrelated to the pycsw issue that @benjwadams just fixed.

benjwadams commented 6 years ago

@ocefpaf, could you create a separate issue for that? The records in question don't seem to exist in the PyCSW database, whereas this issue dealt with records that did exist but returned the improper scheme.

ocefpaf commented 6 years ago

@ocefpaf, could you create a separate issue for that?

Sure but the original issue I created was exactly about finding the SECOORA endpoints with a catalog search. We found 1 of the issues and you fixed that in pycsw but the main issue remains :smile:

So... Do we really need a new issue that will be a copy-n-paste of https://github.com/ioos/catalog/issues/62#issue-344532564 ?

benjwadams commented 6 years ago

@ocefpaf, my omission, sorry. I kinda got tunnel vision tracking down the cause of the incorrect scheme in the references. Nonetheless, for bookkeeping purposes, it'd be easiest to track down in the newer issue.

ocefpaf commented 6 years ago

No problem. @rsignell-usgs created #64 so let's move the discussion there.