ioos / registry

Getting data services registered in the IOOS Service Registry
http://ioos.github.io/registry/

Python tools for translating THREDDS metadata to PyCSW formats #45

Open robragsdale opened 9 years ago

robragsdale commented 9 years ago

@danramage and @pacioos Dan Ramage is implementing a PyCSW server and asked if anyone had any python tools that would make it less painful to go from the THREDDS metadata to one of the formats PyCSW supports for importing. There is a thread in the PyCSW issues page about this, but Dan didn't see a conclusion to it.

Dan

robragsdale commented 9 years ago

The ncISO portion of THREDDS can output ISO 19115-2 metadata files. Have you tried to use these in pycsw? If they work, you could use https://github.com/kwilcox/thredds_crawler to download the ISO files you want to ingest into pycsw. A direct connector in pycsw would also be a good idea...

Kyle

robragsdale commented 9 years ago

I saw there was the command-line Java piece, so I am trying that and it works. Let me back up: my definition of "works" is that I see an entry in the database after I created the XML metadata file and then pointed pycsw to ingest it. I am not sure yet whether all the fields are populated in a useful manner, as I am learning the pycsw schema at the same time.

Dan

rsignell-usgs commented 9 years ago

Also discussed here: https://github.com/geopython/pycsw/issues/155

robragsdale commented 9 years ago

Dan,

Wasn't sure you were referring to Java ncISO in your original post. NERACOOS runs this nightly against 2 of our TDS catalogs and produces a WAF. http://www.neracoos.org/WAF/

The ISO files are at: http://www.neracoos.org/WAF/UMaine/iso/ http://www.neracoos.org/WAF/BIO/iso/

AFAIK the ncISO crawler produces valid ISO files via the ncISO TDS plugin, and the NGDC Geoportal ingests them successfully.

I posted a simple python 2.7 script we use to run this via cron.

https://github.com/neracoos-open/neracoos_catalog/tree/master/src/MetadataWAF

It has a rename option, since we are running ncSOS, which requires an ncml extension to overcome a TDS aggregation cache issue, but that is optional.

The TDS catalog URLs are: http://www.neracoos.org/thredds/UMO_SOS_historical_realtime_agg.html and http://www.neracoos.org/thredds/catalog/WW3/catalog.html

Hope this helps.

Eric

robragsdale commented 9 years ago

Hi Dan, PyCSW imports ISO XML files directly using the pycsw-admin.py utility that comes with it. You just point it to your WAF. And maintaining a WAF is as simple as wget'ting all your ncISO end points from TDS (as Eric expounded on). Not a lot of extra glue needed, but here is my PyCSW loading script to give you an idea: https://www.dropbox.com/s/w36vmgttbn64t7s/update_pycsw.py?dl=0. PyCSW drives our data search page here: http://pacioos.org/search/. Other details regarding our metadata and WAFs here: http://pacioos.org/metadata/ Cheers, John Maurer, PacIOOS
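A minimal sketch of the loading step John describes, assuming the older `pycsw-admin.py -c load_records` command-line syntax from the era of this thread (newer pycsw releases changed the admin interface, so check `pycsw-admin.py -h` for your version). The config file name and WAF directory below are hypothetical placeholders, and this is not John's actual script.

```python
import subprocess


def build_load_command(config_path, records_dir, recursive=True):
    """Build the pycsw-admin.py argument list for loading ISO XML records.

    Assumes the legacy flag style: -c <command> -f <config> -p <path> [-r].
    """
    cmd = ["pycsw-admin.py", "-c", "load_records",
           "-f", config_path, "-p", records_dir]
    if recursive:
        cmd.append("-r")  # descend into subdirectories of the WAF
    return cmd


def load_records(config_path, records_dir, dry_run=True):
    """Run (or, with dry_run=True, only return) the load command."""
    cmd = build_load_command(config_path, records_dir)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if pycsw-admin.py fails
    return " ".join(cmd)


if __name__ == "__main__":
    # Hypothetical paths; prints the command without executing it.
    print(load_records("default.cfg", "/data/waf/iso"))
```

Keeping the command construction separate from execution makes it easy to log or inspect exactly what a cron job would run.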

robragsdale commented 9 years ago

John,

Thanks for the info. I was actually playing around on your search page before I started working with pycsw. It's good to see someone using ncISO in production; that gave me confidence it provides useful metadata to prime the catalog with. In your results, you provide the access methods. Where is that coming from? I ran ncISO against a THREDDS endpoint and expected to see pycsw populate the links column, but nothing was there.

Hi Dan, The access methods should be captured by pycsw. Originally pycsw was missing the XML elements that ncISO uses to populate these, but Rich Signell and I submitted the issue to pycsw and Tom Kralidis added support. See https://github.com/geopython/pycsw/issues/238 for details. Are you sure you have the latest pycsw, and that your ncISO output contains gmd:distributorTransferOptions and/or srv:SV_ServiceIdentification elements? Cheers, John

John,

Those seem to be populated. Is there a specific column (or columns) they go into in the pycsw schema, or are you parsing them out of the XML?

Dan

Within the pycsw database, they go in the "links" column as you stated earlier. If you are running CSW commands to query your database, where you parse the XML response depends on which outputSchema you are requesting. In the CSW schema you look at dct:references, in the gmd schema (ISO) you look for gmd:URL under gmd:distributorTransferOptions and srv:SV_ServiceIdentification. (I do the latter so I can also pull gmd:Name.) John
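As a rough illustration of the CSW-schema case John mentions, the sketch below pulls `dct:references` values out of a `csw:Record`. The sample record, its URLs, and the `scheme` values are made up for the example; a real response would come from a GetRecords or GetRecordById request against your own pycsw endpoint.

```python
import xml.etree.ElementTree as ET

DCT_REFERENCES = "{http://purl.org/dc/terms/}references"

# Hypothetical record for illustration only.
SAMPLE_RECORD = """
<csw:Record xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:dct="http://purl.org/dc/terms/">
  <dc:identifier>example-dataset</dc:identifier>
  <dct:references scheme="OGC:WMS">http://example.org/thredds/wms/example</dct:references>
  <dct:references scheme="OPeNDAP:OPeNDAP">http://example.org/thredds/dodsC/example</dct:references>
</csw:Record>
"""


def csw_references(record_xml):
    """Return (scheme, url) pairs for every dct:references element."""
    root = ET.fromstring(record_xml.strip())
    return [
        (el.get("scheme"), (el.text or "").strip())
        for el in root.iter(DCT_REFERENCES)
    ]


if __name__ == "__main__":
    for scheme, url in csw_references(SAMPLE_RECORD):
        print(scheme, url)
```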

robragsdale commented 9 years ago

Please be aware that distribution links in the ISO metadata can be available from two different XPaths in the distribution section of ISO. ncISO always outputs the links at this XPath:

gmd:distributionInfo/gmd:MD_Distribution/gmd:distributor/gmd:MD_Distributor/gmd:distributorTransferOptions/gmd:MD_DigitalTransferOptions/gmd:onLine/gmd:CI_OnlineResource

But you may stumble across ISO records that also have links at this XPath:

gmd:distributionInfo/gmd:MD_Distribution/gmd:transferOptions/gmd:MD_DigitalTransferOptions/gmd:onLine/gmd:CI_OnlineResource

Anna
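To check which of the two locations a given record uses, something like the following sketch can help. It uses only the standard library, and the sample record and its URLs are invented for illustration; it is not how pycsw or OWSLib does the parsing, just a quick way to inspect a file.

```python
import xml.etree.ElementTree as ET

NS = {"gmd": "http://www.isotc211.org/2005/gmd"}

DIST = "gmd:distributionInfo/gmd:MD_Distribution"
ONLINE = "gmd:MD_DigitalTransferOptions/gmd:onLine/gmd:CI_OnlineResource"

# The two XPaths from the comment above, relative to the metadata root.
LINK_PATHS = [
    f"{DIST}/gmd:distributor/gmd:MD_Distributor/gmd:distributorTransferOptions/{ONLINE}",
    f"{DIST}/gmd:transferOptions/{ONLINE}",
]

# Hypothetical record containing one link at each location.
SAMPLE_ISO = """
<gmi:MI_Metadata xmlns:gmi="http://www.isotc211.org/2005/gmi"
                 xmlns:gmd="http://www.isotc211.org/2005/gmd">
  <gmd:distributionInfo><gmd:MD_Distribution>
    <gmd:distributor><gmd:MD_Distributor>
      <gmd:distributorTransferOptions><gmd:MD_DigitalTransferOptions>
        <gmd:onLine><gmd:CI_OnlineResource>
          <gmd:linkage><gmd:URL>http://example.org/thredds/dodsC/example.html</gmd:URL></gmd:linkage>
        </gmd:CI_OnlineResource></gmd:onLine>
      </gmd:MD_DigitalTransferOptions></gmd:distributorTransferOptions>
    </gmd:MD_Distributor></gmd:distributor>
    <gmd:transferOptions><gmd:MD_DigitalTransferOptions>
      <gmd:onLine><gmd:CI_OnlineResource>
        <gmd:linkage><gmd:URL>http://example.org/download/example.nc</gmd:URL></gmd:linkage>
      </gmd:CI_OnlineResource></gmd:onLine>
    </gmd:MD_DigitalTransferOptions></gmd:transferOptions>
  </gmd:MD_Distribution></gmd:distributionInfo>
</gmi:MI_Metadata>
"""


def distribution_urls(iso_xml):
    """Collect gmd:URL values from both distribution XPaths, without duplicates."""
    root = ET.fromstring(iso_xml.strip())
    urls = []
    for path in LINK_PATHS:
        for resource in root.findall(path, NS):
            url = resource.findtext("gmd:linkage/gmd:URL", default="", namespaces=NS).strip()
            if url and url not in urls:
                urls.append(url)
    return urls


if __name__ == "__main__":
    for url in distribution_urls(SAMPLE_ISO):
        print(url)
```

A parser that only looks at one of the two paths will silently miss links stored at the other, which is worth ruling out when a links column comes up empty.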

robragsdale commented 9 years ago

Anna, thanks for the info. I think this might turn out to be the real issue. Searching through the XML, while I do see OGC-WMS for instance, it is wrapped in a gmd:identificationInfo hierarchy.

Dan

Hi Dan,

Maybe I’m not entirely following this thread, but figure I’ll chime in as I’ve been working on issues like what you are talking about for quite a while.

This is normal and intended behavior. ‘service’ endpoints that are not meant for humans to navigate to are in srv:serviceIdentification elements. ‘links’ are in the gmd:distributionInfo element.

I think pyCSW is built on OWSLib? We worked with the OWSLib developers a while back to make sure that srv:serviceIdentification, as generated by ncISO, was parsed into their ISO metadata Python object. If it's currently not supported by pyCSW, it shouldn’t be too difficult to add.

Dave

Dave,

I think it is in pyCSW now, however I'm wondering if something is missing at the THREDDS endpoint I am testing with. I may have the wrong branch, as I pulled from the master at GitHub. The xml file I am testing with: http://gsaaportal.org/media/metadata/xml/SABGOM_Forecast_Model_Run_Collection_best_ISO.xml does have a gmd:distributionInfo element, but the links column is not being populated from it.

Dan

Glancing at this record, it looks like a normal ncISO record.

DistributionInfo includes a link to the OPeNDAP service .html page and the weather and climate toolkit link to view the dataset with that.

If you have a look here: https://github.com/geopython/OWSLib/blob/master/owslib/iso.py#L464

It appears that the code is looking in: gmd:distributionInfo/gmd:MD_Distribution/gmd:transferOptions/gmd:MD_DigitalTransferOptions/gmd:onLine/gmd:CI_OnlineResource

Not, gmd:distributionInfo/gmd:MD_Distribution/gmd:distributor/gmd:MD_Distributor/gmd:distributorTransferOptions/gmd:MD_DigitalTransferOptions/gmd:onLine/gmd:CI_OnlineResource

As it should be.

Shouldn’t be a hard fix to get in there.

Dave

My issue was all on my end. The Python virtual environment I was using on the server was the culprit. I'm not exactly sure what the issue was, but I built a new 2.7.8 environment, added the requirements, and now I can import links.

Thanks for the feedback.

Dan