Deploy service from Tier3 CSV file with link from harvested metadata

smrgeoinfo commented 9 years ago

See #610 for background. When a harvested record indicates that a Tier3 CSV data set is available for a resource, get the CSV, validate it, and if valid, deploy an NGDS web service. The technical requirements are described in a document in the ngds/documents repository: https://github.com/ngds/documents/blob/master/GDR_integrationProjectRequirements.docx

The requirements for identifying the correct distribution don't account for multiple distribution options, and they should have specified that a usgin: content model keyword would be present. We need to review what the application looks for in the metadata record to identify the correct distribution link to get the CSV file. @dano-reisys can you get that info?

dano-reisys commented 9 years ago

As far as I know, harvested datasets go through the validation process if they are tagged with a usgin keyword. @ydave-reisys correct me if I am wrong.

ydave-reisys commented 9 years ago

@smrazgs That's correct. We check whether the dataset is tagged with a 'usgincm:' keyword from the provided list of keywords for all content models (https://github.com/usgin-models/exchangecatalog/blob/master/keywordsNamespacesContentModel.csv) and then proceed to validation.
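A minimal sketch of the keyword check described above, assuming the keywords have already been pulled out of the harvested record as a list of strings. The function name, the case-insensitive comparison, and the way the known keywords are passed in are all assumptions for illustration; this is not the ckanext-ngds implementation, and the real keyword list is the CSV linked above.

```python
USGIN_PREFIX = "usgincm:"


def find_content_model_keyword(keywords, known_keywords):
    """Return the first 'usgincm:' keyword that is in known_keywords, else None.

    Hypothetical helper: matching is case-insensitive and ignores
    surrounding whitespace, which is an assumption, not confirmed behavior.
    """
    known = {k.strip().lower() for k in known_keywords}
    for keyword in keywords:
        normalized = keyword.strip().lower()
        if normalized.startswith(USGIN_PREFIX) and normalized in known:
            return keyword.strip()
    return None
```

A record whose keywords return None here would skip validation entirely under this reading of the check.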

smrgeoinfo commented 9 years ago

The ISO XML metadata might have multiple DigitalTransferOptions for different 'distributions', probably at least an Excel spreadsheet and a CSV version of the same file. How does the code identify the correct 'distribution' link?

ydave-reisys commented 9 years ago

@smrazgs It looks for csv format.

ccaudill commented 9 years ago

@ccaudill I looked at all datasets in http://services.stategeothermaldata.org/geoAL/csw, which has only 2 datasets with CSV files. As per the requirement, if the dataset conforms to one of the NGDS models then only CSV resources are accepted. Both datasets, Nevada Power Plant Facilities test and Alabama Well Logs test, have an xls resource along with the csv, hence neither dataset was harvested.

@smrazgs Perhaps you should comment on this too, but a valid CSV file should be harvested and published, regardless of what other distributions (file types) are in the harvested metadata. @ydave-reisys @dano-reisys
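The behavior requested here (accept the CSV even when other file types are listed) could be sketched as follows. The dictionary keys `format` and `url` and the function name are assumptions for illustration, not the harvester's actual data structures.

```python
def pick_csv_distribution(distributions):
    """Return the first distribution that looks like a CSV, or None.

    Other distributions (xls, pdf, ...) are simply skipped rather than
    causing the whole dataset to be rejected.
    """
    for dist in distributions:
        fmt = (dist.get("format") or "").strip().lower()
        url = (dist.get("url") or "").strip().lower()
        if fmt == "csv" or url.endswith(".csv"):
            return dist
    return None
```

With this approach a dataset that lists both an xls and a csv resource still yields the csv link for validation.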

ccaudill commented 9 years ago

http://repository.stategeothermaldata.org/metadata/record/eaf12e0c53a4222440a8b343a21546f6.iso.xml This is an example of a metadata record that we're trying to harvest and publish the CSV from.

ccaudill commented 9 years ago

@dano-reisys The CSW at http://services.stategeothermaldata.org/geoAL/csw now has 12 records, including 2 test resources with unpublished CSV files (the metadata for those include the elements in the metadata as specified at https://github.com/ngds/documents/blob/master/Tier3-csv-DistributionLink_inISO19139.docx). @smrazgs

This CSW is ready for your harvesting testing.

ccaudill commented 9 years ago

New rpm today. This comment is to document the first test of harvesting in CSV files for automatic publishing. The CSV files evidently did not conform to the schema, and were thus not published. Errors were given, good specific ones, so that's what we'd want to happen. I'll give a screenshot below, correct the files, then try for another test harvest. [screenshot: harvestingerror]

ccaudill commented 9 years ago

It looks as though the automatic publishing did not work. This record did not error and is valid, but the harvest only brought in the CSV and did not publish the service: http://test.geothermaldata.org/dataset/nevada-power-plant-facilities-test This is from the harvest: http://test.geothermaldata.org/harvest/ala @dano-reisys @smrazgs @kvuppala

dano-reisys commented 9 years ago

Is GeoServer running? Has it been configured?

After all the steps please update two config files. Update "ckan.hostname" & "ngds.aggregator_url" (no trailing slash) variables with correct URLs in file /etc/ckan/production.ini, and update proxyBaseUrl in file /var/lib/tomcat6/webapps/geoserver/data/global.xml, replacing 127.0.0.1 with correct URL. Restart server after update.
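A quick way to sanity-check the INI part of that step is sketched below. It assumes the two keys can sit in any section of production.ini and that the "no trailing slash" rule applies to both of them; the function name is hypothetical and nothing here comes from the NGDS installer.

```python
import configparser

REQUIRED_KEYS = ("ckan.hostname", "ngds.aggregator_url")


def check_ini_text(ini_text):
    """Return a list of problems found for the required keys in INI text."""
    # interpolation=None so values such as %(here)s are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    problems = []
    for key in REQUIRED_KEYS:
        values = [parser[name][key] for name in parser.sections()
                  if key in parser[name]]
        if not values:
            problems.append(key + " is not set")
            continue
        value = values[0].strip()
        if not value:
            problems.append(key + " is empty")
        elif value.endswith("/"):
            problems.append(key + " has a trailing slash")
    return problems
```

To run it against the real file, read /etc/ckan/production.ini into a string and pass it in; an empty list means neither key tripped these checks.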

I can't seem to get to: http://test.geothermaldata.org/geoserver-srv/web/

ccaudill commented 9 years ago

Thanks @dano-reisys Yes, this has been done on test.geothermaldata.org/geoserver. Here is the URL in the GeoServer global.xml file:

http://test.geothermaldata.org/geoserver-srv/

smrgeoinfo commented 9 years ago

@ccaudill , I get a 404 error when I try that link...

smrgeoinfo commented 9 years ago

can we try to deploy the csv manually to see if that works?

ccaudill commented 9 years ago

Yes, it does: http://test.geothermaldata.org/dataset/test-well-logs-publish I just did this one.

ccaudill commented 9 years ago

I tried another harvest after checking the metadata records, which were custom-made to make sure they had the elements that Steve outlined as needed for this task: USGIN keyword, URL to the CSV, applicationProfile string (content model namespace), and the name of the file:

    <gmd:descriptiveKeywords>
        <gmd:MD_Keywords>
            <gmd:keyword>
                <gco:CharacterString>usgincm:well log observation</gco:CharacterString>
            </gmd:keyword>
            <gmd:type>
                <gmd:MD_KeywordTypeCode codeList="http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/resources/Codelist/gmxCodelists.xml#MD_KeywordTypeCode" codeListValue="theme">theme</gmd:MD_KeywordTypeCode>
            </gmd:type>
        </gmd:MD_Keywords>
    </gmd:descriptiveKeywords>
    <gmd:MD_DigitalTransferOptions>
        <gmd:onLine>
            <gmd:CI_OnlineResource>
                <gmd:linkage>
                    <gmd:URL>http://url to get csv file</gmd:URL>
                </gmd:linkage>
                <gmd:applicationProfile>
                    <gco:CharacterString>http://stategeothermaldata.org/uri-gin/aasg/xmlschema/welllog/0.8</gco:CharacterString>
                </gmd:applicationProfile>
                <gmd:name>
                    <gco:CharacterString>NGDS Tier 3 Data, csv format: nmwelllog.csv</gco:CharacterString>
                </gmd:name>
                <gmd:function>
                    <gmd:CI_OnLineFunctionCode codeList="http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#CI_OnlineFunctionCode" codeListValue="download">download</gmd:CI_OnLineFunctionCode>
                </gmd:function>
            </gmd:CI_OnlineResource>
        </gmd:onLine>
    </gmd:MD_DigitalTransferOptions>

These still did not get published. Note that the metadata records, after being harvested into CKAN, did NOT inherit the gmd:applicationProfile element: http://test.geothermaldata.org/metadata/iso-19139/71beeb4c-d551-4ccd-b2ee-ed41279fd5ef.xml But it is definitely in the metadata records that were harvested from a Geoportal: ftp://AZGS:sharefiles@secureftp.azgs.az.gov/AZGS/ccaudill/NVPowerPlantFacilities-testMetadata.xml ftp://AZGS:sharefiles@secureftp.azgs.az.gov/AZGS/ccaudill/NMWellLogs-testMetadata.xml @smrazgs @dano-reisys @kvuppala
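One way to compare the source record with the harvested copy is to list each online resource together with its applicationProfile, as in the sketch below. It uses only the Python standard library and the standard ISO 19139 gmd/gco namespaces; the function name and the returned dictionary keys are hypothetical, and this is not how the harvester itself reads the record.

```python
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def online_resources(xml_text):
    """List url, application profile, and name for every CI_OnlineResource.

    A value of None means the element is absent from that resource.
    """
    root = ET.fromstring(xml_text)
    results = []
    for res in root.iter("{%s}CI_OnlineResource" % NS["gmd"]):
        results.append({
            "url": res.findtext("gmd:linkage/gmd:URL", namespaces=NS),
            "application_profile": res.findtext(
                "gmd:applicationProfile/gco:CharacterString", namespaces=NS),
            "name": res.findtext("gmd:name/gco:CharacterString", namespaces=NS),
        })
    return results
```

Running this on both the Geoportal record and the CKAN export would show whether `application_profile` turns into None after harvesting.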

ccaudill commented 9 years ago

Thank you @dano-reisys - great work. Looks like the auto publishing is working and finished up:

http://test.geothermaldata.org/dataset/nevada-power-plant-facilities-test http://test.geothermaldata.org/harvest/ala

ngds / ckanext-ngds-bku03232018

Deploy service from Tier3 CSV file with link from harvested metadata #625