Open dblodgett-usgs opened 2 years ago
If we take the route of JSON files representing each source, it would be an option to have the crawler update the crawler source table at the start of its run rather than requiring a Liquibase execution first. Having the crawler manage its own table makes sense to me, and avoiding table data loading with Liquibase should help simplify database management long term (manage schema, not data).
Good call -- I think this would be a big improvement. Let's keep an eye on it. I'd meant to fit this in last year and it didn't happen. It's for sure a priority.
Example JSON file for a single source:
```json
{
  "crawlerSourceId": 5,
  "sourceName": "NWIS Surface Water Sites",
  "sourceSuffix": "nwissite",
  "sourceUri": "https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=usgs_nldi_gages.geojson",
  "featureId": "provider_id",
  "featureName": "name",
  "featureUri": "subjectOf",
  "featureReach": "nhdpv2_REACHCODE",
  "featureMeasure": "nhdpv2_REACH_measure",
  "ingestType": "reach",
  "featureType": "hydrolocation"
}
```
The key names map directly to the database columns, but we could simplify them as needed.
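To make the earlier idea of the crawler updating its own table at the start of a run a bit more concrete, here is a minimal upsert sketch in Python against a hypothetical Postgres layout. The nldi_data.crawler_source table name, the snake_case column names, and the use of psycopg2 are assumptions inferred from the key names above, not the crawler's actual implementation.

```python
# Rough sketch only: refresh the crawler source table from per-source JSON
# files at the start of a crawler run. Table, columns, and driver are
# assumptions; the ON CONFLICT target assumes crawler_source_id is the
# primary key.
import glob
import json

import psycopg2

UPSERT = """
    INSERT INTO nldi_data.crawler_source (
        crawler_source_id, source_name, source_suffix, source_uri,
        feature_id, feature_name, feature_uri, feature_reach,
        feature_measure, ingest_type, feature_type)
    VALUES (%(crawlerSourceId)s, %(sourceName)s, %(sourceSuffix)s, %(sourceUri)s,
            %(featureId)s, %(featureName)s, %(featureUri)s, %(featureReach)s,
            %(featureMeasure)s, %(ingestType)s, %(featureType)s)
    ON CONFLICT (crawler_source_id) DO UPDATE SET
        source_name = EXCLUDED.source_name,
        source_suffix = EXCLUDED.source_suffix,
        source_uri = EXCLUDED.source_uri,
        feature_id = EXCLUDED.feature_id,
        feature_name = EXCLUDED.feature_name,
        feature_uri = EXCLUDED.feature_uri,
        feature_reach = EXCLUDED.feature_reach,
        feature_measure = EXCLUDED.feature_measure,
        ingest_type = EXCLUDED.ingest_type,
        feature_type = EXCLUDED.feature_type
"""

def refresh_sources(conn, source_dir="crawler_sources"):
    """Upsert one crawler_source row per JSON file before crawling begins."""
    with conn, conn.cursor() as cur:  # commits on success, rolls back on error
        for path in sorted(glob.glob(f"{source_dir}/*.json")):
            with open(path) as fp:
                cur.execute(UPSERT, json.load(fp))
```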
Here's a first draft of a schema to validate the crawler source files. Let me know what you think, along with any suggestions for better descriptions/names, @dblodgett-usgs. The validation is very loose; most of it is character limits that match the database table.
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Crawler Source",
  "description": "A source from which the Crawler can ingest features.",
  "type": "object",
  "properties": {
    "id": {
      "description": "The unique identifier for the source",
      "type": "integer",
      "minimum": 0,
      "maximum": 2147483647
    },
    "name": {
      "description": "A human-readable name for the source",
      "type": "string",
      "pattern": "^[0-9a-zA-Z _-]{1,500}$"
    },
    "suffix": {
      "description": "Unique suffix for database and service use",
      "type": "string",
      "pattern": "^[0-9a-zA-Z_-]{1,1000}$"
    },
    "uri": {
      "description": "Source location to download GeoJSON features",
      "type": "string",
      "pattern": "^.{1,256}$"
    },
    "feature": {
      "description": "Metadata of the features",
      "$ref": "#/$defs/feature"
    },
    "ingestType": {
      "description": "Method used to index the feature",
      "type": "string",
      "pattern": "^(reach|point)$"
    }
  },
  "required": [
    "id",
    "name",
    "suffix",
    "uri",
    "feature",
    "ingestType"
  ],
  "$defs": {
    "feature": {
      "type": "object",
      "required": [
        "id",
        "type",
        "name",
        "uri"
      ],
      "properties": {
        "id": {
          "type": "string",
          "description": "Key name that maps to the ID of the feature",
          "pattern": "^.{1,500}$"
        },
        "type": {
          "type": "string",
          "description": "Associated location type for this feature",
          "pattern": "^(hydrolocation|type|varies)$"
        },
        "name": {
          "type": "string",
          "description": "Key name that maps to the name of the feature",
          "pattern": "^.{1,500}$"
        },
        "uri": {
          "type": "string",
          "description": "Key name that maps to the URI of the feature",
          "pattern": "^.{1,256}$"
        },
        "reach": {
          "type": "string",
          "description": "Key name that maps to the reachcode of the feature",
          "pattern": "^.{1,500}$"
        },
        "measure": {
          "type": "string",
          "description": "Key name that maps to the measure of the feature",
          "pattern": "^.{1,500}$"
        }
      }
    }
  }
}
```
Optionally, we could validate the source URI with a HEAD request to check that we get a 200 response.
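To show how the schema and the HEAD check could fit together, here is a rough Python sketch using the jsonschema and requests packages. The schema file name is a placeholder, and the inline source is just the NWIS example from above rewritten with the simplified key names the schema uses.

```python
# Hypothetical sketch: validate one crawler source against the draft schema,
# then optionally confirm the source URI is reachable with a HEAD request.
import json

import requests
from jsonschema import Draft202012Validator

with open("crawler_source.schema.json") as fp:  # placeholder file name
    validator = Draft202012Validator(json.load(fp))

# The NWIS example from above, using the simplified key names.
source = {
    "id": 5,
    "name": "NWIS Surface Water Sites",
    "suffix": "nwissite",
    "uri": "https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=usgs_nldi_gages.geojson",
    "feature": {
        "id": "provider_id",
        "name": "name",
        "uri": "subjectOf",
        "reach": "nhdpv2_REACHCODE",
        "measure": "nhdpv2_REACH_measure",
        "type": "hydrolocation"
    },
    "ingestType": "reach"
}

errors = list(validator.iter_errors(source))
for err in errors:
    path = "/".join(str(p) for p in err.path) or "(root)"
    print(f"{path}: {err.message}")

# Optional: a HEAD request to check the source URI responds with a 200.
if not errors:
    requests.head(source["uri"], allow_redirects=True, timeout=30).raise_for_status()
```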
I like that idea. I think we should go ahead and get this implemented.
Currently, the crawler source TSV file is cumbersome, and adding a new crawler source requires a fairly heavy database operation.
We've always imagined having a UI of some kind that allows registration of new crawler sources.
Let's work in that direction by implementing a standalone crawler source JSON object for each source that can be validated and may grow over time per #63 -- a GitHub Action could then be set up to test the JSON objects and create the crawler source TSV file?
In a future sprint, we could build a UI around the JSON objects so that the contribution model is no longer a PR but some kind of interface.
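For what it's worth, the Action step described above could boil down to a small script run on each PR, something like the sketch below. The directory layout, schema file name, output file name, and TSV column order are all placeholders, not decisions.

```python
# Rough CI sketch: validate every source JSON file against the draft schema
# and regenerate the crawler source TSV; exit non-zero so a failing file
# fails the Action run.
import csv
import glob
import json
import sys

from jsonschema import Draft202012Validator

FIELDS = [
    "id", "name", "suffix", "uri", "ingestType",
    "feature.id", "feature.name", "feature.uri",
    "feature.reach", "feature.measure", "feature.type",
]

def flatten(source):
    """Flatten one source object into a single TSV row."""
    row = {key: source.get(key, "") for key in ("id", "name", "suffix", "uri", "ingestType")}
    for key, value in source.get("feature", {}).items():
        row[f"feature.{key}"] = value
    return row

def main():
    with open("crawler_source.schema.json") as fp:  # placeholder file name
        validator = Draft202012Validator(json.load(fp))

    rows, failed = [], False
    for path in sorted(glob.glob("crawler_sources/*.json")):  # placeholder dir
        with open(path) as fp:
            source = json.load(fp)
        errors = list(validator.iter_errors(source))
        for err in errors:
            print(f"{path}: {err.message}", file=sys.stderr)
        if errors:
            failed = True
        else:
            rows.append(flatten(source))

    if failed:
        sys.exit(1)  # surface the validation errors in the PR check

    with open("crawler_source.tsv", "w", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    main()
```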