Open dblodgett-usgs opened 2 years ago
If we take the route of JSON files representing each source, it would be an option to have the crawler update the crawler source table at the start of its run rather than requiring a Liquibase execution first. Having the crawler manage its own table makes sense to me, and avoiding table data loading with Liquibase should help simplify database management long term (manage schema, not data).
Good call -- I think this would be a big improvement. Let's keep an eye on it. I'd meant to fit this in last year and it didn't happen. It's for sure a priority.
Example JSON file for a single source:
```json
{
  "crawlerSourceId": 5,
  "sourceName": "NWIS Surface Water Sites",
  "sourceSuffix": "nwissite",
  "sourceUri": "https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=usgs_nldi_gages.geojson",
  "featureId": "provider_id",
  "featureName": "name",
  "featureUri": "subjectOf",
  "featureReach": "nhdpv2_REACHCODE",
  "featureMeasure": "nhdpv2_REACH_measure",
  "ingestType": "reach",
  "featureType": "hydrolocation"
}
```
The key names map directly to the database columns, but we could simplify them as needed.
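To make the earlier idea of the crawler updating its own table at the start of a run a bit more concrete, here is a minimal upsert sketch in Python against a hypothetical Postgres layout. The nldi_data.crawler_source table name, the snake_case column names, and the use of psycopg2 are assumptions inferred from the key names above, not the crawler's actual implementation.

```python
# Rough sketch only: refresh the crawler source table from per-source JSON
# files at the start of a crawler run. Table, columns, and driver are
# assumptions; the ON CONFLICT target assumes crawler_source_id is the
# primary key.
import glob
import json

import psycopg2

UPSERT = """
    INSERT INTO nldi_data.crawler_source (
        crawler_source_id, source_name, source_suffix, source_uri,
        feature_id, feature_name, feature_uri, feature_reach,
        feature_measure, ingest_type, feature_type)
    VALUES (%(crawlerSourceId)s, %(sourceName)s, %(sourceSuffix)s, %(sourceUri)s,
            %(featureId)s, %(featureName)s, %(featureUri)s, %(featureReach)s,
            %(featureMeasure)s, %(ingestType)s, %(featureType)s)
    ON CONFLICT (crawler_source_id) DO UPDATE SET
        source_name = EXCLUDED.source_name,
        source_suffix = EXCLUDED.source_suffix,
        source_uri = EXCLUDED.source_uri,
        feature_id = EXCLUDED.feature_id,
        feature_name = EXCLUDED.feature_name,
        feature_uri = EXCLUDED.feature_uri,
        feature_reach = EXCLUDED.feature_reach,
        feature_measure = EXCLUDED.feature_measure,
        ingest_type = EXCLUDED.ingest_type,
        feature_type = EXCLUDED.feature_type
"""

def refresh_sources(conn, source_dir="crawler_sources"):
    """Upsert one crawler_source row per JSON file before crawling begins."""
    with conn, conn.cursor() as cur:  # commits on success, rolls back on error
        for path in sorted(glob.glob(f"{source_dir}/*.json")):
            with open(path) as fp:
                cur.execute(UPSERT, json.load(fp))
```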
Here's a first draft of a schema to validate the crawler source files. Let me know what you think, along with any suggestions for better descriptions/names, @dblodgett-usgs. The validation is very loose; most of it is character limits that match the database table.
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Crawler Source",
  "description": "A source from which the Crawler can ingest features.",
  "type": "object",
  "properties": {
    "id": {
      "description": "The unique identifier for the source",
      "type": "integer",
      "minimum": 0,
      "maximum": 2147483647
    },
    "name": {
      "description": "A human-readable name for the source",
      "type": "string",
      "pattern": "^[0-9a-zA-Z _-]{1,500}$"
    },
    "suffix": {
      "description": "Unique suffix for database and service use",
      "type": "string",
      "pattern": "^[0-9a-zA-Z_-]{1,1000}$"
    },
    "uri": {
      "description": "Source location to download GeoJSON features",
      "type": "string",
      "pattern": "^.{1,256}$"
    },
    "feature": {
      "description": "Metadata of the features",
      "$ref": "#/$defs/feature"
    },
    "ingestType": {
      "description": "Method used to index the feature",
      "type": "string",
      "pattern": "^(reach|point)$"
    }
  },
  "required": [
    "id",
    "name",
    "suffix",
    "uri",
    "feature",
    "ingestType"
  ],
  "$defs": {
    "feature": {
      "type": "object",
      "required": [
        "id",
        "type",
        "name",
        "uri"
      ],
      "properties": {
        "id": {
          "type": "string",
          "description": "Key name that maps to the ID of the feature",
          "pattern": "^.{1,500}$"
        },
        "type": {
          "type": "string",
          "description": "Associated location type for this feature",
          "pattern": "^(hydrolocation|type|varies)$"
        },
        "name": {
          "type": "string",
          "description": "Key name that maps to the name of the feature",
          "pattern": "^.{1,500}$"
        },
        "uri": {
          "type": "string",
          "description": "Key name that maps to the URI of the feature",
          "pattern": "^.{1,256}$"
        },
        "reach": {
          "type": "string",
          "description": "Key name that maps to the reachcode of the feature",
          "pattern": "^.{1,500}$"
        },
        "measure": {
          "type": "string",
          "description": "Key name that maps to the measure of the feature",
          "pattern": "^.{1,500}$"
        }
      }
    }
  }
}
```
Optionally, we could validate the source URI with a HEAD request to check that we get a 200 response.
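To show how the schema and the HEAD check could fit together, here is a rough Python sketch using the jsonschema and requests packages. The schema file name is a placeholder, and the inline source is just the NWIS example from above rewritten with the simplified key names the schema uses.

```python
# Hypothetical sketch: validate one crawler source against the draft schema,
# then optionally confirm the source URI is reachable with a HEAD request.
import json

import requests
from jsonschema import Draft202012Validator

with open("crawler_source.schema.json") as fp:  # placeholder file name
    validator = Draft202012Validator(json.load(fp))

# The NWIS example from above, using the simplified key names.
source = {
    "id": 5,
    "name": "NWIS Surface Water Sites",
    "suffix": "nwissite",
    "uri": "https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=usgs_nldi_gages.geojson",
    "feature": {
        "id": "provider_id",
        "name": "name",
        "uri": "subjectOf",
        "reach": "nhdpv2_REACHCODE",
        "measure": "nhdpv2_REACH_measure",
        "type": "hydrolocation"
    },
    "ingestType": "reach"
}

errors = list(validator.iter_errors(source))
for err in errors:
    path = "/".join(str(p) for p in err.path) or "(root)"
    print(f"{path}: {err.message}")

# Optional: a HEAD request to check the source URI responds with a 200.
if not errors:
    requests.head(source["uri"], allow_redirects=True, timeout=30).raise_for_status()
```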
I like that idea. I think we should go ahead and get this implemented.
Currently, the crawler source TSV file is cumbersome, and adding a new crawler source requires a fairly heavy database operation.
We've always imagined having a UI of some kind that allows registration of new crawler sources.
Let's work in that direction by implementing a standalone crawler source JSON object for each source that can be validated and may grow over time per #63 -- a GitHub Action could then be set up to test the JSON objects and create the crawler source TSV file?
In a future sprint, we could build a UI around the JSON objects so that the contribution model is no longer a PR but some kind of interface.
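For what it's worth, the Action step described above could boil down to a small script run on each PR, something like the sketch below. The directory layout, schema file name, output file name, and TSV column order are all placeholders, not decisions.

```python
# Rough CI sketch: validate every source JSON file against the draft schema
# and regenerate the crawler source TSV; exit non-zero so a failing file
# fails the Action run.
import csv
import glob
import json
import sys

from jsonschema import Draft202012Validator

FIELDS = [
    "id", "name", "suffix", "uri", "ingestType",
    "feature.id", "feature.name", "feature.uri",
    "feature.reach", "feature.measure", "feature.type",
]

def flatten(source):
    """Flatten one source object into a single TSV row."""
    row = {key: source.get(key, "") for key in ("id", "name", "suffix", "uri", "ingestType")}
    for key, value in source.get("feature", {}).items():
        row[f"feature.{key}"] = value
    return row

def main():
    with open("crawler_source.schema.json") as fp:  # placeholder file name
        validator = Draft202012Validator(json.load(fp))

    rows, failed = [], False
    for path in sorted(glob.glob("crawler_sources/*.json")):  # placeholder dir
        with open(path) as fp:
            source = json.load(fp)
        errors = list(validator.iter_errors(source))
        for err in errors:
            print(f"{path}: {err.message}", file=sys.stderr)
        if errors:
            failed = True
        else:
            rows.append(flatten(source))

    if failed:
        sys.exit(1)  # surface the validation errors in the PR check

    with open("crawler_source.tsv", "w", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    main()
```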