Open dblodgett-usgs opened 1 year ago
Will tackle this issue this week.
The prerequisite will depend on getting a fresh copy of the demo database to be sure I'm working against the current standard schema and content.
`comid` for each ingested feature
`comid`, we can either drop or keep them:
`comid` is empty/null

As we add configuration options, it may be worth discussing how the crawler is invoked. Right now, it is run from the Linux command line, with a `--option`-style mechanism for altering default behavior. I wonder if it makes sense to put all configuration into a `yml` or similar input file.
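To make the config-file idea concrete, here is a minimal sketch of one way to layer settings: built-in defaults, then an optional input file, then any command-line flags the user actually passed. Everything here is an assumption for illustration: the option names (`--config`, `--source-id`, `--keep-unindexed`), the `DEFAULTS` keys, and the `load_settings` helper are hypothetical and not taken from the crawler. JSON stands in for the "similar input file" so the sketch needs only the standard library; a `yml` file would need a YAML parser in place of `json.load`.

```python
import argparse
import json

# Hypothetical defaults; the real crawler's option names may differ.
DEFAULTS = {"source_id": None, "drop_unindexed": True}


def load_settings(argv):
    """Merge defaults, an optional config file, and command-line overrides."""
    parser = argparse.ArgumentParser(description="crawler settings sketch")
    parser.add_argument("--config", help="path to a JSON config file")
    parser.add_argument("--source-id", type=int, dest="source_id")
    # default=None lets us tell "flag not given" apart from an explicit choice.
    parser.add_argument("--keep-unindexed", action="store_false",
                        dest="drop_unindexed", default=None)
    args = parser.parse_args(argv)

    settings = dict(DEFAULTS)
    if args.config:
        with open(args.config, encoding="utf-8") as fh:
            settings.update(json.load(fh))
    # Only flags the user actually passed override the file values.
    for key in ("source_id", "drop_unindexed"):
        value = getattr(args, key)
        if value is not None:
            settings[key] = value
    return settings
```

With this layering, a routine crawl could be described entirely in the file, while a one-off run could still override a single value from the command line.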
I think I may have misunderstood what has been happening with the crawler source table. I just pulled a fresh copy of the `nldi-db` repo into a pristine docker environment. `docker-compose up demo` stands up a working database. But the contents of the crawler source table are confusing me:
Of specific interest to me are the suffixes and the crawler source ID integers. Were those integers going to be renumbered starting from 1 with no skips?
I have ported the logic from the java crawler into python. Mostly, this is just arranging different framing around the SQL lifted directly from the java repo.
It is a minor security risk to allow "raw" SQL to execute (injection concerns)... but this code has very limited ability for users to affect the variables, so it is reasonably insulated from such attacks.
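For reference, a minimal sketch of how the injection surface could be narrowed further, under stated assumptions. Values are bound as query parameters, and anything that has to be interpolated as an identifier (placeholders cannot stand in for table names) is checked against a strict pattern first. The `feature_<suffix>` table naming, the `crawler_source_id` column, and both helper functions are hypothetical. The standard library's `sqlite3` stands in for the real database driver, whose placeholder syntax may differ.

```python
import re
import sqlite3

# Accept only plain lowercase identifiers as a source suffix.
_SUFFIX_RE = re.compile(r"[a-z][a-z0-9_]*")


def feature_table(suffix):
    """Return a per-source table name, rejecting anything but a plain identifier."""
    if not _SUFFIX_RE.fullmatch(suffix):
        raise ValueError(f"unsafe source suffix: {suffix!r}")
    return f"feature_{suffix}"


def count_features(conn, suffix, source_id):
    """Count rows for one source in its (hypothetical) per-source table."""
    table = feature_table(suffix)  # identifier: validated, then interpolated
    sql = f"SELECT COUNT(*) FROM {table} WHERE crawler_source_id = ?"
    # value: bound by the driver, never formatted into the SQL string
    return conn.execute(sql, (source_id,)).fetchone()[0]
```

The SQL text itself can still be lifted from the Java repo; only the way variables reach it changes.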
In terms of testing -- I was only able to match three features from source 11 (geoconnex contribution demo sites) against the NHD data in the NHDplus artifact at https://github.com/internetofwater/nldi-db/releases/download/artifacts-2.0.0/
Looking for domain experts to help me understand if that is the expected result. @dblodgett-usgs
I drop all ingested features with COMID=0 after the crawl. This is not optional (yet).
Currently, the crawler keeps all records that are read in, whether they get indexed or not. It should instead keep only data that indexes to a comid.
When a crawl finishes, no rows with NULL comids should remain in the NLDI database. This could be made configurable but default to drop un-indexed features.
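The post-crawl cleanup described above could look roughly like the following sketch. It is an assumption, not the crawler's actual code: the `feature` table, its `crawler_source_id` and `comid` columns, and the `keep_unindexed` switch are hypothetical names, and `sqlite3` stands in for the real database so the example is self-contained. It treats both NULL and 0 as "not indexed", matching the two cases mentioned in this thread.

```python
import sqlite3


def drop_unindexed(conn, source_id, keep_unindexed=False):
    """Delete one source's features that did not index to a comid.

    Returns the number of rows removed (0 when keep_unindexed is set).
    """
    if keep_unindexed:
        return 0
    cur = conn.execute(
        "DELETE FROM feature "
        "WHERE crawler_source_id = ? AND (comid IS NULL OR comid = 0)",
        (source_id,),
    )
    conn.commit()
    return cur.rowcount
```

Defaulting `keep_unindexed` to `False` makes dropping the standard behavior while leaving room for the configurable variant.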