internetofwater / nldi-crawler

Network Linked Data Index Crawler
https://labs.waterdata.usgs.gov/about-nldi/

Only keep records that have been indexed to a catchment/flowline ID. #220

Open dblodgett-usgs opened 1 year ago

dblodgett-usgs commented 1 year ago

Currently, the crawler keeps every record it reads in, whether or not that record gets indexed. The crawler should instead keep only records that index to a comid.

When a crawl finishes, no rows with NULL comids should remain in the NLDI database. This could be made configurable, but the default should be to drop un-indexed features.
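The cleanup described above can be sketched as a post-crawl delete step. This is a minimal illustration using an in-memory SQLite database; the table name `feature`, its columns, and the `drop_unindexed` flag are all hypothetical, not the actual NLDI schema or crawler API.

```python
import sqlite3

# Hypothetical schema: "feature" and "comid" are illustrative names only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature (identifier TEXT, comid INTEGER)")
conn.executemany(
    "INSERT INTO feature VALUES (?, ?)",
    [("site-a", 13293970), ("site-b", None), ("site-c", 13293750)],
)

def prune_unindexed(conn, drop_unindexed=True):
    """Remove features that never matched a catchment/flowline (NULL comid).

    The drop_unindexed flag models the configurable behavior the issue
    proposes, defaulting to dropping un-indexed features.
    """
    if drop_unindexed:
        conn.execute("DELETE FROM feature WHERE comid IS NULL")
    return conn.execute("SELECT COUNT(*) FROM feature").fetchone()[0]

remaining = prune_unindexed(conn)
print(remaining)  # 2 -- only the indexed features survive
```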

gzt5142 commented 1 year ago

Will tackle this issue this week.

As a prerequisite, I need a fresh copy of the demo database so I can be sure I'm working against the current standard schema and content.

As we add configuration options, it may be worth discussing how the crawler is invoked. Right now, it is run from the Linux command line, with --option style flags for altering default behavior. I wonder if it makes sense to put all configuration into a yml or similar input file.
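For discussion, such an input file might look something like the sketch below. Every key name here is illustrative only; this is not an existing nldi-crawler format.

```yaml
# Hypothetical crawler config file -- key names are illustrative,
# not an existing nldi-crawler format.
database:
  host: localhost
  port: 5432
  name: nldi
crawl:
  source_id: 11
  drop_unindexed: true   # default per issue #220
  log_level: info
```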

gzt5142 commented 1 year ago

I think I may have misunderstood what has been happening with the crawler source table... I just pulled a fresh copy of the nldi-db repo into a pristine Docker environment. `docker-compose up demo` stands up a working database, but the contents of the crawler source table are confusing me:

[screenshot: contents of the crawler source table]

Of specific interest to me are the suffixes and the crawler source ID integers. Were those integers going to be re-ranged to start from 1 with no gaps?

gzt5142 commented 1 year ago

I have ported the logic from the java crawler into python. Mostly, this is just arranging different framing around the SQL lifted directly from the java repo.

It is a minor security risk to allow "raw" SQL to execute (injection concerns), but users have very limited ability to affect the variables in this code, so it is reasonably insulated from such attacks.
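A common mitigation for the injection concern is to pass any user-influenced value as a bound parameter rather than interpolating it into the SQL string. Here is a minimal sketch using Python's stdlib `sqlite3` (the real crawler targets PostgreSQL, and the table/column names are illustrative, not the actual NLDI schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crawler_source (crawler_source_id INTEGER, source_name TEXT)"
)
conn.execute("INSERT INTO crawler_source VALUES (11, 'geoconnex demo sites')")

# Unsafe pattern: string interpolation lets input alter the SQL itself,
# e.g. source_id = "11 OR 1=1" would select every row:
#   conn.execute(f"SELECT ... WHERE crawler_source_id = {source_id}")

# Safe pattern: the driver binds the value; it can never be parsed as SQL.
source_id = 11
row = conn.execute(
    "SELECT source_name FROM crawler_source WHERE crawler_source_id = ?",
    (source_id,),
).fetchone()
print(row[0])  # geoconnex demo sites
```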

In terms of testing -- I was only able to match three features from source 11 (geoconnex contribution demo sites) against the NHD data in the NHDPlus artifact at https://github.com/internetofwater/nldi-db/releases/download/artifacts-2.0.0/

Looking for domain experts to help me understand if that is the expected result. @dblodgett-usgs

I drop all ingested features with COMID=0 after the crawl. This is not optional (yet).