internetofwater / nldi-crawler

Network Linked Data Index Crawler
https://labs.waterdata.usgs.gov/about-nldi/

Rewrite crawler in more accessible language. #193

Closed dblodgett-usgs closed 1 year ago

dblodgett-usgs commented 1 year ago

I want to write a dockerized version of the crawler in R so I can contribute crawler code myself. Having a pattern for both Python and R would be really nice. I doubt it would be too heavy a lift, but I need to do a little research on how it would work out.

EthanGrahn commented 1 year ago

The logic for either will be simple as long as you have a SQL library to use. After retrieving the GeoJSON file, the important part is the SQL in the src/main/resources/mybatis directory. Those files contain the PostGIS logic for the different ingestion types.
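For anyone porting that logic, the core of those mappers is GeoJSON in, PostGIS geometry out. A minimal sketch of that pattern in Python with psycopg2 (the table and column names below are placeholders; the real schema is whatever the mybatis XML files reference):

```python
import psycopg2

# Placeholder DSN -- the real connection details come from the NLDI deployment.
conn = psycopg2.connect("dbname=nldi user=nldi_user")

def ingest_point_feature(feature: dict, crawler_source_id: int) -> None:
    """Insert one GeoJSON point feature, letting PostGIS build the geometry."""
    lon, lat = feature["geometry"]["coordinates"]
    props = feature["properties"]
    # Hypothetical table/column names for illustration only.
    sql = """
        INSERT INTO nldi_data.feature (crawler_source_id, identifier, name, uri, location)
        VALUES (%s, %s, %s, %s, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
    """
    with conn.cursor() as cur:
        cur.execute(sql, (crawler_source_id, props.get("id"),
                          props.get("name"), props.get("uri"), lon, lat))
    conn.commit()
```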

dblodgett-usgs commented 1 year ago

Right on -- I imagine it will be pretty straightforward to re-implement and execute via an R or Python CLI. I need to figure out how to trigger the run with the same hook, but that can't be that hard?

EthanGrahn commented 1 year ago

I'd recommend implementing it in Python. Last I knew, the policies the WMA is putting together mandate a set of languages for applications; Python is on that list, but R is not. The current Dockerfile passes the env var through as a command-line argument, but you could honestly just check for the environment variable at the start of the script and get the same effect.
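Something like this at the top of the Python script would give the same effect as the current Dockerfile approach (the variable name here is just a placeholder for whatever the image actually passes through):

```python
import os
import sys

# CRAWLER_SOURCE_ID is a hypothetical name -- substitute whatever env var the
# existing Docker image sets for the Java crawler.
source_id = os.environ.get("CRAWLER_SOURCE_ID")

if source_id is None and len(sys.argv) > 1:
    # Fall back to a command-line argument, matching the current behavior of
    # passing the value through on the command line.
    source_id = sys.argv[1]

if source_id is None:
    sys.exit("No crawler source id provided; set CRAWLER_SOURCE_ID or pass it as an argument.")

print(f"Crawling source {source_id}")
```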

dblodgett-usgs commented 1 year ago

Huh -- that seems problematic... I'll have to find out what the parameters on that kind of mandate are. Thanks.

EthanGrahn commented 1 year ago

My understanding is that it's meant to make it easier to move developers between projects, keep the operations knowledge scope manageable, and simplify hiring new devs. It's also specific to deployed apps; R can still be used in other situations.

gzt5142 commented 1 year ago

RE: A Python implementation

Some notes on how I would move forward on a Python port of this repo...

SQL:

Spring Framework:

Triggers:

Porting Data Structures and Algorithms:

Dev environment:

dblodgett-usgs commented 1 year ago

This sounds great. I wasn't aware that Spring and MyBatis were actually used in the crawler. My design priority here is to make it as blatantly procedural and intuitive as is reasonable, so that a future developer can see exactly what it's doing without needing to traverse an object structure and untangle any abstraction.

I'd say carry on with the experiment in a fork. Maybe just implement one of the two crawler methods as a first cut?

gzt5142 commented 1 year ago

I'd say carry on with the experiment in a fork.

Forked to https://github.com/gzt5142/nldi-crawler-py . I'll update this issue in the upstream repo as I reach milestones on that fork.

--gt

gzt5142 commented 1 year ago

I'm finding that the source table has some broken links in it. I'm pulling the crawler_sources table from the demo database, and the URIs it contains are not all good:

gzt5142 commented 1 year ago

The read timeout for the original Java crawler is 15 seconds. We can raise it to help get past slow responses from some servers, but I'd recommend against setting it over 30 seconds.
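Assuming the Python port uses requests for the downloads (just one option), the timeout is a keyword argument on the call; a small sketch:

```python
import requests

# 15 seconds matches the original Java crawler's read timeout; bumping it
# toward 30 seconds is about as far as I'd want to go for slow source servers.
READ_TIMEOUT = 15

def download_source(url: str) -> dict:
    """Fetch a crawler source's GeoJSON, failing fast on unresponsive servers."""
    # The (connect, read) tuple keeps connection failures snappy while still
    # allowing a slower read of large GeoJSON responses.
    response = requests.get(url, timeout=(5, READ_TIMEOUT))
    response.raise_for_status()
    return response.json()
```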

dblodgett-usgs commented 1 year ago

Interesting finding. We should identify some core test data -- these are all big production datasets that don't really need to be what we use for testing. Do you have any that are working as expected right now?

gzt5142 commented 1 year ago

Do you have any that are working as expected right now?

Yes. Most are behaving as expected.

The redirects mentioned above are not super serious (we can still get at the data after the redirect), but who knows how long the server-side redirect will persist. At some point, the link will 404.
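Again assuming requests, it's cheap to flag sources that only resolve through a redirect so we can fix the URI in crawler_sources before the old link goes away. For illustration:

```python
import requests

def check_for_redirects(url: str) -> None:
    """Warn when a crawler source URI only resolves through a redirect."""
    response = requests.get(url, timeout=30, allow_redirects=True)
    if response.history:  # non-empty when one or more redirects were followed
        print(f"WARNING: {url} redirects to {response.url}; "
              "consider updating crawler_sources to the final URL.")
```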

I've also since learned that some of the metadata in the crawler_sources table is out of sync with the feature data returned from a source (column names being the most difficult to reconcile).

I have the data I need to do unit and integration testing, and even some end-to-end testing for some of the datasets in the crawler_source table. But it does seem that we should review the source table for current/correct/usable data. I think I'm going to add functionality to validate sources (data returns, column names are correct, etc.) without actually updating the production database. It would be easy to implement and would let us catch source data issues easily.
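For illustration, the shape of the check I have in mind is roughly the following. The helper and column names here are placeholders; the real expectations would come from the crawler_sources metadata:

```python
import requests

def validate_source(source: dict) -> tuple[bool, str]:
    """Check one crawler source without writing anything to the database.

    `source` is assumed to be a row from crawler_sources carrying the feature
    URI and the column names the crawler expects to find (hypothetical keys).
    """
    try:
        response = requests.get(source["source_uri"], timeout=(5, 15))
        response.raise_for_status()
        data = response.json()
    except requests.Timeout:
        return False, "Network Timeout"
    except (requests.RequestException, ValueError):
        return False, "Invalid JSON or unreachable source"

    features = data.get("features", [])
    if not features:
        return False, "No features returned"

    # Verify the column names recorded in crawler_sources actually exist in
    # the returned properties (this is where 'nhdpv2_REACHCODE' would fail).
    expected = {source["feature_id"], source["feature_name"], source["feature_reach"]}
    missing = expected - set(features[0]["properties"])
    if missing:
        return False, f"Column not found: {', '.join(sorted(missing))}"

    return True, "OK"
```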

gzt5142 commented 1 year ago

I think I have all of the functional pieces working on the bench (i.e. in a Jupyter notebook), except for one piece relating features to NHD+ data. I'll work to get these functional pieces into a usable CLI before our next check-in.
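As a rough sketch of the CLI shape (click here, with placeholder command bodies; not the final interface):

```python
import click

@click.group()
def cli() -> None:
    """Thin command-line wrapper around the crawler's functional pieces."""

@cli.command()
@click.argument("source_id", type=int)
def validate(source_id: int) -> None:
    """Check a crawler source without touching the database."""
    # Placeholder: call the validation logic sketched earlier in this thread.
    click.echo(f"Checking source {source_id}...  [PASS]")

@cli.command()
@click.argument("source_id", type=int)
def download(source_id: int) -> None:
    """Fetch a source's GeoJSON to a local file for inspection."""
    # Placeholder: call the download logic here.
    click.echo(f"Source {source_id} downloaded.")

if __name__ == "__main__":
    cli()
```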

gzt5142 commented 1 year ago

Some examples of the nldi-cli command as currently implemented are available in the usage document over at the active fork:

https://github.com/gzt5142/nldi-crawler-py/blob/python-port/docs/usage.md

A couple of thoughts on validating sources....

> nldi-cli validate 10
Checking Vigil Network Data...  [FAIL] : Column not found for 'feature_reach' : nhdpv2_REACHCODE

> nldi-cli download 10
Source 10 downloaded to /home/trantham/nldi-crawler-py/CrawlerData_10_w2_08yh5.geojson

> nldi-cli validate 1
Checking Water Quality Portal...  [FAIL] : Network Timeout

> nldi-cli validate 2
Checking HUC12 Pour Points...  [FAIL] : Invalid JSON

> nldi-cli validate 13
Checking geoconnex contribution demo sites...  [PASS]

The demo database from nldi-db contains a crawler_source table with some unusable entries (I assume this is inadvertent). That gives us some useful test cases to be sure we trap for them:

dblodgett-usgs commented 1 year ago

Right on! Interesting that the demo database has the invalid sources. Are they also invalid in https://github.com/internetofwater/nldi-db/blob/master/liquibase/changeLogs/nldi/nldi_data/update_crawler_source/crawler_source.tsv ?

Do you think we are at a point where we should try wiring this up in Docker and testing it out on the dev NLDI instance?

gzt5142 commented 1 year ago

Are they also invalid in ...

Yes, there are errors in the tsv as well. I suspect the source services just changed their output; we should update the table to adapt to those changes.

I have one major hurdle before I'm ready to try on a dev instance for our first end-to-end test: the plumbing to 'connect' ingested features to the NHD basins in the NLDI database. I should be able to sort that out this week (fingers crossed), and then we can talk about what it would look like to run a test against dev.
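For reference, that missing piece amounts to a point-in-polygon update against the catchment geometry. A rough sketch, with hypothetical schema and table names (the authoritative SQL lives in the Java crawler's mybatis mappers and the nldi-db changelogs):

```python
import psycopg2

# Hypothetical schema/table/column names for illustration only.
LINK_FEATURES_SQL = """
    UPDATE nldi_data.feature f
       SET comid = c.featureid
      FROM nhdplus.catchmentsp c
     WHERE f.crawler_source_id = %(source_id)s
       AND f.comid IS NULL
       AND ST_Covers(c.the_geom, f.location)
"""

def link_features_to_catchments(conn, source_id: int) -> int:
    """Attach each ingested point feature to the NHD catchment that contains it."""
    with conn.cursor() as cur:
        cur.execute(LINK_FEATURES_SQL, {"source_id": source_id})
        updated = cur.rowcount
    conn.commit()
    return updated

if __name__ == "__main__":
    connection = psycopg2.connect("dbname=nldi user=nldi_user")  # placeholder DSN
    print(link_features_to_catchments(connection, source_id=13))
```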