Rewrite crawler in more accessible language.

dblodgett-usgs commented 1 year ago

I want to write a dockerized version of the crawler in R so I can contribute crawler code myself. Having a pattern for both python and R would be really nice. I doubt it would be too heavy a lift but need to do a little research on how it would work out.

EthanGrahn commented 1 year ago

The logic for either will be simple as long as you have a SQL library to use. After retrieving the GeoJSON file, the important part is the SQL on the src/main/resources/mybatis directory. Those files have the PostGIS logic for different ingestion types.

dblodgett-usgs commented 1 year ago

Right on -- I imagine it will be pretty straight forward to re-implement and execute via an R or python CLI. I need to figure out how to trigger the run with the same hook, but that can't be that hard?

EthanGrahn commented 1 year ago

I'd recommend implementing it in Python. Last I knew, the WMA's policies they are putting together mandate a set of languages for apps and Python is on that list but R is not. The current dockerfile passes the env var as a command line variable, but you could honestly just check for the environment variable at the start of the script and you'll get the same effect.

dblodgett-usgs commented 1 year ago

Huh -- that seems problematic... I'll have to find out what the parameters on that kind of mandate are. Thanks.

EthanGrahn commented 1 year ago

My understanding is that it's to help ease developer project switching, operations knowledge scope, and hiring new devs. This is also specific to deployed apps. R can still be used for other situations.

gzt5142 commented 1 year ago

RE: A Python implementation

Some notes on how I would move forward on a python port of this repo...

SQL:

The mappings that @EthanGrahn mentions (using mybatis) are a key concern for the Python port. Mybatis does not have a direct python equivalent (that I can find). I can find some small projects that attempt to translate mybatis xml mappings, but these don't seem to be appropriate for enterprise-quality software. It seems like the best way to achieve similar functionality is with SQLAlchemy https://www.sqlalchemy.org/
SQLAlchemy has a large user base and stable support channels.
SQLAlchemy is new territory for me, so will be a bit of a learning curve to use it to replicate the java/mybatis functionality.
SQLAlchemy has a full Object Relational Mapper (ORM), which is not quite a 1:1 replacement for mybatis. In theory it is not as lightweight as mybatis, but what that means for this project I can't say just yet. I don't expect performance to suffer, but complexity of configuration may be more intense.

Spring Framework:

Spring is not itself available for Python projects.
I have not finished the complete code review to see all the ways Spring is used in this app... it seems like the core functionality will need to be replicated with pieces from the python standard library.
The Spring framework seems to be mainly used to leverage Repository and DAO objects tied to the database connections.
I will need some more research time to figure how the functionality can be best replicated in a python project. I'm a novice Java programmer, so Spring is yet more new territory for me to figure out.
My hope is that the SQLAlchemy python library will be able to provide suitable abstractions.
I notice that Flask has some functionality that might be useful here. Flask's ORM extensions (including sqlalchemy) are worth a look to manage something approximating Spring's @Repository capability. Flask's main purpose is related to web apps, so may not be a good fit here (TBD).

Triggers:

Easily executed in a docker container via any number of mechanisms: github/gitlab action; kbatch submission to a kubernetes cluster; using singularity on the HPC... lots of options.
Could also run as a batch job on existing system... or manual execution in the correct conda environment via command line on HPC or desktop.

Porting Data Structures and Algorithms:

The object-oriented design in Java is likely to translate easily to python classes, with exceptions/complications noted for the Spring-specific decorators used for some classes.
Python support for multi-threading and multi-processing is an option for this project. The bottleneck there will likely be the DB server where processed data is inserted. While an interesting enhancement, I think only worth considering after base functionality.

Dev environment:

The functionality that maven provides for java builds is handled here with nox:
- nox (build manager)
- poetry (environment and dependency management)
- pylint - standard linter
- pytype - type checking
- black - formatting standard
- pytest - testing framework (unit and integration tests)
- sphinx - documentation w/ RTD templating
I propose to fork the nldi-crawler repo to do the first round of experimentation and development without risking the production codebase. Can merge that fork back to production upon reaching functional parity.

dblodgett-usgs commented 1 year ago

This sounds great. I Wasn't aware that spring and mybatis were actually used in the crawler. My design priority with this is to make it as blatantly procedural and intuitive as is reasonable to ensure a future developer can see exactly what it's doing without needing to traverse an object structure and figure out any abstraction.

I'd say carry on with the experiment in a fork. Maybe just implement one of the two crawler methods as a first cut?

gzt5142 commented 1 year ago

I'd say carry on with the experiment in a fork.

Forked to https://github.com/gzt5142/nldi-crawler-py . I'll update this issue in the upstream repo as I reach milestones on that fork.

--gt

gzt5142 commented 1 year ago

I'm finding that the source table has some rotten links in it. I'm pulling the crawler_sources table from the demo database. The URIs contained within are not all good:

Source 1
- https://www.waterqualitydata.us/data/Station/search?mimeType=geojson&minactivities=1&counts=no
- Read Timeout after 15 seconds
Source 2
- https://www.sciencebase.gov/catalogMaps/mapping/ows/57336b02e4b0dae0d5dd619a?service=WFS&version=1.0.0&request=GetFeature&srsName=EPSG:4326&typeName=sb:fpp&outputFormat=json
- NullPointer exception on server. No data returned
Source 6
- This source has moved. The server redirects to:
- https://www.hydroshare.org/django_irods/rest_download/5f665b7b82d74476930712f7e423a0d2/data/contents/wade.geojson/?url_download=False&zipped=False&aggregation=False
Source 7
- Also moved. The server redirects to:
- https://www.hydroshare.org/django_irods/rest_download/3295a17b4cc24d34bd6a5c5aaf753c50/data/contents/nldi_gages.geojson/?url_download=False&zipped=False&aggregation=False
Source 12
- https://locations.newmexicowaterdata.org/collections/Things/items?f=json&limit=100000
- Read timeout after 15 seconds

gzt5142 commented 1 year ago

The read timeout for the original Java crawler is 15 seconds. Can up this to perhaps overcome the slow response from servers. Recommend against setting it over 30 seconds.

dblodgett-usgs commented 1 year ago

Interesting finding. We should identify some core test data. These are all the big production datasets that don't really need to be used for testing. Do you have any that are working as expected right now?

gzt5142 commented 1 year ago

Do you have any that are working as expected right now?

Yes. Most are behaving as expected.

The redirects mentioned above are not super serious (we can get at the data after redirect), but who knows how long the server redirect will persist. At some point, the link will 404.

I've also since learned that some of the metadata in the crawler_sources table is out of sync with the feature data returned from a source (column names being the most difficult to reconcile).

I have the data I need to do unit and integration testing, and even some end-2-end for some of the datasets in the crawler_source table. But it does seem that we should review the source table for current/correct/usable data. I think I'm going to put in an extra functionality to validate sources (data returns, column names are correct, etc) without actually updating the production database. Would be easy to implement and would let us know about source data issues easily.

gzt5142 commented 1 year ago

I think I have all of the functional pieces working on the bench (i.e. in a jupyter notebook), except for one piece relating features to NHD+ data. I'll work to have these functional pieces into a usable CLI before our next checkin.

gzt5142 commented 1 year ago

Some examples of current function for the command nldi-cli as currently implemented are available in the usage document over at the active fork:

https://github.com/gzt5142/nldi-crawler-py/blob/python-port/docs/usage.md

A couple of thoughts on validating sources....

> nldi-cli validate 10
Checking Vigil Network Data...  [FAIL] : Column not found for 'feature_reach' : nhdpv2_REACHCODE

> nldi-cli download 10
Source 10 downloaded to /home/trantham/nldi-crawler-py/CrawlerData_10_w2_08yh5.geojson

> nldi-cli validate 1
Checking Water Quality Portal...  [FAIL] : Network Timeout

> nldi-cli   validate 2
Checking HUC12 Pour Points...  [FAIL] : Invalid JSON

> nldi-cli validate 13
Checking geoconnex contribution demo sites...  [PASS]

By accident (I assume it is inadvertent), the demo database from nldi-db contains a crawler_source table with some unusable entries. This provides some useful test cases to be sure we trap for them:

Source number 1 (WQP), as mentioned above, times out; note that it fails validation checks from the CLI.
Source 10 (Vigil Network Data) supplies data, but the crawler_source information is incorrect. The crawler source table says that the Vigil data has a column/property called nhdpv2_REACHCODE where we can find the feature_reach property. But there is no such column/property in the returned data from Vigil. So we can't use that data source as currently configured.
Source 13 returns not-GeoJSON (it gives us XML). We can't use that either.

dblodgett-usgs commented 1 year ago

Right on! Interesting that the demo database has the invalid sources. Are they also invalid in https://github.com/internetofwater/nldi-db/blob/master/liquibase/changeLogs/nldi/nldi_data/update_crawler_source/crawler_source.tsv ??

Do you think we are at a place that we should try wiring this up in docker and testing it out on the dev NLDI instance?

gzt5142 commented 1 year ago

Are they also invalid in ...

Yes. Errors in the tsv also. I suspect that the source services just changed their output.. we should change the table to adapt to those changes.

I have one major hurdle before I'm ready to try on a dev instance for our first end-to-end test: The plumbing to 'connect' ingested features to the NHD basins in the NLDI database. I'll be able to sort that out this week (fingers crossed), then we can talk about what it would look like to run a test against dev.

internetofwater / nldi-crawler

Rewrite crawler in more accessible language. #193