internetofwater / nldi-db

Network Linked Data Index Database Component
https://waterdata.usgs.gov/blog/nldi-intro/
Creative Commons Zero v1.0 Universal

Fix URLs in crawler-source so they are all functional. #97

Closed by dblodgett-usgs 1 year ago

dblodgett-usgs commented 1 year ago

As noted in work on the crawler, some URLs don't work.

gzt5142 commented 1 year ago

One of the issues with the crawler_source table is that, for some sources, the field/column names don't match those in the GeoJSON the source returns. Example: the Feature ID field for source 13 is listed as id, but the GeoJSON returned by that source actually calls that field fid.

I corrected the field names that I found in testing. Those changes are in commits to branch gt-097-source-table-fixes. I propose committing any other corrections to that branch so that a single PR fixes all errors.
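For anyone who wants to reproduce this kind of check, here is a minimal sketch of comparing configured field names against what a source actually returns. It is not the crawler's code: the function name `missing_fields` and the sample data are hypothetical, and it assumes the configured fields live under each feature's `properties` object.

```python
def missing_fields(feature_collection, configured_fields):
    """Return the configured field names that no feature's properties contain."""
    features = feature_collection.get("features", [])
    missing = []
    for field in configured_fields:
        # Assumption: fields are looked up in each feature's "properties" object.
        found = any(field in (feat.get("properties") or {}) for feat in features)
        if not found:
            missing.append(field)
    return missing


# Hypothetical sample mimicking a source that uses "fid" instead of "id".
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"fid": 101, "name": "Example site"},
            "geometry": None,
        }
    ],
}

print(missing_fields(sample, ["id", "name"]))
```

With the sample above, only `id` is reported as missing, which is the same kind of mismatch described for source 13.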

dblodgett-usgs commented 1 year ago

This would be really good to get fixed up.

I'm super swamped; would you be willing to list the URLs that are not working?

I see this one timing out: https://www.waterqualitydata.us/data/Station/search?mimeType=geojson&minactivities=1&counts=no

@jkreft-usgs may be able to help with an updated URL for that one?

gzt5142 commented 1 year ago

Two sources time out: Sources 1 and 12.

I am using a network timeout of 30 seconds. The original Java port timed out at 15 seconds. I lengthened the timeout to make sure that my network connection, etc. was not affecting the results.

> nldi-cli validate 1
1 : Checking Water Quality Portal...  [FAIL] : Network Timeout

> nldi-cli display 1
ID= 1 :: Water Quality Portal
  Source Suffix  : WQP
  Source URI     : https://www.waterqualitydata.us/data/Station/search?mimeType=geojson&minactivities=1&counts=no
  Feature ID     : MonitoringLocationIdentifier
  Feature Name   : MonitoringLocationName
  Feature URI    : siteUrl
  Feature Reach  : None
  Feature Measure: None
  Ingest Type    : point
  Feature Type   : varies

> nldi-cli validate 12
12 : Checking New Mexico Water Data Initative Sites...  [FAIL] : Network Timeout

> nldi-cli display 12
ID=12 :: New Mexico Water Data Initative Sites
  Source Suffix  : nmwdi-st
  Source URI     : https://locations.newmexicowaterdata.org/collections/Things/items?f=json&limit=100000
  Feature ID     : id
  Feature Name   : name
  Feature URI    : geoconnex
  Feature Reach  : None
  Feature Measure: None
  Ingest Type    : point
  Feature Type   : point

gzt5142 commented 1 year ago

One source returns a response that is not JSON:

Source number 2:

> nldi-cli validate 2
2 : Checking HUC12 Pour Points...  [FAIL] : Invalid JSON

> nldi-cli display 2
ID= 2 :: HUC12 Pour Points
  Source Suffix  : huc12pp
  Source URI     : https://www.sciencebase.gov/catalogMaps/mapping/ows/57336b02e4b0dae0d5dd619a?service=WFS&version=1.0.0&request=GetFeature&srsName=EPSG:4326&typeName=sb:fpp&outputFormat=json
  Feature ID     : HUC_12
  Feature Name   : HUC_12
  Feature URI    : HUC_12
  Feature Reach  : None
  Feature Measure: None
  Ingest Type    : point
  Feature Type   : hydrolocation

> nldi-cli  download 2
Source 2 downloaded to /home/trantham/nldi-crawler-py/CrawlerData_2_p2wukdth.geojson

> more CrawlerData_2_p2wukdth.geojson
<?xml version="1.0" ?>
<ServiceExceptionReport
   version="1.2.0"
   xmlns="http://www.opengis.net/ogc"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.opengis.net/ogc https://sciencebase.gov/catalog/wfs/1.0.0/OGC-exception.xsd">
   <ServiceException>
      java.lang.NullPointerException
null
</ServiceException></ServiceExceptionReport>

This looks like a server-side exception in the HTTP service.
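A rough sketch of how a downloaded body could be told apart from valid GeoJSON before ingest follows. This is an illustration only, not the crawler's validation logic: `classify_body` and its return labels are hypothetical names, and the XML check is a simple substring match on the `ServiceExceptionReport` element seen in the output above.

```python
import json


def classify_body(text):
    """Classify a response body as 'geojson', 'service-exception', or 'invalid'."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not JSON at all; check whether it is an OGC service exception document.
        if "<ServiceExceptionReport" in text:
            return "service-exception"
        return "invalid"
    # Valid JSON, but only a FeatureCollection is useful to an ingest step.
    if isinstance(data, dict) and data.get("type") == "FeatureCollection":
        return "geojson"
    return "invalid"


xml_body = '<?xml version="1.0" ?><ServiceExceptionReport version="1.2.0"></ServiceExceptionReport>'
print(classify_body(xml_body))
print(classify_body('{"type": "FeatureCollection", "features": []}'))
```

Distinguishing a service exception from other invalid bodies makes it easier to tell a broken source URL from a malformed payload.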

dblodgett-usgs commented 1 year ago

This: https://www.sciencebase.gov/catalogMaps/mapping/ows/57336b02e4b0dae0d5dd619a?service=WFS&version=1.0.0&request=GetFeature&srsName=EPSG:4326&typeName=sb:fpp&outputFormat=json

Should be: https://www.sciencebase.gov/catalogMaps/mapping/ows/5b4e25a6e4b06a6dd17e4879?service=WFS&version=1.0.0&request=GetFeature&srsName=EPSG:4326&typeName=sb:fpp&outputFormat=json

The New Mexico one came back for me with a longer timeout (try 60 seconds?).

The Water Quality Portal is a different situation. I'll follow up with Jim and see what we can find out.

gzt5142 commented 1 year ago

Good news.
A 60-second timeout has allowed me to download from source 12 (New Mexico).

AND I'm able to hit source 1 (WQP) also.

It seems I was too conservative by setting my timeout to 30 in my first try.
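As a sketch of what a configurable timeout check can look like, here is a version using only the Python standard library. It is not the code in nldi-crawler-py: `check_source`, its return labels, and the injectable `opener` parameter (there so the function can be exercised without a network) are all hypothetical.

```python
import socket
import urllib.error
import urllib.request


def check_source(url, timeout_seconds=60, opener=urllib.request.urlopen):
    """Return 'ok', 'timeout', or 'network-error' for one source URL."""
    try:
        with opener(url, timeout=timeout_seconds) as response:
            response.read(1024)  # read the first bytes to confirm data is flowing
        return "ok"
    except socket.timeout:
        # Raised directly when a read exceeds the timeout.
        return "timeout"
    except urllib.error.URLError as err:
        # A timeout during connection setup arrives wrapped in URLError.
        if isinstance(err.reason, socket.timeout):
            return "timeout"
        return "network-error"
```

Calling `check_source(url, timeout_seconds=60)` against a slow source would correspond to the 60-second setting discussed here. Note that the `urlopen` timeout applies to individual blocking operations such as the connection attempt, so a large download that keeps streaming data is not cut off at 60 seconds in total.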

gzt5142 commented 1 year ago

Now the bad news... data returned from source 12 (New Mexico) has brought to light a bug in the way we have been processing JSON features from other sources.

As this is a crawler issue (not a db issue), I'll track it as an issue in that repo. See https://github.com/gzt5142/nldi-crawler-py/issues/22

dblodgett-usgs commented 1 year ago

@jkreft-usgs -- so I guess we are good with the WQP service with a big timeout.

gzt5142 commented 1 year ago

I have been able to access the WQP data with timeouts above 30 seconds... currently, my default timeout in the crawler is 60 seconds.

So... with the minor updates to column names in the TSV file, I think we're all fixed. I'm not sure how the corrected TSV gets pushed to the production database... so that's the next step.