Closed by dblodgett-usgs 1 year ago
One issue with the crawler_source table is that, for some sources, the field/column names don't match the GeoJSON actually returned by the source. Example: the Feature ID field for source 13 was listed as `id`, but the GeoJSON returned by that source calls the field `fid`.
I corrected the field names I found in testing. Those changes are in commits on branch gt-097-source-table-fixes. I propose committing any other corrections to that branch so that a single PR fixes all the errors.
This would be really good to get fixed up.
I'm super swamped; would you be willing to list out the URLs that are not working?
I see this as timing out: https://www.waterqualitydata.us/data/Station/search?mimeType=geojson&minactivities=1&counts=no
@jkreft-usgs may be able to help with an updated URL for that one?
Two sources time out: sources 1 and 12.
I am using a network timeout of 30 seconds; the original Java port timed out at 15. I lengthened the timeout to ensure that my network connection, etc., was not affecting the results.
> nldi-cli validate 1
1 : Checking Water Quality Portal... [FAIL] : Network Timeout
> nldi-cli display 1
ID= 1 :: Water Quality Portal
Source Suffix : WQP
Source URI : https://www.waterqualitydata.us/data/Station/search?mimeType=geojson&minactivities=1&counts=no
Feature ID : MonitoringLocationIdentifier
Feature Name : MonitoringLocationName
Feature URI : siteUrl
Feature Reach : None
Feature Measure: None
Ingest Type : point
Feature Type : varies
> nldi-cli validate 12
12 : Checking New Mexico Water Data Initative Sites... [FAIL] : Network Timeout
> nldi-cli display 12
ID=12 :: New Mexico Water Data Initative Sites
Source Suffix : nmwdi-st
Source URI : https://locations.newmexicowaterdata.org/collections/Things/items?f=json&limit=100000
Feature ID : id
Feature Name : name
Feature URI : geoconnex
Feature Reach : None
Feature Measure: None
Ingest Type : point
Feature Type : point
One source returns something other than JSON: source 2.
> nldi-cli validate 2
2 : Checking HUC12 Pour Points... [FAIL] : Invalid JSON
> nldi-cli display 2
ID= 2 :: HUC12 Pour Points
Source Suffix : huc12pp
Source URI : https://www.sciencebase.gov/catalogMaps/mapping/ows/57336b02e4b0dae0d5dd619a?service=WFS&version=1.0.0&request=GetFeature&srsName=EPSG:4326&typeName=sb:fpp&outputFormat=json
Feature ID : HUC_12
Feature Name : HUC_12
Feature URI : HUC_12
Feature Reach : None
Feature Measure: None
Ingest Type : point
Feature Type : hydrolocation
> nldi-cli download 2
Source 2 downloaded to /home/trantham/nldi-crawler-py/CrawlerData_2_p2wukdth.geojson
> more CrawlerData_2_p2wukdth.geojson
<?xml version="1.0" ?>
<ServiceExceptionReport
version="1.2.0"
xmlns="http://www.opengis.net/ogc"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.opengis.net/ogc https://sciencebase.gov/catalog/wfs/1.0.0/OGC-exception.xsd">
<ServiceException>
java.lang.NullPointerException
null
</ServiceException></ServiceExceptionReport>
This looks like a server-side exception in the HTTP service.
The New Mexico one came back for me with a longer timeout (try 60 seconds?).
The Water Quality Portal is a different situation. I'll follow up with Jim and see what we can find out.
Good news.
A 60-second timeout has allowed me to download from source 12 (New Mexico),
AND I'm able to hit source 1 (WQP) as well.
It seems my 30-second timeout on the first try was too conservative.
Now the bad news... data returned from source 12 (new mexico) has brought to light a bug in the way we have been processing JSON features from other sources.
As this is a crawler issue (not a db issue), I'll address it as an issue in that repo. See https://github.com/gzt5142/nldi-crawler-py/issues/22
@jkreft-usgs -- so I guess we are good with the WQP service with a big timeout.
I have been able to access the WQP data with timeouts above 30 seconds; currently, my default timeout in the crawler is 60 seconds.
So... with the minor updates to the column names in the TSV file, I think we're all fixed. I'm not sure how the corrected TSV gets pushed to the production database... so that's the next step.
As noted in work on the crawler, some URLs don't work.