Open tcnichol opened 5 years ago
Can you do a query to show how many datapoints are affected.
And, which DB and which data points?
Updated issue to add which database and table.
The result of this query
select count(*) from datapoints where ST_Distance(geog, ST_SetSRID(ST_Point(-112.013473,33.039463),4326)::geography) > 10000;
is 18773153
and there are 137686191 datapoints total.
So that means about 18773153 points or approximately 13% of the records in datapoints are outside Maricopa.
I'll be trying out a fix on a much smaller table locally.
There are likely data points from Urbana, perhaps also some from St Louis and Manhattan, KS
From: Todd Nicholson notifications@github.com Sent: Thursday, May 23, 2019 5:25:29 PM To: terraref/computing-pipeline Cc: LeBauer, David Shaner - (dlebauer); Comment Subject: Re: [terraref/computing-pipeline] fix incorrect datapoints in datapoints table, geostreams db (#581)
The result of this query
select count(*) from datapoints where ST_Distance(geog, ST_SetSRID(ST_Point(-112.013473,33.039463),4326)::geography) > 10000;
is 18773153
and there are 137686191 datapoints total.
So that means about 18773153 points or approximately 13% of the records in datapoints are outside Maricopa.
I'll be trying out a fix on a much smaller table locally.
— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/terraref/computing-pipeline/issues/581?email_source=notifications&email_token=AADRPZ4UKHT54DWO3E3CQHTPW4YXTA5CNFSM4HOXJ5GKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWD2SRY#issuecomment-495429959, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AADRPZYEPMKND25YSA3KVDDPW4YXTANCNFSM4HOXJ5GA.
So I did some queries to find out where the datapoints in the db were.
Total number datapoints = 137686191 Near Maricopa = 118913038 Near Urbana = 181152
I ran this query based on the original issue https://github.com/terraref/computing-pipeline/issues/484
select count(*) from datapoints where ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography) < 5000;
The result was 18592001 entries. This accounts for all the entries in datapoints that are not in Arizona or near Urbana.
So i got one of these entries. The gid is 164759042 and the source dataset is https://terraref.ncsa.illinois.edu/clowder/datasets/5b3040334f0c2b43c04f962a
and the geog is
0103000020E61000000100000004000000DFF73902A68940401D402BD7990151C0DFF73902A6894040BE7C55019A0151C0241700A2A5894040BE7C55019A0151C0241700A2A58940401D402BD7990151C0
i wrote a python script to get lat/lon from geography. It uses shapely.wkb and this function below returns something (Point, Multigon, etc) with lat/lon within AZ.
wkb.loads(t_geog, hex=True)
But when I use a geography from one of the datapoints that are 'bad' an error is thrown:
shapely.errors.WKBReadingError: Could not create geometry because of errors while reading input
The same error is thrown for other datapoints not in AZ or near Urbana. This sort of makes sense since the query to get them uses the lat,lon order instead of lon, lat order.
The datasets that are included in the data field for these datapoints entries have correct lat lon, so correct geography can be created from that. I'm not sure how it's generated for this table, but figured I'd report on what I found so far.
@tcnichol keep in mind you can use ST_X() and ST_Y() to parse the geog. I use something like SELECT ST_Y(ST_Centroid(geog)) AS lat, ST_X(ST_Centroid(geog)) AS lon FROM sensors
to get the latitude and longitude of the points.
I had to use the ST_AsText on the geog but this one worked - those points have the lat/lon that is in the Atlantic.
Again, I'm not sure the best way to fix this since I don't know how this database table is populated. If it's done by an extractor then resubmitting datasets might work. If we assume an offset based on the lon summing up to 180, then a script to update the db directly could work.
I created a datapoints table locally and added points that replicated the error found in datapoints. This query updates the lat/lon
update datapoints
set geog = ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_Centroid( ST_AsText(geog))), ST_X(ST_Centroid( ST_AsText(geog)))), 4326)::geography
where ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography) < 5000;
After this, all the points that were formerly in the Atlantic have lat/lon within Maricopa.
My local db table was very small (just 10 datapoints) so I am not sure how long this would take. I could also copy datapoints, and run this update on the copy.
please do an explain on this query (https://www.postgresql.org/docs/9.1/sql-explain.html) and we can get a sense of how it will be executed so we can see what indices exist, etc.
running this but it seems to take a very long time. Will post the results once it is done.
explain analyze verbose
update datapoints
set geog = ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_Centroid( ST_AsText(geog))), ST_X(ST_Centroid( ST_AsText(geog)))), 4326)::geography
where ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography) < 5000;
It stopped but I got this error :
ERROR: geometry contains non-closed rings
HINT: "...,33.0756200671313 -68.0250750753778))" <-- parse error at position 150 within geometry
Will be looking into this.
I found a query that works. I use ST_MakeValid and then take a centroid of the geography points. Many of them are polygons that are not closed. This 'explain' query might help :
explain
update datapoints
set geog = ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_Centroid( ST_MakeValid(geog::geometry))), ST_X(ST_Centroid( ST_MakeValid(geog::geometry)))), 4326)::geography
where ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography) < 5000;
Will check on the server.
according to the docs online, this should check the time for update without actually updating the table.
begin;
explain analyze
update datapoints
set geog = ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_AsText(
ST_Centroid(ST_MakeValid(geog::geometry)))),
ST_X(ST_AsText(ST_Centroid(ST_MakeValid(geog::geometry))))), 4326)::geography
where ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography) < 5000;
rollback;
Some datapoints in the datapoints table in the geostreams db were stored incorrectly. New entries to datapoints will be correct but older ones may need to be corrected.
this is related to this issue
https://github.com/terraref/computing-pipeline/issues/484