terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

fix incorrect datapoints in datapoints table, geostreams db #581

Open tcnichol opened 5 years ago

tcnichol commented 5 years ago

Some datapoints in the datapoints table in the geostreams db were stored incorrectly. New entries to datapoints will be correct but older ones may need to be corrected.

This is related to issue #484:

https://github.com/terraref/computing-pipeline/issues/484

robkooper commented 5 years ago

Can you do a query to show how many datapoints are affected?

dlebauer commented 5 years ago

And, which DB and which data points?

tcnichol commented 5 years ago

Updated issue to add which database and table.

tcnichol commented 5 years ago

The result of this query

 select count(*) from datapoints where ST_Distance(geog, ST_SetSRID(ST_Point(-112.013473,33.039463),4326)::geography)  > 10000;

is 18773153, and there are 137686191 datapoints in total.

So 18773153 points, or approximately 13.6% of the records in datapoints, are outside Maricopa.

I'll be trying out a fix on a much smaller table locally.
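For reference, the percentage quoted above works out as follows (a trivial check using the counts reported in this comment):

```python
# Fraction of datapoints outside Maricopa, from the query counts above.
outside_maricopa = 18773153
total_datapoints = 137686191
pct = 100 * outside_maricopa / total_datapoints
print(round(pct, 1))  # 13.6
```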

dlebauer commented 5 years ago

There are likely data points from Urbana, perhaps also some from St Louis and Manhattan, KS



tcnichol commented 5 years ago

So I did some queries to find out where the datapoints in the db were.

Total number of datapoints = 137686191
Near Maricopa = 118913038
Near Urbana = 181152

I ran this query based on the original issue https://github.com/terraref/computing-pipeline/issues/484

select count(*) from datapoints where ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography)  < 5000;

The result was 18592001 entries. This accounts for all the entries in datapoints that are not in Arizona or near Urbana.
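As a quick consistency check, the three counts do account for every row (a trivial sketch using the numbers reported in this thread):

```python
# The per-region counts should sum to the total number of datapoints.
total = 137686191
near_maricopa = 118913038
near_urbana = 181152
bad_points = 18592001  # entries near ST_Point(33.0752..., -68.0249...)
print(near_maricopa + near_urbana + bad_points == total)  # True
```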

So I got one of these entries. The gid is 164759042 and the source dataset is https://terraref.ncsa.illinois.edu/clowder/datasets/5b3040334f0c2b43c04f962a

and the geog is

0103000020E61000000100000004000000DFF73902A68940401D402BD7990151C0DFF73902A6894040BE7C55019A0151C0241700A2A5894040BE7C55019A0151C0241700A2A58940401D402BD7990151C0

I wrote a Python script to get lat/lon from the geography. It uses shapely.wkb, and the call below returns a geometry (Point, MultiPolygon, etc.) with lat/lon within AZ.

wkb.loads(t_geog, hex=True)

But when I use a geography from one of the datapoints that are 'bad' an error is thrown:

shapely.errors.WKBReadingError: Could not create geometry because of errors while reading input

The same error is thrown for other datapoints not in AZ or near Urbana. This makes some sense, since the query that finds them uses lat, lon order instead of lon, lat order.

The datasets referenced in the data field of these datapoints entries have correct lat/lon, so a correct geography can be created from them. I'm not sure how the geog column is generated for this table, but I figured I'd report what I've found so far.
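For anyone curious what is inside one of these bad geog values, here is a pure-Python sketch that decodes the hex EWKB by hand (no shapely; it assumes the little-endian polygon-with-SRID layout shown above and is not a general WKB reader):

```python
import struct

def parse_ewkb_polygon(hex_wkb):
    """Decode a little-endian EWKB polygon with an embedded SRID.

    Minimal sketch for this one layout; not a general WKB reader.
    """
    buf = bytes.fromhex(hex_wkb)
    assert buf[0] == 1                      # 1 = little-endian
    geom_type, = struct.unpack_from('<I', buf, 1)
    assert geom_type & 0x20000000           # EWKB "has SRID" flag
    assert geom_type & 0xFF == 3            # base type 3 = Polygon
    srid, = struct.unpack_from('<I', buf, 5)
    n_rings, = struct.unpack_from('<I', buf, 9)
    offset = 13
    rings = []
    for _ in range(n_rings):
        n_points, = struct.unpack_from('<I', buf, offset)
        offset += 4
        ring = []
        for _ in range(n_points):
            x, y = struct.unpack_from('<2d', buf, offset)
            offset += 16
            ring.append((x, y))
        rings.append(ring)
    return srid, rings

# The geog value quoted above:
hex_wkb = ('0103000020E61000000100000004000000'
           'DFF73902A68940401D402BD7990151C0'
           'DFF73902A6894040BE7C55019A0151C0'
           '241700A2A5894040BE7C55019A0151C0'
           '241700A2A58940401D402BD7990151C0')
srid, rings = parse_ewkb_polygon(hex_wkb)
ring = rings[0]
print(srid)                 # 4326
print(ring[0])              # x ~ 33.075, y ~ -68.025: a swapped lat/lon pair
print(ring[0] == ring[-1])  # False -- the ring is not closed
```

The decoded coordinates match the cluster the distance query finds, and the first and last ring points differ, which is consistent with shapely refusing to read the geometry.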

robkooper commented 5 years ago

@tcnichol keep in mind you can use ST_X() and ST_Y() to parse the geog. I use something like SELECT ST_Y(ST_Centroid(geog)) AS lat, ST_X(ST_Centroid(geog)) AS lon FROM sensors to get the latitude and longitude of the points.

tcnichol commented 5 years ago

I had to use the ST_AsText on the geog but this one worked - those points have the lat/lon that is in the Atlantic.

Again, I'm not sure of the best way to fix this, since I don't know how this database table is populated. If it's done by an extractor, then resubmitting the datasets might work. If we assume an offset based on the stored value and the true longitude summing to -180, then a script to update the db directly could work.
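The offset idea can be sketched numerically. This is an assumption inferred from the bad coordinates, not a confirmed root cause: the stored x holds the true latitude, and the stored y holds -180 minus the true longitude.

```python
# Sketch of the suspected corruption and its inverse. Assumption (not a
# confirmed root cause): the stored x is the true latitude, and the
# stored y plus the true longitude sum to -180.

def repair_point(bad_x, bad_y):
    """Recover (lon, lat) from a corrupted point."""
    lon = -180.0 - bad_y   # undo the offset on the longitude
    lat = bad_x            # the latitude was stored in the x slot
    return lon, lat

# The bad cluster near (33.075, -68.025) maps back into Maricopa, AZ:
lon, lat = repair_point(33.0752456246289, -68.0249705271937)
print(lon, lat)  # ~ (-111.975, 33.075)
```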

tcnichol commented 5 years ago

I created a datapoints table locally and added points that replicated the error found in datapoints. This query updates the lat/lon:

update datapoints
set geog =  ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_Centroid( ST_AsText(geog))), ST_X(ST_Centroid( ST_AsText(geog)))), 4326)::geography
where  ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography)  < 5000;

After this, all the points that were formerly in the Atlantic have lat/lon within Maricopa.

My local db table was very small (just 10 datapoints) so I am not sure how long this would take. I could also copy datapoints, and run this update on the copy.

max-zilla commented 5 years ago

Please do an EXPLAIN on this query (https://www.postgresql.org/docs/9.1/sql-explain.html) and we can get a sense of how it will be executed, so we can see which indices exist, etc.

tcnichol commented 5 years ago

Running this now, but it seems to take a very long time. I will post the results once it is done.

explain analyze verbose
update datapoints
set geog =  ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_Centroid( ST_AsText(geog))), ST_X(ST_Centroid( ST_AsText(geog)))), 4326)::geography
where  ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography)  < 5000;

tcnichol commented 5 years ago

It stopped but I got this error :

ERROR:  geometry contains non-closed rings
HINT:  "...,33.0756200671313 -68.0250750753778))" <-- parse error at position 150 within geometry

Will be looking into this.

tcnichol commented 5 years ago

I found a query that works. I use ST_MakeValid and then take the centroid of the geometry. Many of the geographies are polygons that are not closed. This 'explain' query might help:

explain 
update datapoints
set geog =  ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_Centroid( ST_MakeValid(geog::geometry))), ST_X(ST_Centroid( ST_MakeValid(geog::geometry)))), 4326)::geography
where  ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography)  < 5000;

Will check on the server.

tcnichol commented 5 years ago

According to the docs, wrapping the update in a transaction and rolling it back should measure the time for the update without actually changing the table:

begin;
explain analyze
update datapoints
set geog = ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_Centroid(ST_MakeValid(geog::geometry))),
                                   ST_X(ST_Centroid(ST_MakeValid(geog::geometry)))), 4326)::geography
where ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography)  < 5000;
rollback;