toddwschneider / nyc-citibike-data

NYC Citi Bike system data and analysis
MIT License
241 stars 96 forks source link

Replacing start/end_latitude/longitude (numeric) to start/end_location (geometry) in trips and dockless_trips tables #15

Open estebanzimanyi opened 4 months ago

estebanzimanyi commented 4 months ago

You will find next the content of a diff file that creates the attribute at table creation time, which is immediate. Doing this after the table is created takes hours (I stopped the process after 1h). P.S. I am not able to attach the diff file here, this is why it is copy/pasted below

diff --git a/create_schema.sql b/create_schema.sql
index ff784c2..2b103e7 100644
--- a/create_schema.sql
+++ b/create_schema.sql
@@ -33,10 +33,8 @@ CREATE TABLE trips (
   gender integer,
   ride_id text,
   rideable_type text,
-  start_latitude numeric,
-  start_longitude numeric,
-  end_latitude numeric,
-  end_longitude numeric,
+  start_location geometry(Point),
+  end_location geometry(Point),
   start_station_name text,
   end_station_name text
 );
@@ -46,10 +44,8 @@ CREATE TABLE dockless_trips (
   trip_duration numeric,
   start_time timestamp without time zone,
   stop_time timestamp without time zone,
-  start_latitude numeric,
-  start_longitude numeric,
-  end_latitude numeric,
-  end_longitude numeric,
+  start_location geometry(Point),
+  end_location geometry(Point),
   bike_id integer,
   user_type text,
   birth_year integer,
diff --git a/populate_trips_from_raw.sql b/populate_trips_from_raw.sql
index 2d9af98..297e3d8 100644
--- a/populate_trips_from_raw.sql
+++ b/populate_trips_from_raw.sql
@@ -2,8 +2,7 @@ INSERT INTO trips
 (
   trip_duration, start_time, stop_time, start_station_id, end_station_id,
   bike_id, user_type, birth_year, gender, ride_id, rideable_type,
-  start_station_name, end_station_name, start_latitude, start_longitude,
-  end_latitude, end_longitude
+  start_station_name, end_station_name, start_location, end_location
 )
 SELECT
   trip_duration,
@@ -19,28 +18,23 @@ SELECT
   rideable_type,
   start_station_name,
   end_station_name,
-  nullif(start_station_latitude, 0) AS start_latitude,
-  nullif(start_station_longitude, 0) AS start_longitude,
-  nullif(end_station_latitude, 0) AS end_latitude,
-  nullif(end_station_longitude, 0) AS end_longitude
+  st_point(start_station_longitude, start_station_latitude, 4326) AS start_location,
+  st_point(end_station_longitude, end_station_latitude, 4326) AS end_location
 FROM trips_raw
 WHERE start_station_id IS NOT NULL
   AND end_station_id IS NOT NULL;

 INSERT INTO dockless_trips
 (
-  trip_duration, start_time, stop_time, start_latitude, start_longitude,
-  end_latitude, end_longitude, bike_id, user_type, birth_year, gender,
-  ride_id, rideable_type
+  trip_duration, start_time, stop_time, start_location, end_location,
+  bike_id, user_type, birth_year, gender, ride_id, rideable_type
 )
 SELECT
   trip_duration,
   start_time,
   stop_time,
-  start_station_latitude,
-  start_station_longitude,
-  end_station_latitude,
-  end_station_longitude,
+  st_point(start_station_longitude, start_station_latitude, 4326) AS start_location,
+  st_point(end_station_longitude, end_station_latitude, 4326) AS end_location,
   bike_id,
   user_type,
   nullif(nullif(birth_year, ''), 'NULL')::numeric::int AS birth_year,

Many thanks for this AMAZING work ! Esteban

toddwschneider commented 3 months ago

Hi @estebanzimanyi, thanks for the kind words

I'm not sure I understand though why the location fields on the trips table should be Postgis points. My idea was to keep all of the Postgis calculations on the stations table, since there are far fewer stations than trips, and theoretically every trip has to start and end at a station (except for dockless trips, though I haven't really focused on those...)

I've seen some apparently bad data in the raw trips data where a particular station ID might be associated mostly with a single lat/lon coordinate pair, but a small handful of trips at the same station ID have different lat/lon. In those cases, I'm inclined to go with the station ID, i.e. assume that all trips with a given station ID started at that station ID's most common lat/lon pair

Anyway, let me know if you have any particular thoughts about why the Point column type is needed on the trips table, and thanks