iangow / se_features

Linguistic features derived from StreetEvents
1 stars 3 forks source link

Sort out time-zone issue. #20

Closed iangow closed 4 years ago

iangow commented 4 years ago

Builds on discussion here.

Trying to fix code for preserving timestamp with time zone:

from sqlalchemy import create_engine
from sqlalchemy.types import DateTime
import pandas as pd
from pandas import Timestamp
import os

conn_string = 'postgresql://' + os.environ['PGHOST'] + '/' + os.environ['PGDATABASE']

engine = create_engine(conn_string)
limit_clause = "LIMIT 100"

sql = """
            SELECT file_name, max(last_update) as last_update
            FROM streetevents.calls
            GROUP BY file_name
            %s
        """ % (limit_clause)

files = pd.read_sql(sql, engine)
files['last_update'] = files['last_update'].map(lambda x: Timestamp(x))
files.to_sql("timestamp_test", engine, schema = "public", index = False,
             dtype = {'last_update': DateTime(timezone=True)},
             if_exists = "replace")

files.to_sql("timestamp_test_alt", engine, schema = "public", index = False,
             dtype = {'last_update': DateTime(timezone=False)},
             if_exists = "replace")
iangow commented 4 years ago

@Yvonne-Han I'm pretty sure we've solved this one before somewhere else. It's hard to keep track of all these things!

In any case, I think dtype = {'last_update': DateTime(timezone=True) might be the approach that @bdcallen has used.

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, public")

timestamp_test <- tbl(pg, "timestamp_test")
timestamp_test_alt <- tbl(pg, "timestamp_test_alt")
calls <- tbl(pg, "calls")

total <-
  timestamp_test %>% 
  count() %>%
  pull()

matched <-
  timestamp_test %>%
  semi_join(calls) %>%
  count() %>%
  pull() 
#> Joining, by = c("file_name", "last_update")

sprintf("Test passed: %s", total==matched)
#> [1] "Test passed: TRUE"

total_alt <-
  timestamp_test_alt %>% 
  count() %>%
  pull()

matched_alt <-
  timestamp_test_alt %>%
  semi_join(calls) %>%
  count() %>%
  pull() 
#> Joining, by = c("file_name", "last_update")

sprintf("Alt test passed: %s", total_alt==matched_alt)
#> [1] "Alt test passed: FALSE"

temp <- 
  timestamp_test %>%
  select(last_update) %>% 
  collect(n=10) %>% 
  pull()

temp_alt <- 
  timestamp_test_alt %>%
  select(last_update) %>% 
  collect(n=10) %>% 
  pull()

# Sys.timezone()
Sys.setenv(TZ="America/New_York")
temp
#>  [1] "2005-07-29 15:37:34 EDT" "2005-10-27 16:32:58 EDT"
#>  [3] "2005-11-04 07:25:42 EST" "2005-07-21 14:41:18 EDT"
#>  [5] "2005-03-08 13:42:52 EST" "2005-10-20 17:12:32 EDT"
#>  [7] "2006-01-23 16:03:54 EST" "2005-10-05 13:38:21 EDT"
#>  [9] "2005-02-02 13:43:45 EST" "2005-03-07 15:26:47 EST"
temp_alt
#>  [1] "2005-07-30 01:37:34 EDT" "2005-10-28 02:32:58 EDT"
#>  [3] "2005-11-04 18:25:42 EST" "2005-07-22 00:41:18 EDT"
#>  [5] "2005-03-09 00:42:52 EST" "2005-10-21 03:12:32 EDT"
#>  [7] "2006-01-24 03:03:54 EST" "2005-10-05 23:38:21 EDT"
#>  [9] "2005-02-03 00:43:45 EST" "2005-03-08 02:26:47 EST"

Sys.setenv(TZ="Australia/Sydney")
temp
#>  [1] "2005-07-30 05:37:34 AEST" "2005-10-28 06:32:58 AEST"
#>  [3] "2005-11-04 23:25:42 AEDT" "2005-07-22 04:41:18 AEST"
#>  [5] "2005-03-09 05:42:52 AEDT" "2005-10-21 07:12:32 AEST"
#>  [7] "2006-01-24 08:03:54 AEDT" "2005-10-06 03:38:21 AEST"
#>  [9] "2005-02-03 05:43:45 AEDT" "2005-03-08 07:26:47 AEDT"
temp_alt
#>  [1] "2005-07-30 15:37:34 AEST" "2005-10-28 16:32:58 AEST"
#>  [3] "2005-11-05 10:25:42 AEDT" "2005-07-22 14:41:18 AEST"
#>  [5] "2005-03-09 16:42:52 AEDT" "2005-10-21 17:12:32 AEST"
#>  [7] "2006-01-24 19:03:54 AEDT" "2005-10-06 13:38:21 AEST"
#>  [9] "2005-02-03 16:43:45 AEDT" "2005-03-08 18:26:47 AEDT"

Sys.setenv(TZ="UTC")
temp
#>  [1] "2005-07-29 19:37:34 UTC" "2005-10-27 20:32:58 UTC"
#>  [3] "2005-11-04 12:25:42 UTC" "2005-07-21 18:41:18 UTC"
#>  [5] "2005-03-08 18:42:52 UTC" "2005-10-20 21:12:32 UTC"
#>  [7] "2006-01-23 21:03:54 UTC" "2005-10-05 17:38:21 UTC"
#>  [9] "2005-02-02 18:43:45 UTC" "2005-03-07 20:26:47 UTC"
temp_alt
#>  [1] "2005-07-30 05:37:34 UTC" "2005-10-28 06:32:58 UTC"
#>  [3] "2005-11-04 23:25:42 UTC" "2005-07-22 04:41:18 UTC"
#>  [5] "2005-03-09 05:42:52 UTC" "2005-10-21 07:12:32 UTC"
#>  [7] "2006-01-24 08:03:54 UTC" "2005-10-06 03:38:21 UTC"
#>  [9] "2005-02-03 05:43:45 UTC" "2005-03-08 07:26:47 UTC"

Created on 2020-04-16 by the reprex package (v0.3.0)

Yvonne-Han commented 4 years ago

@iangow Yes, I know how hard it is to keep track of everything. Sometimes I confuse myself too... (I shouldn't because it's just been ≤ a year). Probably you want to try Quiver?

I just went through your reprex. So does that mean I can use dtype = {'last_update': DateTime(timezone=True)} in your NER code now?

iangow commented 4 years ago

@Yvonne-Han

@iangow Yes, I know how hard it is to keep track of everything. Sometimes I confuse myself too... (I shouldn't because it's just been ≤ a year). Probably you want to try Quiver?

I just went through your reprex. So does that mean I can use dtype = {'last_update': DateTime(timezone=True)} in your NER code now?

I think so.

I just ran the code from here against the MCCGR server see updated comment above (the earlier code was running against a PostgreSQL server downstairs from me). From the changes, you can see that nothing changes when I use type = {'last_update': DateTime(timezone=True)}, but everything falls over with type = {'last_update': DateTime(timezone=False)}.

iangow commented 4 years ago

The problem with remembering the solution is that I have remember who did it, which project, etc. I have some stuff in gist.github.com/iangow.

iangow commented 4 years ago

Code I ran (where tz_test.py contains the code here):

igowmbp15:~ igow$ export PGHOST=10.101.13.99
igowmbp15:~ igow$ python3 tz_test.py 
igowmbp15:~ igow$ # Ran R code above here. 
igowmbp15:~ igow$ psql 
psql (12.2, server 11.5 (Ubuntu 11.5-1.pgdg18.04+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

crsp=# DROP TABLE timestamp_test;
DROP TABLE
crsp=# DROP TABLE timestamp_test_alt;
DROP TABLE
crsp=# \q
igowmbp15:~ igow$