usgs-makerspace / makerspace-sandbox

Some initial R code for playing with data processing (maybe some light visualization).
Other
0 stars 5 forks source link

Hook up Tableau to data in S3 #431

Open wdwatkins opened 4 years ago

wdwatkins commented 4 years ago

Probably using AWS Athena. We'll reach out to Andrew first since he has used it

wdwatkins commented 4 years ago

@mhines-usgs I made a devops ticket for giving the Tableau IAM user permission to access to the buckets we've been using (or an alternative solution): https://internal.cida.usgs.gov/jira/browse/DEVOPS-1314. Once that is done, we should be able to set up Athena.

I'll have parquet versions of the current data files uploaded before I'm done today (Friday).

Remaining work should be:

wdwatkins commented 4 years ago

Will there be a way to have separate test/prod data sources?

mhines-usgs commented 4 years ago

each of the parquet files in s3 now has a table in the analytics_test database in athena stashing the bulk column add details here

table name: long_term_monthly
filename: long_term_monthly.parquet
path: s3://wma-analytics-data/dashboard/test/parquet/long_term_monthly/
year string,
month string,
sessions double,
avgSessionDuration double,
pageviewsPerSession double,
percentNewSessions double,
view_id string,
view_name string,
fiscal_year date,
backfill boolean

table name: page_load_30_days
path: s3://wma-analytics-data/dashboard/test/parquet/page_load/
filename: page_load_30_days.parquet
pagePath string,
pageLoadSample double,
avgPageLoadTime double,
avgPageDownloadTime double,
avgDomContentLoadedTime double,
exitRate double,
view_id string,
view_name string

table name: all_apps_traffic_data_3_years
filename: all_apps_traffic_data_3_years.parquet
path: s3://wma-analytics-data/dashboard/test/parquet/landing_exit_pages/
date date,
sessions double,
users double,
view_id string,
view_name string,
year date,
fiscal_year date

table name: year_month_week_traffic
path: s3://wma-analytics-data/dashboard/test/parquet/year_month_week/
filename: year_month_week_traffic.parquet
year_month_week_traffic.parquet
view_name string,
view_id string,
sessions double,
users double,
period string

table name: state_traffic_year_month_week
filename: state_traffic_year_month_week.parquet
path: s3://wma-analytics-data/dashboard/test/parquet/state_traffic/
state_traffic_year_month_week
region string,
country string,
sessions double,
view_id string,
view_name string,
period string

table name: all_apps_landing_exit_pages
path: s3://wma-analytics-data/dashboard/test/parquet/landing_exit_pages/
filename: all_apps_landing_exit_pages.parquet
landing_exit_pages
landingpagepath string,
secondpagepath string,
exitpagepath string,
sessions double,
view_id string,
view_name string

table name: summary_numbers
path: s3://wma-analytics-data/dashboard/test/parquet/summary_numbers/
file name: summary_numbers.parquet
deviceCategory string,
browser string,
dayOfWeekName string,
sessions double,
percentNewSessions double,
sessionDuration double,
view_id string,
view_name string,
period string,
newSessions double

table name: compared_to_last_year
path: s3://wma-analytics-data/dashboard/test/parquet/compared_to_last_year/
file name: compared_to_last_year.parquet
view_name string,
view_id string,
sessions_this_year double,
n_this_year bigint,
first_non_zero_date_this_year date,
sessions_last_year double,
n_last_year bigint,
first_non_zero_date_last_year date,
percent_change double,
period string

table name state_week_vs_year
path s3://wma-analytics-data/dashboard/test/parquet/state_week_vs_year/
file name state_week_vs_year.parquet
view_id string,
view_name string,
region string,
365_days double,
30_days double,
7_days double,
week_over_year double

table name state_traffic_population_percentages
path s3://wma-analytics-data/dashboard/test/parquet/state_traffic_population_percentages/
file name state_traffic_population_percentages.parquet
view_name string,
region string,
view_id string,
period string,
sessions double,
DIVISION string,
STATE bigint,
POPESTIMATE2019 bigint,
pop_pct double,
sessions_total_period double,
sessions_pct double,
sessions_population_ratio double,
country string
mhines-usgs commented 4 years ago

as I'm replacing data connections to the Athena published connection, it seems like I have to use a data extract rather than live connection, which hopefully should be OK since we are only updating the data once a day, in order to use several of the calculated field functions like makedate. https://community.tableau.com/thread/193967

mhines-usgs commented 4 years ago

all the existing plots are now using athena as data source. moving on to the other two plots (geo and bar graph for exec dash)

wdwatkins commented 4 years ago

So apparently the refresh schedule "Early weekday mornings" actually refreshes at 9 PM? image

mhines-usgs commented 4 years ago

for me it's 11pm! so yeah, I guess so? seems like a time-zone issue maybe or maybe it's UTC early morning?

wdwatkins commented 4 years ago

Yeah, I bet you're right, the server probably runs on UTC. Sounds like schedules have to be modified by admins: https://help.tableau.com/current/server/en-us/schedule_manage_create.htm. So I guess we'll need Dionne or someone to change this? This is making the data be a day old since the refresh doesn't happen until the evening.

mhines-usgs commented 4 years ago

ahh ok, sure I can reach out and ask for a custom schedule. do you have a suggestion that would work with when the Jenkins job runs?

mhines-usgs commented 4 years ago

Just looked at that Jenkins job timing, seems like 4am CT, so maybe at 5am CT / 10 UTC to give it time to run?

wdwatkins commented 4 years ago

Yeah that sounds good

mhines-usgs commented 4 years ago

sent in this help request which was assigned to someone but i haven't heard anything back yet

Hi there!

I've set up several data sources that we would like to run refresh extracts on in order to update daily, however based on the time zone difference (the servers must use UTC) one of the 3 options (weekday early mornings) for refreshing actually ends up running at 11pm CT/9pm PT and we'd rather have it run literally on weekdays in the morning to make sure it catches the new data that are produced by our Jenkins job that runs around 4am CT.

Long story short, it looks like according to Tableau that this can only be controlled by admins, I don't see the schedules option in my Tableau server view.

Can someone please add an option for 10AM UTC? This should be a good time for it to run after our Jenkins job finishes. This schedule addition should be implemented on both the development server (tableau.chs.usgs.gov) and public/internal server (tableau.usgs.gov) so that data sources on both servers can be scheduled to use this new schedule.

Here's the article that I mentioned:
https://help.tableau.com/current/server/en-us/schedule_manage_create.htm

Thank you!
Megan
mhines-usgs commented 4 years ago

image

i'll switch these over before I quit for today!

mhines-usgs commented 4 years ago

this is done now.