treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0

digdag schedules output timezones confusing #32

Closed danielnorberg closed 8 years ago

danielnorberg commented 8 years ago

The "next session time" and "next runs at" fields are printed in different timezones, which is a bit confusing.

$ digdag schedules
2016-04-11 14:46:04 -0700: Digdag v0.5.9
Schedules:
  id: 2
  repository: dano-churn-prediction-poc
  workflow: +main
  next session time: 2016-04-13 00:00:00 +0900
  next runs at: 2016-04-12 15:00:00 -0700 (24h 13m 55s later)

1 entries.
Use `digdag workflows +NAME` to show workflow details.

frsyuki commented 8 years ago

Why is that confusing? What do you think would be better? "session_time" carries the concept of years, days, hours, etc. as numbers. Those numbers are important because workflows often use the timestamp formatted as text. Therefore the digdag CLI shouldn't convert it to local time. The next run time is different: its formatting means nothing to machines, so it's OK to show it in local time for humans.
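To make the point about numbers concrete, here is a small Python sketch (not digdag code) that takes the session time from the output above and converts it to the reporter's -0700 offset. The variable names are only for illustration. The instant is the same, but the date digits a workflow would paste into text are not.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S %z"

# Session time exactly as printed by `digdag schedules` above (+0900).
session_time = datetime(2016, 4, 13, 0, 0, 0, tzinfo=timezone(timedelta(hours=9)))

# The same instant shown in the -0700 offset used for "next runs at".
as_local = session_time.astimezone(timezone(timedelta(hours=-7)))

print(session_time.strftime(FMT))  # 2016-04-13 00:00:00 +0900
print(as_local.strftime(FMT))      # 2016-04-12 08:00:00 -0700

# Both values are the same point in time...
print(session_time == as_local)    # True

# ...but a date suffix built from each one differs.
print(session_time.strftime("%Y%m%d"))  # 20160413
print(as_local.strftime("%Y%m%d"))      # 20160412
```

So if the CLI converted session time to local time, the date it displayed would no longer match the date that ends up in table names or queries.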

danielnorberg commented 8 years ago

Maybe it's just that I'm confused about session vs run time.

danielnorberg commented 8 years ago

Are there any docs/material where I can read up on the significance of session vs run time?

frsyuki commented 8 years ago

That is an extremely important concept, but there are no documents for it yet!!

Well, for example, say you have a workflow that runs every day. It writes results to a table on TD with a date suffix, like result_20160412. You'd use the create_table: [result_${last_session_date_compact}] option with the td> operator. The query would look like this:

select count(*) from data
where TD_TIME_RANGE(time, '2016-04-12 00:00:00 -0700', '2016-04-13 00:00:00 -0700')

which will be written in the workflow definition as follows:

select count(*) from data
where TD_TIME_RANGE(time, '${last_session_time}', '${session_time}')
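As a rough illustration of how the templated query turns back into the concrete one, here is a Python sketch. It is not digdag's template engine: `session_variables` is a hypothetical helper, the one-day interval is an assumption taken from the "runs every day" example, and the timestamp format simply copies the one used in this thread.

```python
from datetime import datetime, timedelta, timezone
from string import Template

FMT = "%Y-%m-%d %H:%M:%S %z"

def session_variables(session_time, interval=timedelta(days=1)):
    """Hypothetical helper: derive every variable from the session time alone."""
    last = session_time - interval
    return {
        "session_time": session_time.strftime(FMT),
        "last_session_time": last.strftime(FMT),
        "last_session_date_compact": last.strftime("%Y%m%d"),
    }

# string.Template happens to use the same ${name} placeholder syntax.
QUERY = Template(
    "select count(*) from data\n"
    "where TD_TIME_RANGE(time, '${last_session_time}', '${session_time}')"
)

session = datetime(2016, 4, 13, tzinfo=timezone(timedelta(hours=-7)))
variables = session_variables(session)

rendered_query = QUERY.substitute(variables)
table_name = Template("result_${last_session_date_compact}").substitute(variables)

print(rendered_query)
print(table_name)  # result_20160412
```

Nothing in the sketch reads the wall clock, which is the point: the rendered text depends only on the session time.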

But then you notice that you can't run the workflow at 00:00:00, because the data is not ready at 00:00:00. You need to delay the start time by 3 hours. In this case, session_time should be 2016-04-13 00:00:00 -0700, but the run time should be 2016-04-13 03:00:00 -0700.

This also happens when you retry the workflow. If you retry it 1 week later, the run time is 1 week later, but session_time should stay 2016-04-13 00:00:00 -0700.
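The delay and retry cases can be sketched in a few lines of Python. This only models the idea and is not how digdag schedules anything; `first_run_time` and the variable names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S %z"
TZ = timezone(timedelta(hours=-7))

def first_run_time(session_time, delay):
    """Hypothetical helper: the run starts some delay after the session time."""
    return session_time + delay

# The session time identifies which day's data the workflow is about.
session_time = datetime(2016, 4, 13, tzinfo=TZ)

# Data is not ready at midnight, so the scheduled run starts 3 hours later.
scheduled_run = first_run_time(session_time, timedelta(hours=3))

# A retry one week later is a new run of the same session.
retry_run = scheduled_run + timedelta(weeks=1)

for label, run_at in [("scheduled", scheduled_run), ("retry", retry_run)]:
    print(f"{label}: run at {run_at.strftime(FMT)}, "
          f"session_time {session_time.strftime(FMT)}")
```

Both lines print the same session_time, so a query built from it covers the same day of data whether it runs on schedule or a week late.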

danielnorberg commented 8 years ago

Ok, that makes sense, thanks for explaining! =)