snowplow / enrich

Snowplow Enrichment jobs and library
https://snowplowanalytics.com

Common: add an enrichment extracting canonical properties into dedicated contexts #47

Open chuwy opened 4 years ago

chuwy commented 4 years ago

In order to refactor atomic events, we need to extract all non-generic information from the fat table into dedicated contexts and preserve only the common properties. As a first step, we can keep those properties both in the atomic event (as we do now, so as not to break data models) and in their dedicated tables/columns (so that new data models can start being written).

I tried to summarize what contexts and event-specific properties can be extracted out of Event:

  1. app_id
  2. platform
  3. etl_tstamp
  4. collector_tstamp
  5. dvce_created_tstamp
  6. event
  7. event_id
  8. txn_id
  9. name_tracker
  10. v_tracker
  11. v_collector
  12. v_etl
  13. user_id
  14. user_ipaddress
  15. user_fingerprint
  16. domain_userid
  17. domain_sessionidx
  18. network_userid
  19. geo_country - MaxMind context
  20. geo_region - MaxMind context
  21. geo_city - MaxMind context
  22. geo_zipcode - MaxMind context
  23. geo_latitude - MaxMind context
  24. geo_longitude - MaxMind context
  25. geo_region_name - MaxMind context
  26. ip_isp - MaxMind context
  27. ip_organization - MaxMind context
  28. ip_domain - MaxMind context
  29. ip_netspeed - MaxMind context
  30. page_url - Web page context (source of truth)
  31. page_title - Web page context (source of truth)
  32. page_referrer - Referrer context (source of truth)
  33. page_urlscheme - Web page context
  34. page_urlhost - Web page context
  35. page_urlport - Web page context
  36. page_urlpath - Web page context
  37. page_urlquery - Web page context
  38. page_urlfragment - Web page context
  39. refr_urlscheme - Referrer context
  40. refr_urlhost - Referrer context
  41. refr_urlport - Referrer context
  42. refr_urlpath - Referrer context
  43. refr_urlquery - Referrer context
  44. refr_urlfragment - Referrer context
  45. refr_medium - Referrer context
  46. refr_source - Referrer context
  47. refr_term - Referrer context
  48. mkt_medium - Marketing campaign context
  49. mkt_source - Marketing campaign context
  50. mkt_term - Marketing campaign context
  51. mkt_content - Marketing campaign context
  52. mkt_campaign - Marketing campaign context
  53. contexts
  54. se_category - Struct event self-describing event
  55. se_action - Struct event self-describing event
  56. se_label - Struct event self-describing event
  57. se_property - Struct event self-describing event
  58. se_value - Struct event self-describing event
  59. unstruct_event
  60. tr_orderid - Ecommerce transaction self-describing event
  61. tr_affiliation - Ecommerce transaction self-describing event
  62. tr_total - Ecommerce transaction self-describing event
  63. tr_tax - Ecommerce transaction self-describing event
  64. tr_shipping - Ecommerce transaction self-describing event
  65. tr_city - Ecommerce transaction self-describing event
  66. tr_state - Ecommerce transaction self-describing event
  67. tr_country - Ecommerce transaction self-describing event
  68. ti_orderid - Ecommerce transaction item context
  69. ti_sku - Ecommerce transaction item context
  70. ti_name - Ecommerce transaction item context
  71. ti_category - Ecommerce transaction item context
  72. ti_price - Ecommerce transaction item context
  73. ti_quantity - Ecommerce transaction item context
  74. pp_xoffset_min - Page ping self-describing event
  75. pp_xoffset_max - Page ping self-describing event
  76. pp_yoffset_min - Page ping self-describing event
  77. pp_yoffset_max - Page ping self-describing event
  78. useragent - Browser context (but populated from different places)
  79. br_name - Browser context (but populated from different places) (ua-utils)
  80. br_family - Browser context (but populated from different places) (ua-utils)
  81. br_version - Browser context (but populated from different places) (ua-utils)
  82. br_type - Browser context (but populated from different places) (ua-utils)
  83. br_renderengine - Browser context (but populated from different places) (ua-utils)
  84. br_lang - Browser context (but populated from different places)
  85. br_features_pdf - Browser context (but populated from different places)
  86. br_features_flash - Browser context (but populated from different places)
  87. br_features_java - Browser context (but populated from different places)
  88. br_features_director - Browser context (but populated from different places)
  89. br_features_quicktime - Browser context (but populated from different places)
  90. br_features_realplayer - Browser context (but populated from different places)
  91. br_features_windowsmedia - Browser context (but populated from different places)
  92. br_features_gears - Browser context (but populated from different places)
  93. br_features_silverlight - Browser context (but populated from different places)
  94. br_cookies - Browser context (but populated from different places)
  95. br_colordepth - Browser context (but populated from different places)
  96. br_viewwidth - Browser context (but populated from different places)
  97. br_viewheight - Browser context (but populated from different places)
  98. os_name - Browser context (but populated from different places) (ua-utils)
  99. os_family - Browser context (but populated from different places) (ua-utils)
  100. os_manufacturer - Browser context (but populated from different places)
  101. os_timezone - Browser context (but populated from different places)
  102. dvce_type - Browser context (but populated from different places) (ua-utils)
  103. dvce_ismobile - Browser context (but populated from different places) (ua-utils)
  104. dvce_screenwidth - Browser context (but populated from different places)
  105. dvce_screenheight - Browser context (but populated from different places)
  106. doc_charset - Web page (or document) context
  107. doc_width - Web page (or document) context
  108. doc_height - Web page (or document) context
  109. tr_currency - Ecommerce transaction self-describing event
  110. tr_total_base - Ecommerce transaction self-describing event
  111. tr_tax_base - Ecommerce transaction self-describing event
  112. tr_shipping_base - Ecommerce transaction self-describing event
  113. ti_currency - Ecommerce transaction item context
  114. ti_price_base - Ecommerce transaction item context
  115. base_currency - Ecommerce transaction self-describing event
  116. geo_timezone - MaxMind context
  117. mkt_clickid - Marketing campaign context
  118. mkt_network - Marketing campaign context
  119. etl_tags
  120. dvce_sent_tstamp
  121. refr_domain_userid - Referrer context
  122. refr_dvce_tstamp - Referrer context
  123. derived_contexts
  124. domain_sessionid
  125. derived_tstamp
  126. event_vendor
  127. event_name
  128. event_format
  129. event_version
  130. event_fingerprint - This should remain in canonical event
  131. true_tstamp

The grouping is not strictly semantic; it is based mostly on the source of the information. For example, although browser and device info is semantically the same kind of information, some of the properties are passed through the tracker protocol and some are derived through user-agent enrichment.
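As a rough illustration of the first step described above, here is a minimal Python sketch that copies a few non-generic fields into per-context groups while leaving the atomic event untouched. The field names come from the list above; the context labels (`maxmind`, `web_page`, and so on) and the function name `extract_contexts` are hypothetical and are not actual schema or enrich API names.

```python
# Hypothetical mapping from atomic field to target context.
# Only a handful of rows from the list above are included for brevity;
# the context labels are illustrative, not real Iglu schema names.
FIELD_TO_CONTEXT = {
    "geo_country": "maxmind",
    "geo_city": "maxmind",
    "ip_isp": "maxmind",
    "page_url": "web_page",
    "page_title": "web_page",
    "refr_medium": "referrer",
    "mkt_campaign": "marketing_campaign",
    "br_name": "browser",
    "br_lang": "browser",
}


def extract_contexts(event):
    """Group populated non-generic fields by their target context.

    The input event is not modified, so the atomic columns stay
    intact and existing data models keep working.
    """
    contexts = {}
    for field, value in event.items():
        target = FIELD_TO_CONTEXT.get(field)
        # Skip core fields (no mapping) and unpopulated fields.
        if target is not None and value is not None:
            contexts.setdefault(target, {})[field] = value
    return contexts
```

Calling `extract_contexts` on a flat event returns a dictionary keyed by context label, which could then be serialized into the dedicated tables/columns alongside the unchanged atomic row.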

Contexts

Self-describing events

Common properties

This leaves us with 31 core properties that can be set for almost all events/pipelines. Maybe some of them (user/device identification) can/should be moved into dedicated contexts.

  1. event_id - event identification
  2. app_id - event identification
  3. event - eventually will be discarded in favor of vendor/name/version
  4. txn_id - event identification
  5. event_vendor - event identification
  6. event_name - event identification
  7. event_format - event identification
  8. event_version - event identification
  9. event_fingerprint - event identification
  10. platform - probably should be moved as well
  11. dvce_created_tstamp - timestamps
  12. dvce_sent_tstamp - timestamps
  13. collector_tstamp - timestamps
  14. etl_tstamp - timestamps
  15. derived_tstamp - timestamps
  16. true_tstamp - timestamps
  17. user_id - user/device identification
  18. user_ipaddress - user/device identification
  19. user_fingerprint - user/device identification
  20. domain_userid - user/device identification
  21. domain_sessionidx - user/device identification
  22. domain_sessionid - user/device identification
  23. network_userid - user/device identification
  24. name_tracker - pipeline/aux
  25. v_tracker - pipeline/aux
  26. v_collector - pipeline/aux
  27. v_etl - pipeline/aux
  28. etl_tags - pipeline/aux
  29. unstruct_event - payload
  30. contexts - payload
  31. derived_contexts - payload
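
The split implied by the list above can be sketched in Python as follows. The set contains exactly the 31 core property names listed; the function name `split_core` is hypothetical, and this is only one possible way to separate the two groups.

```python
# The 31 core properties from the list above.
CORE_PROPERTIES = frozenset({
    # event identification
    "event_id", "app_id", "event", "txn_id", "event_vendor",
    "event_name", "event_format", "event_version", "event_fingerprint",
    "platform",
    # timestamps
    "dvce_created_tstamp", "dvce_sent_tstamp", "collector_tstamp",
    "etl_tstamp", "derived_tstamp", "true_tstamp",
    # user/device identification
    "user_id", "user_ipaddress", "user_fingerprint", "domain_userid",
    "domain_sessionidx", "domain_sessionid", "network_userid",
    # pipeline/aux
    "name_tracker", "v_tracker", "v_collector", "v_etl", "etl_tags",
    # payload
    "unstruct_event", "contexts", "derived_contexts",
})


def split_core(event):
    """Return (core, rest): core properties and everything else."""
    core = {k: v for k, v in event.items() if k in CORE_PROPERTIES}
    rest = {k: v for k, v in event.items() if k not in CORE_PROPERTIES}
    return core, rest
```

Everything that ends up in `rest` is a candidate for one of the dedicated contexts or self-describing events.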
chuwy commented 4 years ago

Migrated from https://github.com/snowplow/snowplow/issues/4244 (comments are auto-generated)

chuwy commented 3 years ago

I've created a spreadsheet, proposing what new contexts and events should look like: https://docs.google.com/spreadsheets/d/1UaXrH92IvRWyXNU8wUQ-oxvEI9kJxoxbIcbRjna7RAI/edit#gid=0

BioQwer commented 1 year ago

@chuwy do you have an enrichments config for the full atomic schema?

benjben commented 1 year ago

Hi @BioQwer, which config are you referring to? FYI, this issue is still on our roadmap, but it has not been prioritized yet.

BioQwer commented 1 year ago

I work with the open source version. I have many empty values in atomic columns.