Closed roborourke closed 2 years ago
Hey team! Please add your planning poker estimate with ZenHub @mikelittle @robindevitt @shadyvb
Bit of info on the foundational code that we'll be building on:
I will check for Segment account that we can test with.
@roborourke After a discussion on Shadi's last comment, where the team got more clarity, we have estimated the issue at 5
Please create an account on https://app.segment.com/signup/ to test out with.
Notes:
receivedAt
is not useful unless we have both the client-side timestamp AND the server-side timestamp, and I'm not sure if we have the client side timestamp in the event entry, or which prop it is. I don't think this is too important anyways, Segment uses '*At' props to calculate the final timestamp
to account for any differences.endpoint.Attributes
) are meant for context
rather than traits
. Worth discussion expectations here.re identify
calls
identify
call is preferred to be sent on registration/login/profile-changes, sending Metrics.sessions
and Metrics.pageViews
might cause an update to the profile on each call. Is there another way to track this instead ?pageViews
AFTER a newer event is processed by Segment.identify
calls for each Segment event, would substantially increase the amount of data to send, and would have less impact on the actual data store at Segment, should we invest the time/power to track sync status for different identities ? :thinking:messageId = endpoint.RequestId or attributes.x-amz-request-id if present
I think we should be always using RequestId
so the data would be consistent, or at least always using x-amz-request-id
. Whichever is always present.
receivedAt
is not useful unless we have both the client-side timestamp AND the server-side timestamp
We have event_timestamp
is the client side timestamp and arrival_timestamp
would be used for recievedAt
.
re Traits vs Context, I think traits are meant to refer to user attributes, where Context is meant to describe the action AND the endpoint. So for example all the information on the device and page ( from endpoint.Attributes ) are meant for context rather than traits. Worth discussion expectations here.
This depends on the endpoint being used I think. The documentation for the shared common properties and the individual endpoint event data should tell which data goes where in terms of context or traits.
I'm going on a limp discovering available attributes from existing data, so some attributes might not make sense ( been filtered by client ), we'll need to double check that during reviews.
Our default attributes can be determined from the analytics.js file and from the data structure docs. Assume anything not documented there is custom.
An identify call is preferred to be sent on registration/login/profile-changes, sending Metrics.sessions and Metrics.pageViews might cause an update to the profile on each call. Is there another way to track this instead
I imagine Segment can deduce those metrics from the available event data. For us it's a means of working out the current user's audience on the client side so I'd say we don't need to send it. Do we know for sure if a potential update to the profile on each call is a problem?
This also complicates the plan to send batch requests in parallel, because one older event might update a user's pageViews AFTER a newer event is processed by Segment.
If the events have timestamps and we process them in sequence this seems unlikely to me. I think we can leave those specific pieces of info out if we have to.
It's complicated to send identify calls for each Segment event, would substantially increase the amount of data to send, and would have less impact on the actual data store at Segment, should we invest the time/power to track sync status for different identities ?
Is it complicated technically or just seems like too much data?
I'd go with sending them each time for now, I know it doesn't seem ideal but we don't know yet how Segment will cope. We could maybe track by endpoint for changes but I'd be more concerned about the memory usage of doing that for up to 5mb worth of data at a time.
I think we should be always using RequestId so the data would be consistent, or at least always using x-amz-request-id. Whichever is always present.
I actually got this implementation wrong it turns out, endpoint.RequestId
should only change when the endpoint data was last updated via the updateEndpoint()
call. Can you check the data we have available for x-amz-request-id
and whether that's consistently present? We should use that value ideally.
We have event_timestamp is the client side timestamp and arrival_timestamp would be used for recievedAt.
Done
re Traits vs Context, I think traits are meant to refer to user attributes, where Context is meant to describe the action AND the endpoint. So for example all the information on the device and page ( from endpoint.Attributes ) are meant for context rather than traits. Worth discussion expectations here.
This depends on the endpoint being used I think. The documentation for the shared common properties and the individual endpoint event data should tell which data goes where in terms of context or traits.
Rather that being contextual, their importance might differ based on the endpoint, but their primary concerns are the same across all API calls AFAIK.
An identify call is preferred to be sent on registration/login/profile-changes, sending Metrics.sessions and Metrics.pageViews might cause an update to the profile on each call. Is there another way to track this instead
I imagine Segment can deduce those metrics from the available event data. For us it's a means of working out the current user's audience on the client side so I'd say we don't need to send it. Do we know for sure if a potential update to the profile on each call is a problem?
Removed the two vars. And there is no docs discouraging the update profiles, but removing them should help that concern.
This also complicates the plan to send batch requests in parallel, because one older event might update a user's pageViews AFTER a newer event is processed by Segment.
If the events have timestamps and we process them in sequence this seems unlikely to me. I think we can leave those specific pieces of info out if we have to.
Yup, removed.
It's complicated to send identify calls for each Segment event, would substantially increase the amount of data to send, and would have less impact on the actual data store at Segment, should we invest the time/power to track sync status for different identities ?
Is it complicated technically or just seems like too much data? I'd go with sending them each time for now, I know it doesn't seem ideal but we don't know yet how Segment will cope. We could maybe track by endpoint for changes but I'd be more concerned about the memory usage of doing that for up to 5mb worth of data at a time.
Agreed, and yes it's about sending too much data, but should be fine to start with.
I think we should be always using RequestId so the data would be consistent, or at least always using x-amz-request-id. Whichever is always present.
I actually got this implementation wrong it turns out, endpoint.RequestId should only change when the endpoint data was last updated via the updateEndpoint() call. Can you check the data we have available for x-amz-request-id and whether that's consistently present? We should use that value ideally.
I think you've fixed that recently, so I've set messageId
to endpoint.RequestId
, and x-amz-request-id
should already be tracked as part of the endpoint.Attributes
prop already.
@roborourke the PR is ready for another review.
Shall we let the customer require the package manually ? or do we ship it with the module ? Just so I figure out how to live-test it ( prob on our test stack ).
Also let me know if the tests are okay at the current state.
I think the tests are ok, not sure what you could do beyond comparing one or two sample events with how the batch request(s) will look without actually trying to send data. In theory we could mock the segment endpoint to check how we handle rejections / errors but strategically I think our time is better spent elsewhere right now.
I think the package should be included by default and switched on via config or if the required constants are defined.
Added a PR for inclusion of the package with the module, config values, and updated docs. See https://github.com/humanmade/altis-analytics/pull/245
PR #245 now approved and merged.
Thanks @mikelittle and @shadyvb! Closing this out
@shadyvb I ticked the last two acceptance criteria.
Bringing this back for functional testing on the test stack, which should be possible now that the PR is merged.
@shadyvb What do you to think to using a new issue #246 to capture the further QA and testing so we can close this out and capture the existing effort that has gone into it? Thinking it'll allow us to see the actual effort we've done rather than moving things along and dropping the estimates.
Yes, that's more sensible. 👍🏻
Dependent on the issue to implement a new batch processing tool for analytics exports: https://github.com/humanmade/aws-analytics/issues/320
We want to create a Segment integration that we can provide out of the box in Altis, this can exist as a separate package for portability. Altis will handle fetching and processing batches of events and delivering them to an action hook that can then handle delivering batches to different destinations.
For Segment we should use the Batch API: https://segment.com/docs/connections/sources/catalog/libraries/server/http-api/#batch
Altis analytics events have the user data etc embedded in them so in order to convert to Segment data it will need to be split in some ways.
Altis data structure: https://docs.altis-dxp.com/analytics/native/data-structure/#anatomy-of-an-event-record
Segment events to Altis event mapping:
endpoint.Attributes
andendpoint.Metrics
combined as the Segmenttraits
object valueevent_type
=pageView
event_type
that is notpageView
endpoint.Id
as the SegmentanonymousId
endpoint.User.UserId
as SegmentuserId
if availablemessageId
=endpoint.RequestId
orattributes.x-amz-request-id
if presenttimestamp
=event_timestamp
converted to ISO-8601,attributes.date
should workreceivedAt
=event_received
converted to ISO-8601groupId
if registered group attribute presentpage
in the same way as theproperties
property from the Page eventAcceptance criteria:
register_segment_group_field( string $field )
$field
is the name / path to a property in the Altis data e.g.endpoint.Attributes.company