Create Segment integration for S3 batch export

roborourke commented 2 years ago

Dependent on the issue to implement a new batch processing tool for analytics exports: https://github.com/humanmade/aws-analytics/issues/320

We want to create a Segment integration that we can provide out of the box in Altis, this can exist as a separate package for portability. Altis will handle fetching and processing batches of events and delivering them to an action hook that can then handle delivering batches to different destinations.

For Segment we should use the Batch API: https://segment.com/docs/connections/sources/catalog/libraries/server/http-api/#batch

Altis analytics events have the user data etc embedded in them so in order to convert to Segment data it will need to be split in some ways.

Altis data structure: https://docs.altis-dxp.com/analytics/native/data-structure/#anatomy-of-an-event-record

Segment events to Altis event mapping:

Identify
- Use endpoint.Attributes and endpoint.Metrics combined as the Segment traits object value
- See https://segment.com/docs/connections/spec/identify/, map any standardised traits
Page - Altis event_type = pageView
- See https://segment.com/docs/connections/spec/page/
Track - Any event_type that is not pageView
Group
- Check registered / filtered groups, map field to group ID
For all segment event types:
- Use endpoint.Id as the Segment anonymousId
- Use endpoint.User.UserId as Segment userId if available
- messageId = endpoint.RequestId or attributes.x-amz-request-id if present
- timestamp = event_timestamp converted to ISO-8601, attributes.date should work
- receivedAt = event_received converted to ISO-8601
- groupId if registered group attribute present
- Populate the context object https://segment.com/docs/connections/spec/common/#context:
  - Populate page in the same way as the properties property from the Page event

Acceptance criteria:

[x] Received events batches are processed as a Segment batch JSON object following the above guidelines
[x] Add filter for batch data prior to sending
[x] Add function for registering groups based on event attributes e.g.
- register_segment_group_field( string $field )
- $field is the name / path to a property in the Altis data e.g. endpoint.Attributes.company
[x] Batches POST body sizes are no larger than 500kb, chunk the request if needed
[x] Package is submitted to Packagist
[x] Package is required by the Analytics module

roborourke commented 2 years ago

Hey team! Please add your planning poker estimate with ZenHub @mikelittle @robindevitt @shadyvb

shadyvb commented 2 years ago

Bit of info on the foundational code that we'll be building on:

There is an action that we'll hook into, which delivers the bulk of data as a lump NDJSON string.
We'll need to adjust the format to match expected format from segment, and chunk it in the process, eg: check after each event if the limit has been reached to start a new chunk.

veselala commented 2 years ago

I will check for Segment account that we can test with.

yumito commented 2 years ago

@roborourke After a discussion on Shadi's last comment, where the team got more clarity, we have estimated the issue at 5

veselala commented 2 years ago

Please create an account on https://app.segment.com/signup/ to test out with.

shadyvb commented 2 years ago

Notes:

receivedAt is not useful unless we have both the client-side timestamp AND the server-side timestamp, and I'm not sure if we have the client side timestamp in the event entry, or which prop it is. I don't think this is too important anyways, Segment uses '*At' props to calculate the final timestamp to account for any differences.
re Traits vs Context, I think traits are meant to refer to user attributes, where Context is meant to describe the action AND the endpoint. So for example all the information on the device and page ( from endpoint.Attributes ) are meant for context rather than traits. Worth discussion expectations here.
I'm going on a limp discovering available attributes from existing data, so some attributes might not make sense ( been filtered by client ), we'll need to double check that during reviews.

re identify calls

An identify call is preferred to be sent on registration/login/profile-changes, sending Metrics.sessions and Metrics.pageViews might cause an update to the profile on each call. Is there another way to track this instead ?
This also complicates the plan to send batch requests in parallel, because one older event might update a user's pageViews AFTER a newer event is processed by Segment.
It's complicated to send identify calls for each Segment event, would substantially increase the amount of data to send, and would have less impact on the actual data store at Segment, should we invest the time/power to track sync status for different identities ? :thinking:

shadyvb commented 2 years ago

messageId = endpoint.RequestId or attributes.x-amz-request-id if present

I think we should be always using RequestId so the data would be consistent, or at least always using x-amz-request-id. Whichever is always present.

roborourke commented 2 years ago

receivedAt is not useful unless we have both the client-side timestamp AND the server-side timestamp

We have event_timestamp is the client side timestamp and arrival_timestamp would be used for recievedAt.

re Traits vs Context, I think traits are meant to refer to user attributes, where Context is meant to describe the action AND the endpoint. So for example all the information on the device and page ( from endpoint.Attributes ) are meant for context rather than traits. Worth discussion expectations here.

This depends on the endpoint being used I think. The documentation for the shared common properties and the individual endpoint event data should tell which data goes where in terms of context or traits.

I'm going on a limp discovering available attributes from existing data, so some attributes might not make sense ( been filtered by client ), we'll need to double check that during reviews.

Our default attributes can be determined from the analytics.js file and from the data structure docs. Assume anything not documented there is custom.

An identify call is preferred to be sent on registration/login/profile-changes, sending Metrics.sessions and Metrics.pageViews might cause an update to the profile on each call. Is there another way to track this instead

I imagine Segment can deduce those metrics from the available event data. For us it's a means of working out the current user's audience on the client side so I'd say we don't need to send it. Do we know for sure if a potential update to the profile on each call is a problem?

This also complicates the plan to send batch requests in parallel, because one older event might update a user's pageViews AFTER a newer event is processed by Segment.

If the events have timestamps and we process them in sequence this seems unlikely to me. I think we can leave those specific pieces of info out if we have to.

It's complicated to send identify calls for each Segment event, would substantially increase the amount of data to send, and would have less impact on the actual data store at Segment, should we invest the time/power to track sync status for different identities ?

Is it complicated technically or just seems like too much data?

I'd go with sending them each time for now, I know it doesn't seem ideal but we don't know yet how Segment will cope. We could maybe track by endpoint for changes but I'd be more concerned about the memory usage of doing that for up to 5mb worth of data at a time.

I think we should be always using RequestId so the data would be consistent, or at least always using x-amz-request-id. Whichever is always present.

I actually got this implementation wrong it turns out, endpoint.RequestId should only change when the endpoint data was last updated via the updateEndpoint() call. Can you check the data we have available for x-amz-request-id and whether that's consistently present? We should use that value ideally.

shadyvb commented 2 years ago

We have event_timestamp is the client side timestamp and arrival_timestamp would be used for recievedAt.

Done

re Traits vs Context, I think traits are meant to refer to user attributes, where Context is meant to describe the action AND the endpoint. So for example all the information on the device and page ( from endpoint.Attributes ) are meant for context rather than traits. Worth discussion expectations here.

This depends on the endpoint being used I think. The documentation for the shared common properties and the individual endpoint event data should tell which data goes where in terms of context or traits.

Rather that being contextual, their importance might differ based on the endpoint, but their primary concerns are the same across all API calls AFAIK.

An identify call is preferred to be sent on registration/login/profile-changes, sending Metrics.sessions and Metrics.pageViews might cause an update to the profile on each call. Is there another way to track this instead

I imagine Segment can deduce those metrics from the available event data. For us it's a means of working out the current user's audience on the client side so I'd say we don't need to send it. Do we know for sure if a potential update to the profile on each call is a problem?

Removed the two vars. And there is no docs discouraging the update profiles, but removing them should help that concern.

This also complicates the plan to send batch requests in parallel, because one older event might update a user's pageViews AFTER a newer event is processed by Segment.

If the events have timestamps and we process them in sequence this seems unlikely to me. I think we can leave those specific pieces of info out if we have to.

Yup, removed.

It's complicated to send identify calls for each Segment event, would substantially increase the amount of data to send, and would have less impact on the actual data store at Segment, should we invest the time/power to track sync status for different identities ?

Is it complicated technically or just seems like too much data? I'd go with sending them each time for now, I know it doesn't seem ideal but we don't know yet how Segment will cope. We could maybe track by endpoint for changes but I'd be more concerned about the memory usage of doing that for up to 5mb worth of data at a time.

Agreed, and yes it's about sending too much data, but should be fine to start with.

I think we should be always using RequestId so the data would be consistent, or at least always using x-amz-request-id. Whichever is always present.

I actually got this implementation wrong it turns out, endpoint.RequestId should only change when the endpoint data was last updated via the updateEndpoint() call. Can you check the data we have available for x-amz-request-id and whether that's consistently present? We should use that value ideally.

I think you've fixed that recently, so I've set messageId to endpoint.RequestId, and x-amz-request-id should already be tracked as part of the endpoint.Attributes prop already.

shadyvb commented 2 years ago

@roborourke the PR is ready for another review.

Shall we let the customer require the package manually ? or do we ship it with the module ? Just so I figure out how to live-test it ( prob on our test stack ).

shadyvb commented 2 years ago

Also let me know if the tests are okay at the current state.

roborourke commented 2 years ago

I think the tests are ok, not sure what you could do beyond comparing one or two sample events with how the batch request(s) will look without actually trying to send data. In theory we could mock the segment endpoint to check how we handle rejections / errors but strategically I think our time is better spent elsewhere right now.

I think the package should be included by default and switched on via config or if the required constants are defined.

shadyvb commented 2 years ago

Added a PR for inclusion of the package with the module, config values, and updated docs. See https://github.com/humanmade/altis-analytics/pull/245

mikelittle commented 2 years ago

PR #245 now approved and merged.

roborourke commented 2 years ago

Thanks @mikelittle and @shadyvb! Closing this out

mikelittle commented 2 years ago

@shadyvb I ticked the last two acceptance criteria.

shadyvb commented 2 years ago

Bringing this back for functional testing on the test stack, which should be possible now that the PR is merged.

roborourke commented 2 years ago

@shadyvb What do you to think to using a new issue #246 to capture the further QA and testing so we can close this out and capture the existing effort that has gone into it? Thinking it'll allow us to see the actual effort we've done rather than moving things along and dropping the estimates.

shadyvb commented 2 years ago

Yes, that's more sensible. 👍🏻

humanmade / altis-analytics

Create Segment integration for S3 batch export #220