Closed ncloudioj closed 7 years ago
FYI @6a68 @emtwo
I do think UT is using JSON fields in general, but I'm not sure if in particular they're using it for payload
and variants
. I would still guess that the JSON fields are harder/slower to query and less ideal. A flat schema with an experiment_id
type field would be my preference.
After messing around the JSON support in Redshift, here is what I've found so far.
char
/varchar
/text
, and store a JSON dump (could be integer, string, array, null, or object) in thereNote that querying JSON columns may be slow, quote from the Redshift
We recommend using JSON sparingly. JSON is not a good choice for storing larger datasets because, by storing disparate data in a single column, JSON does not leverage Amazon Redshift’s column store architecture.
@emtwo So your guess is correct :). However, I think we could consider using JSON field in the following cases:
@ncloudioj @emtwo How about this alternative approach to the testpilotttest
schema and associated Test Pilot experiment storage in ping centre Redshift:
docs/metrics.md
in the experiment's github repo. What if we use that for each experiment? Here's a simple example, and here's one with some extra keys.testpilottest
schema would then just need to be flat and very permissive, since each experiment will generally have its own set of extra keys.testpilottest
pings based on the addon ID (so, that'll be a required field in the schema), and shuffle data to per-experiment Redshift tables.Under this proposal, experiment author steps for Ping Centre integration would be:
testpilot-metrics
repo in their addonDoes this seem reasonable?
This seems reasonable to me. CC @tspurway to comment on the ETL portion. (though I believe he will be away until Wednesday)
Minor update to my last comment: it turns out we don't need a separate testpilottest
schema, after making some changes to the testpilot
schema in PR #36. The ETL step could split all Test Pilot pings on the addon_id
field, both from the Test Pilot addon and the experiments.
Closing this issue as @6a68 and @emtwo have sorted out the above game plan.
Thank you guys! Let's do this.
In the
testpilottest.js
, both "payload" and "variants" are defined as object fields in the schema. Unfortunately, Redshift doesn't support the nested object in the schema. To work around this limitation, we could consider using following options:experiment_id
to track the A/B test for each Testpilot test, this is already being used by "Activity Stream", and could replace the "variants" field