mozilla-services / services-engineering

Services engineering core repo - Used for issues/docs/etc that don't obviously belong in another repo.

Create AWS query for unbridled UAIDs #47

Closed jrconlin closed 4 years ago

jrconlin commented 4 years ago

Due to https://bugzilla.mozilla.org/show_bug.cgi?id=1617136, we wanted to find out how much costs might increase. This requires finding how many UAIDs do not have channels associated with them.

Need to compose an AWS DynamoDB query to find these unbridled UAID records, so the count can be compared to existing data.

jrconlin commented 4 years ago

This presumes using the AWS CLI v2.

draft 1:

TABLE=$1
aws dynamodb scan --table-name "$TABLE" --filter-expression 'chidmessageid = :chid and attribute_not_exists(chids)' --select COUNT --expression-attribute-values='{":chid":{"S":" "}}'

The TABLE name can be found via the aws dynamodb list-tables command.

jrconlin commented 4 years ago

Xref: 🐞 https://bugzilla.mozilla.org/show_bug.cgi?id=1630338

jrconlin commented 4 years ago

From above bug:

aws dynamodb scan --table-name $TABLE --filter-expression 'chidmessageid = :chid and attribute_not_exists(chids)' --select COUNT --expression-attribute-values='{":chid":{"S":" "}}'
{
    "Count": 5713425,
    "ScannedCount": 571685091,
    "ConsumedCapacity": null
}

So of 571,685,091 records scanned, 5,713,425 contain no ChannelIDs (meaning no subscriptions are present), or approximately 1% of our user base.
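
The ratio of the two counts in the scan output can be checked with a quick calculation. As a fraction it is about 0.01, which is about 1% when expressed as a percentage. A minimal sketch, assuming awk is available; the pct helper is hypothetical and not part of the original query:

```shell
# Hypothetical helper (not from the original thread):
# print part/total as a percentage with two decimal places.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f%%\n", part / total * 100 }'
}

# Count and ScannedCount taken from the scan output above.
pct 5713425 571685091   # prints 1.00%
```

This only restates the arithmetic. It says nothing about UAIDs that have a router entry but no row in the scanned table.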

pjenvey commented 4 years ago

I'm not sure this query covers the users (I'll call them "broadcast only") in question.

"Broadcast only" users would have only HELLO'd for the sake of receiving broadcasts, but never subscribed to any Push channels. In that case they wouldn't have any entry at all in the message table, just a router entry.

jrconlin commented 4 years ago

Fair points. The problem is that finding the actual number of clients would involve a table scan and an iteration of queries, which I think we should never, ever do. Thanks for the discussion we had about this, where we determined that we would not get any additional writes, but would have more junk data in the router table. We should add a TTL to webpush to handle that, and also consider purging older records from the router db.