Open viggin543 opened 3 years ago
The whole user recognition flow relies on the fact that we can pull a user by anonymous_id relatively quickly. Besides Redis, the following storages will work:
- S3. Unlike a regular file system, it can handle dirs with an unlimited number of files
- LevelDB (or similar)

Here are the caveats:
I believe that the best way to deal with that would be a) sending data to S3 with Jitsu, and b) writing a Spark job that processes the data and sends the updated events back to Jitsu. The real-time aspect of Jitsu will be lost, but it will still do a better job than writing an in-house merger.
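A minimal sketch of the merge logic such a Spark job would implement, written here in plain Python for clarity. The event and field names (`anonymous_id`, `user_id`, `page`) are illustrative assumptions, not Jitsu's actual schema:

```python
# Sketch of the batch merge the proposed Spark job would perform:
# join anonymous events with an identification mapping and fill in
# user_id retroactively. Field names are hypothetical.

def merge_identified_users(events, identifications):
    """events: list of dicts, each with 'anonymous_id' and maybe 'user_id'.
    identifications: dict mapping anonymous_id -> identified user_id."""
    merged = []
    for event in events:
        user_id = identifications.get(event["anonymous_id"])
        if user_id is not None and not event.get("user_id"):
            # Copy instead of mutating so the input stays untouched.
            event = {**event, "user_id": user_id}
        merged.append(event)
    return merged

events = [
    {"anonymous_id": "a1", "page": "/pricing"},
    {"anonymous_id": "a2", "page": "/docs"},
]
identifications = {"a1": "user-42"}
merged = merge_identified_users(events, identifications)
print(merged)
```

In Spark this would be a join between the raw events and the identification table, followed by writing the updated events back; the batch nature is exactly why the real-time aspect is lost.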
@vklimontovich With a DB like ClickHouse that does not support UPDATE statements, I see your point.
But with a DB like Redshift or Snowflake, this can be achieved by only storing unique anonymous user IDs in Redis, and on user recognition:
Or am I missing something?
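For warehouses that do support UPDATE, the idea above can be sketched minimally, with a plain Python set standing in for the Redis set of pending anonymous IDs. The table and column names are assumptions for illustration, not Jitsu's schema:

```python
# Redis stand-in: the set of anonymous ids that still have
# unidentified events in the warehouse. In reality this would be
# a Redis SET; all names below are illustrative assumptions.
pending_anonymous_ids = {"a1", "a2"}

def build_recognition_update(table="events"):
    # One parameterized UPDATE backfills user_id for every event
    # sharing the newly identified anonymous_id.
    return (
        f"UPDATE {table} "
        "SET user_id = %(user_id)s "
        "WHERE anonymous_id = %(anonymous_id)s AND user_id IS NULL"
    )

def on_user_recognized(anonymous_id, user_id, execute):
    # Only touch the warehouse if this anonymous id actually has
    # pending events; then drop it from the coordinating set.
    if anonymous_id in pending_anonymous_ids:
        execute(build_recognition_update(),
                {"user_id": user_id, "anonymous_id": anonymous_id})
        pending_anonymous_ids.discard(anonymous_id)

issued = []
on_user_recognized("a1", "user-42",
                   lambda sql, params: issued.append((sql, params)))
print(issued[0][0])
```

The point is that Redis then holds only IDs, not event payloads, so its memory footprint grows with unique anonymous users rather than with event volume.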
I believe that retroactive user recognition is inherently something that should not be real-time ("retroactive" -> after the fact).
That's my intuition.
It's time to revisit this given the new architecture of Jitsu Next and the fact that we use Mongo as the underlying storage for user recognition. Here's a preliminary design:
- The `user-recognition` function should send a specially formatted event that instructs Bulker to update certain fields in a table
- An `update` operation by certain criteria or by a set of messages

We should think through the design in detail; generally speaking, it's easy to do if the database allows pulling events by id.
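One possible shape for such a specially formatted event, sketched as JSON. The envelope fields (`type`, `table`, `where`, `set`) are assumptions for discussion, not Bulker's actual protocol:

```python
import json

def make_update_event(table, criteria, fields):
    """Hypothetical 'update' instruction for Bulker: set `fields` on
    rows in `table` matching `criteria`. The envelope shape here is
    an assumption, not Bulker's real wire format."""
    return json.dumps({
        "type": "update",
        "table": table,
        "where": criteria,   # e.g. match on anonymous_id
        "set": fields,       # fields to overwrite, e.g. user_id
    }, sort_keys=True)

evt = make_update_event("events",
                        {"anonymous_id": "a1"},
                        {"user_id": "user-42"})
print(evt)
```

Bulker would then translate this into whatever the destination supports: a real UPDATE for warehouses that have one, or a rewrite of the matching rows otherwise.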
Problem
According to the docs, the current implementation stores all anonymous events in Redis. This has two significant downsides:
As you pointed out in the documentation:
10M events/month is really not that much: it's ~231 events per minute, or ~3.9 events per second.
Large-scale tracking load is measured in thousands of events per second.
At that point Redis RAM consumption will explode.
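The rate arithmetic can be checked directly (assuming a 30-day month):

```python
# Convert 10M events/month into per-minute and per-second rates.
events_per_month = 10_000_000
minutes_per_month = 30 * 24 * 60          # 43,200 minutes in a 30-day month
per_minute = events_per_month / minutes_per_month
per_second = per_minute / 60
print(f"{per_minute:.0f} events/minute, {per_second:.1f} events/second")
# → 231 events/minute, 3.9 events/second
```

At "thousands of events per second" the load is roughly three orders of magnitude higher, which is why keeping every anonymous event resident in Redis stops being viable.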
Solution
Implement retroactive user recognition as a background task; there is no need to store those events in a hot cache like Redis. Instead, they can be stored in any cloud storage as files (under a path containing the user's anonymous id). This solves the RAM consumption problem. Once a user is identified,
a background process can update the records with the identified user.
In this scenario Redis will only contain the coordinating info (and can be updated asynchronously).
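A minimal sketch of that background flow, with a dict standing in for the cloud storage. The key layout (`anonymous/<anonymous_id>/<event>`) is an illustrative assumption; its point is that all of one user's anonymous events share a prefix, so they can be pulled with a single prefix listing (one LIST call on S3):

```python
# Cloud-storage stand-in: object key -> event payload.
# Keys embed the anonymous id so a user's events share one prefix.
storage = {
    "anonymous/a1/evt-001": {"anonymous_id": "a1", "page": "/pricing"},
    "anonymous/a1/evt-002": {"anonymous_id": "a1", "page": "/docs"},
    "anonymous/a2/evt-001": {"anonymous_id": "a2", "page": "/"},
}

def identify_in_background(anonymous_id, user_id):
    """Background task: list everything under this anonymous id's
    prefix and backfill user_id on each stored event."""
    prefix = f"anonymous/{anonymous_id}/"
    for key in [k for k in storage if k.startswith(prefix)]:
        storage[key] = {**storage[key], "user_id": user_id}

identify_in_background("a1", "user-42")
print(sorted(k for k, v in storage.items() if v.get("user_id") == "user-42"))
# → ['anonymous/a1/evt-001', 'anonymous/a1/evt-002']
```

Redis would only need to track which anonymous IDs are awaiting identification, so its memory use stays proportional to unique users, and the backfill can run asynchronously.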