rohancme / freya

A whatsapp chat bot to plan my days
GNU General Public License v3.0
0 stars 0 forks source link

Choose a Google DB technology #1

Open rohancme opened 2 years ago

rohancme commented 2 years ago

AC:

rohancme commented 2 years ago

What are the requirements for this DB?

I am building an activity/task retrieval system. I'll request an activity from the system with some criteria (duration, tags, etc) and the system returns a list of activities that match my criteria.

The database will contain a list of activities I want to complete. These activities could be standalone, or part of an activity group. An activity group is a collection of activities. Activities in a group could be ordered or unordered.

Activities

An activity will have:

  1. tags: what kind of activity is this? It could have multiple tags. For example, a recipe could have tags "indoors", "cooking"
  2. expected time/duration
  3. [foreign key] associated story
    • if there is an associated story then the task could have some order in that story
  4. A title
  5. A description

Stories

A story will have:

  1. A title
  2. A description
  3. A list of activities (ordered/unordered)

Story types

There are two kinds of stories:

  1. Unordered
  2. Ordered

An unordered story is just a collection and adds no other value. Once I've completed all the activities in a particular story, that story is complete. I can think of some use cases for this, but it is gravy for what I want to build. Thus, unordered stories are not P0 but I think I can get them for free when I build support for ordered stories (which is a P0).

An ordered story contains activities in a specific order.

Users

I will also probably model users in this system eventually, but that should only require an additional table for users and a column on the story, activity and tag tables to add a foreign key to user ID. I won't cover that in this analysis, but I'll include that when I actually create my DB

rohancme commented 2 years ago

Modeling (Relational)

Schema

If I were to model this as a relational DB, I would have 4 tables:

  1. Activity (with a foreign key to the storyId and a priority column, representing that activity's priority in its story)
  2. Tag
  3. Activity-Tag table
  4. Story

Use cases

Lets work through a few sample use cases with a relational DB.

1. Return activities that match the given criteria

E.g. Find activities that match the following criteria:

select distinct(a.id) from activity a 
inner join
(select storyid, min(storypriority) as min_priority from activity 
  where status = 'incomplete'
  group by storyid) as a2
  on a.storyid = a2.storyid
  and a.storypriority = a2.min_priority
 inner join
   activitytag atag on a.id = atag.activityid
 where atag.tagid in (2,3)

This seems pretty reasonable & straightforward enough to me

2. Reordering activities within a story

I don't actually need good performance for reordering for P0 (i.e. a whatsapp bot that is built exclusively for me) because I'll be updating the DB manually. However, I think it is worth thinking through because I might need to support this in the future.

Realistically, I'm not expecting > 100 incomplete activities per story. The most inefficient way to do this update is to always accept the entire list as input and individually update the priority for all records. I actually do not believe this will result in a noticeable performance hit, at least from a DB perspective. I should be able test this out pretty easily with my local postgres instance.

Update statement will look something like:

update activity as a set
  storypriority = a2.storypriority
from (values
(14,1),
(15,2),
(16,3),
(17,4),
(18,5),
(19,6),
...
) as a2(id, storypriority)
where a2.id = a.id;

updating all activities in a story with 2 activities:

Screen Shot 2022-08-30 at 4 15 10 PM

100 activities:

Screen Shot 2022-08-30 at 4 21 52 PM

1000 activities:

Screen Shot 2022-08-30 at 4 26 45 PM

I am definitely comfortable with this performance to start with

3. Insertions/Deletes

Inserting at the end or deleting activities in a story will be pretty straightforward so I'm not covering it in my analysis.

rohancme commented 2 years ago

Modeling (NoSQL) with Firestore

Schema

In Firestore, I'd have a structure that looks something like:

This design makes it really easy to integrate with Firebase authentication and allow specific users access to only their own subcollections. However, I will need to create an index on the Activities sub collection because I need support for collection group queries to run queries on activities across stories.

Use Cases

Walking through some use cases:

1. Return activities that match some criteria

E.g. Find incomplete activities that match the following criteria:

Firestore does not support aggregation queries (find the activities with the lowest priority value per story).

To make this query work with Firestore I will have to:

  1. Query for all activities that are "incomplete" (using == query function on the /activities property) and match the tag (usingarray-contains-any` query function on the tags property)
    
    // Create a reference to the activities subcollection
    const activitiesRef = db.collection('activities')

// Create a query against the subcollection const queryRef = activitiesRef .where('status', '==', 'incomplete') .where('duration', '<', 120) .where('tags', 'array-contains-any', ["outdoors", "cooking"]) .where('userId', '==', '[userId]') .orderBy('duration') // if you use a where clause the first orderBy must use that same property .orderBy('storyId') .orderBy('storyPriority')



2. Run aggregation logic client-side to find all matching activities that are at the lowest available priority per story

### 2. Reordering activities within a story

In Firestore, I would use a batch write operation to update all the activities within a story with the new priorities. Each individual operation counts against the write limit.

## Limitations:
I will need to start paginating and caching data locally if there is a large number of activities that match the activities query. This is because I need to aggregate locally, i.e. I cannot filter out activities within stories that have a higher priority than the lowest valid priority. However, I'm pretty confident I'm not going to hit this limit any time soon.

## Advantages:
Integrates with Firebase Authentication, which means building ios/web clients becomes a lot simpler.
rohancme commented 2 years ago

I have 3 viable options here:

  1. Cloud SQL (managed relational DBs)
  2. Spanner (managed, horizontally scalable relational DB)
  3. Firestore (nosql, massively scalable, sort of comparable to DynamoDB)
rohancme commented 2 years ago

Pricing for single-region, smallest non-shared instance and 10gb of storage:

  1. Cloud SQL - $50/month
  2. Spanner - $70/month

Firestore is pay as you go, and is pretty much free until I hit reasonably high scale.

rohancme commented 2 years ago

Decision

I'm going to move forward with Firestore because:

  1. The integration with Firebase authentication seems really powerful
  2. I am not seeing any significant advantages of a SQL DB over this for my use case (I would definitely use SQL if I had to run a ton of server-side aggregation queries)
  3. It's free to start + pay as you go
  4. Fully managed + autoscales
  5. I get to learn something new!

Concerns with using firestore:

  1. It's GCP. They might kill support for this at any time