rohancme commented 2 years ago

AC:

Document Choice and Reasons here

rohancme commented 2 years ago

What are the requirements for this DB?

I am building an activity/task retrieval system. I'll request an activity from the system with some criteria (duration, tags, etc) and the system returns a list of activities that match my criteria.

The database will contain a list of activities I want to complete. These activities could be standalone, or part of an activity group. An activity group is a collection of activities. Activities in a group could be ordered or unordered.

Activities

An activity will have:

tags: what kind of activity is this? It could have multiple tags. For example, a recipe could have tags "indoors", "cooking"
expected time/duration
[foreign key] associated story
- if there is an associated story then the task could have some order in that story
A title
A description

Stories

A story will have:

A title
A description
A list of activities (ordered/unordered)

Story types

There are two kinds of stories:

Unordered
Ordered

An unordered story is just a collection and adds no other value. Once I've completed all the activities in a particular story, that story is complete. I can think of some use cases for this, but it is gravy for what I want to build. Thus, unordered stories are not P0 but I think I can get them for free when I build support for ordered stories (which is a P0).

An ordered story contains activities in a specific order.
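
To make the requirements above concrete, here is a minimal sketch of the data as plain JavaScript objects. All field names (`storyId`, `storyPriority`, `status`, `ordered`) and the sample values are assumptions for illustration. The "next activity" rule (the incomplete activity with the lowest priority value) is my reading of how an ordered story would be consumed, not a settled design.

```javascript
// Hypothetical in-memory shapes; field names and values are illustrative only.
const story = { id: 's1', title: 'Learn to bake bread', description: 'From shopping to sharing', ordered: true };

const activities = [
  { id: 'a1', title: 'Buy flour', tags: ['outdoors'], duration: 30, storyId: 's1', storyPriority: 1, status: 'complete' },
  { id: 'a2', title: 'Bake a loaf', tags: ['indoors', 'cooking'], duration: 90, storyId: 's1', storyPriority: 2, status: 'incomplete' },
  { id: 'a3', title: 'Share with friends', tags: ['indoors'], duration: 60, storyId: 's1', storyPriority: 3, status: 'incomplete' },
];

// A story is complete once every activity in it is complete.
// A story with no activities is treated as not complete (a design choice).
function isStoryComplete(storyId, all) {
  const inStory = all.filter((a) => a.storyId === storyId);
  return inStory.length > 0 && inStory.every((a) => a.status === 'complete');
}

// For an ordered story, pick the incomplete activity with the lowest priority value.
function nextActivity(storyId, all) {
  const pending = all
    .filter((a) => a.storyId === storyId && a.status === 'incomplete')
    .sort((x, y) => x.storyPriority - y.storyPriority);
  return pending.length > 0 ? pending[0] : null;
}
```

An unordered story would only need `isStoryComplete`, which is why it likely comes for free once ordered stories work.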

Users

I will probably also model users in this system eventually, but that should only require an additional table for users and a foreign-key column to the user ID on the story, activity and tag tables. I won't cover that in this analysis, but I'll include it when I actually create my DB.

rohancme commented 2 years ago

Modeling (Relational)

Schema

If I were to model this as a relational DB, I would have 4 tables:

Activity (with a foreign key to the storyId and a priority column, representing that activity's priority in its story)
Tag
Activity-Tag table
Story

Use cases

Let's work through a few sample use cases with a relational DB.

1. Return activities that match the given criteria

E.g. Find activities that match the following criteria:

< 2 hours
one of the following tags: "outdoors", "cooking"

select distinct a.id from activity a
inner join
(select storyid, min(storypriority) as min_priority from activity
  where status = 'incomplete'
  group by storyid) as a2
  on a.storyid = a2.storyid
  and a.storypriority = a2.min_priority
inner join
  activitytag atag on a.id = atag.activityid
where atag.tagid in (2,3) -- tag ids for the requested tags
  and a.duration < 120 -- duration in minutes, i.e. < 2 hours

This seems pretty reasonable & straightforward enough to me
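
As a rough sketch of how this query could be issued from application code, the helper below builds a parameterized version of it. The function name, the `duration` column (in minutes) and the integer tag IDs are assumptions. The `{ text, values }` shape is the query-config form that node-postgres accepts, but nothing here depends on that library.

```javascript
// Builds a parameterized version of the activity lookup for a PostgreSQL client.
// `duration` (minutes) is an assumed column; other names come from the query above.
function buildActivityQuery({ maxDurationMinutes, tagIds }) {
  const text = `
    select distinct a.id
    from activity a
    inner join (
      select storyid, min(storypriority) as min_priority
      from activity
      where status = 'incomplete'
      group by storyid
    ) as a2
      on a.storyid = a2.storyid
     and a.storypriority = a2.min_priority
    inner join activitytag atag on a.id = atag.activityid
    where atag.tagid = any($1::int[])
      and a.duration < $2`;
  // $1 is the array of tag ids, $2 is the duration limit in minutes.
  return { text, values: [tagIds, maxDurationMinutes] };
}

const query = buildActivityQuery({ maxDurationMinutes: 120, tagIds: [2, 3] });
```

Note that the inner join on `storyid` drops activities whose `storyid` is null, so standalone activities would need a separate branch (for example a `union`) if they should be returned too.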

2. Reordering activities within a story

I don't actually need good reordering performance for P0 (i.e. a WhatsApp bot built exclusively for me) because I'll be updating the DB manually. However, I think it is worth thinking through because I might need to support this in the future.

Realistically, I'm not expecting > 100 incomplete activities per story. The most inefficient way to do this update is to always accept the entire list as input and individually update the priority for all records. I don't believe this will result in a noticeable performance hit, at least from a DB perspective. I should be able to test this out pretty easily with my local Postgres instance.

Update statement will look something like:

update activity as a set
  storypriority = a2.storypriority
from (values
(14,1),
(15,2),
(16,3),
(17,4),
(18,5),
(19,6),
...
) as a2(id, storypriority)
where a2.id = a.id;
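
A statement like this could be generated from the reordered list rather than written by hand. The sketch below is one way to do that, assuming integer activity IDs and priorities that are simply the 1-based position in the list. The function name is hypothetical.

```javascript
// Builds the bulk priority update from activity ids given in their new order.
// Assumes integer ids; priority is the 1-based position in `orderedIds`.
function buildReorderStatement(orderedIds) {
  if (orderedIds.length === 0) {
    throw new Error('orderedIds must contain at least one activity id');
  }
  const values = [];
  const rows = orderedIds.map((id, index) => {
    values.push(id, index + 1);
    const p = index * 2;
    // Explicit casts keep Postgres from inferring the VALUES columns as text.
    return `($${p + 1}::int, $${p + 2}::int)`;
  });
  const text =
    'update activity as a set storypriority = a2.storypriority ' +
    `from (values ${rows.join(', ')}) as a2(id, storypriority) ` +
    'where a2.id = a.id';
  return { text, values };
}

const statement = buildReorderStatement([14, 15, 16]);
```

Passing the IDs as parameters rather than concatenating them into the SQL keeps the statement safe if the list ever comes from user input.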

updating all activities in a story with 2 activities:

100 activities:

1000 activities:

I am definitely comfortable with this performance to start with

3. Insertions/Deletes

Inserting at the end or deleting activities in a story will be pretty straightforward so I'm not covering it in my analysis.

rohancme commented 2 years ago

Modeling (NoSQL) with Firestore

Schema

In Firestore, I'd have a structure that looks something like:

Users [Root Level Collection]
- Stories [Sub Collection inside a User document]
  - Activities [Sub Collection inside a Story document]
    - Tags [Nested List] (Subset of the values in the user's tag list)
    - UserId [Ref to a user in the root level users collection]
- Tags [Nested List inside a User document] (This is to maintain a list of user-specific tags)

This design makes it really easy to integrate with Firebase authentication and allow specific users access to only their own subcollections. However, I will need to create an index on the Activities sub collection because I need support for collection group queries to run queries on activities across stories.
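
To illustrate the nesting, here is a small sketch of the document paths this structure implies. The lowercase collection names (`users`, `stories`, `activities`) and the helper names are assumptions, not something I have created yet.

```javascript
// Hypothetical helpers spelling out the document paths implied by the structure above.
const paths = {
  user: (userId) => `users/${userId}`,
  story: (userId, storyId) => `users/${userId}/stories/${storyId}`,
  activity: (userId, storyId, activityId) =>
    `users/${userId}/stories/${storyId}/activities/${activityId}`,
};

// Example: the path of one activity inside one of a user's stories.
const examplePath = paths.activity('u1', 's1', 'a1');
```

Because every activity lives under a different story document, a query over one `activities` subcollection only sees a single story. A collection group query targets every collection with the ID `activities` regardless of its parent, which is what allows queries across stories.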

Use Cases

Walking through some use cases:

1. Return activities that match some criteria

E.g. Find incomplete activities that match the following criteria:

< 2 hours
one of the following tags: "outdoors", "cooking"

Firestore does not support the grouped aggregation this query needs (finding the activity with the lowest priority value per story), since there is no equivalent of GROUP BY with min().

To make this query work with Firestore I will have to:

1. Query for all activities that are "incomplete" (using the `==` operator on the `status` property) and match the tag (using the `array-contains-any` operator on the `tags` property)

```
// Create a collection group reference so the query spans activities across all stories
const activitiesRef = db.collectionGroup('activities')

// Create a query against the collection group
const queryRef = activitiesRef
  .where('status', '==', 'incomplete')
  .where('duration', '<', 120)
  .where('tags', 'array-contains-any', ['outdoors', 'cooking'])
  .where('userId', '==', '[userId]')
  .orderBy('duration') // with an inequality filter, the first orderBy must be on that same field
  .orderBy('storyId')
  .orderBy('storyPriority')
```

2. Run aggregation logic client-side to find all matching activities that are at the lowest available priority per story
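
The client-side step could look roughly like the sketch below. It works on plain objects (for example, the result of calling `doc.data()` on each document in the query snapshot). The function name and the decision to always keep activities without a `storyId` are assumptions.

```javascript
// Keeps, for each story, only the fetched activities with the lowest storyPriority.
// Activities without a storyId are treated as standalone and always kept (an assumption).
function lowestPriorityPerStory(activities) {
  const minByStory = new Map();
  for (const a of activities) {
    if (a.storyId == null) continue;
    const current = minByStory.get(a.storyId);
    if (current === undefined || a.storyPriority < current) {
      minByStory.set(a.storyId, a.storyPriority);
    }
  }
  return activities.filter(
    (a) => a.storyId == null || a.storyPriority === minByStory.get(a.storyId)
  );
}

const fetched = [
  { id: 'a2', storyId: 's1', storyPriority: 2 },
  { id: 'a3', storyId: 's1', storyPriority: 3 },
  { id: 'b1', storyId: 's2', storyPriority: 1 },
  { id: 'c1', storyId: null, storyPriority: null },
];
const eligible = lowestPriorityPerStory(fetched);
```

One caveat: the minimum here is taken over the activities the query returned, i.e. those that already matched the tag and duration filters. That is not quite the same as the relational query, which finds the lowest-priority incomplete activity in each story first and only then applies the tag filter.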

2. Reordering activities within a story

In Firestore, I would use a batch write operation to update all the activities within a story with the new priorities. Each individual operation counts against the write limit.
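
A minimal sketch of that batch write, assuming the namespaced API style used in the query earlier, where `db.batch()` returns a write batch (the modular web SDK uses `writeBatch(db)` instead). The function name and the `storyPriority` field are assumptions.

```javascript
// `db` is assumed to be a Firestore instance exposing batch(); `activityRefs` is an
// array of DocumentReferences already sorted into the desired order.
function reorderStory(db, activityRefs) {
  const batch = db.batch();
  activityRefs.forEach((ref, index) => {
    // Priority is the 1-based position in the new order.
    batch.update(ref, { storyPriority: index + 1 });
  });
  // All updates in the batch are applied atomically; returns a promise.
  return batch.commit();
}
```

With fewer than 100 activities per story a single batch should be enough. A much larger story might need to be split across several batches.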

Limitations
I will need to start paginating and caching data locally if there is a large number of activities that match the activities query. This is because I need to aggregate locally, i.e. I cannot filter out activities within stories that have a higher priority than the lowest valid priority. However, I'm pretty confident I'm not going to hit this limit any time soon.
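
If it ever comes to that, the pagination could be sketched as below. It assumes a query object with `limit()`, `startAfter()` and `get()`, as on a Firestore query, and a snapshot exposing a `docs` array. The function name and page size are hypothetical.

```javascript
// Collects every document for a query in pages of `pageSize`.
// `query` is assumed to expose limit(), startAfter() and get() like a Firestore Query.
async function fetchAllPages(query, pageSize = 100) {
  const docs = [];
  let cursor = null;
  while (true) {
    const page = cursor
      ? query.startAfter(cursor).limit(pageSize)
      : query.limit(pageSize);
    const snapshot = await page.get();
    docs.push(...snapshot.docs);
    // A short (or empty) page means there is nothing left to fetch.
    if (snapshot.docs.length < pageSize) break;
    // Use the last document of this page as the cursor for the next one.
    cursor = snapshot.docs[snapshot.docs.length - 1];
  }
  return docs;
}
```

The collected documents can then go through the client-side aggregation in one pass, since the per-story minimum is only correct once all matching activities are available.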

Advantages

Integrates with Firebase Authentication, which means building iOS/web clients becomes a lot simpler.

rohancme commented 2 years ago

I have 3 viable options here:

Cloud SQL (managed relational DBs)
Spanner (managed, horizontally scalable relational DB)
Firestore (NoSQL, massively scalable, sort of comparable to DynamoDB)

rohancme commented 2 years ago

Pricing for a single-region, smallest non-shared instance and 10 GB of storage:

Cloud SQL - $50/month
Spanner - $70/month

Firestore is pay as you go, and is pretty much free until I hit reasonably high scale.

rohancme commented 2 years ago

Decision

I'm going to move forward with Firestore because:

The integration with Firebase authentication seems really powerful
I am not seeing any significant advantages of a SQL DB over this for my use case (I would definitely use SQL if I had to run a ton of server-side aggregation queries)
It's free to start + pay as you go
Fully managed + autoscales
I get to learn something new!

Concerns with using Firestore:

It's GCP. They might kill support for this at any time

rohancme / freya

Choose a Google DB technology #1
