Insufficient database design

longnd commented 5 months ago

Issue

The current database design has some limitations:

The keywords are shared by all users, meaning that when a user logs in, they see all keywords uploaded by other users. This makes it difficult for each user to view their own reports. It would make more sense for each user's data to be accessible only to them.
Search results are overwritten if a keyword is uploaded again and is shared by all users. This is not appropriate because each time we search, even for the same keyword, the results may be different. For example, searching "weather forecast" may yield different results each day. Additionally, when the same keyword is searched by different users in different geographic locations (Vietnam and Thailand), the results will differ. Therefore, it would be more logical to keep the search results and create new ones each time a user searches. Furthermore, the search results should be scoped by user.

Please note that we do not list all of these details in the code challenge's requirements, as this allows us to evaluate candidates' analysis skills and how they design a system that makes the most sense.

Expected

Keywords and search results are scoped by user.

tanaponpiti commented 5 months ago

The keywords are shared by all users, meaning that when a user logs in, they see all keywords uploaded by other users. This makes it difficult for each user to view their own reports. It would make more sense for each user's data to be accessible only to them.

This is one of the points I put a lot of thought into. Normally, when I implement an application, I want to understand how users will use it and for what purpose. I understand that it is used to perform large-scale Google search scraping, but I did not understand the purpose or the value of the data users get from doing that. Since I did not know the value behind this search data, I did not know whether there was any point in having a data boundary for each user. My view was that all of this search data is public to begin with, so should I really want that boundary in my application, and what would users gain from it? So I thought about it the opposite way, "What would they gain from having no boundary at all?", and I found many benefits to not having a boundary.

We can share search results between users, so they can look up an existing result without having to re-scrape it.
We can reduce the cost of scraping the same keyword over and over again.
If two users upload the same keyword concurrently, we can scrape it only once, without duplication.
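The last point, scraping a keyword only once when it is uploaded concurrently, could look roughly like the sketch below. It is a minimal in-process illustration, not the code from the repository: `Deduper`, `Do`, and `slowScrape` are hypothetical names, and the scrape itself is simulated with a short sleep.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight scrape so that later callers can wait for it.
type call struct {
	done   chan struct{}
	result string
}

// Deduper collapses concurrent scrapes of the same keyword into one.
// Hypothetical type, not taken from the repository.
type Deduper struct {
	mu       sync.Mutex
	inflight map[string]*call
}

func NewDeduper() *Deduper {
	return &Deduper{inflight: make(map[string]*call)}
}

// Do runs scrape for keyword unless an identical scrape is already running,
// in which case it waits for that scrape and returns its result.
func (d *Deduper) Do(keyword string, scrape func(string) string) string {
	d.mu.Lock()
	if c, ok := d.inflight[keyword]; ok {
		d.mu.Unlock()
		<-c.done // wait for the scrape that is already in flight
		return c.result
	}
	c := &call{done: make(chan struct{})}
	d.inflight[keyword] = c
	d.mu.Unlock()

	c.result = scrape(keyword)

	d.mu.Lock()
	delete(d.inflight, keyword) // later calls trigger a fresh scrape
	d.mu.Unlock()
	close(c.done)
	return c.result
}

func main() {
	d := NewDeduper()
	var scrapes int32
	slowScrape := func(k string) string {
		atomic.AddInt32(&scrapes, 1)
		time.Sleep(50 * time.Millisecond) // stand-in for a real scrape
		return "results for " + k
	}

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.Do("golang", slowScrape)
		}()
	}
	wg.Wait()
	fmt.Println("scrape calls:", atomic.LoadInt32(&scrapes))
}
```

This only deduplicates requests that overlap in time within a single process; sequential uploads of the same keyword still scrape again. For real use, the `golang.org/x/sync/singleflight` package provides the same pattern.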

This was just my thinking at the time, but I designed the database based on these ideas. However, you do have a valid point that the report is confusing when it includes data shared from other users. If we expect data to be shown separately for each user, then I have to adjust the database structure to have a user ID along with each model and then add it as a new query criterion.
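As a rough illustration of that adjustment, the sketch below adds an owner to a keyword model and filters by it. The `Keyword` struct, its `UserID` field, and `KeywordsForUser` are assumptions made for this example; the actual models in the repository may look different, and a real implementation would apply the filter in the database query rather than in memory.

```go
package main

import "fmt"

// Keyword is a hypothetical model with an owner; the real model in the
// repository may have different fields.
type Keyword struct {
	ID     uint
	UserID uint // assumed new column: the user who uploaded the keyword
	Text   string
}

// KeywordsForUser returns only the keywords owned by userID. It is the
// in-memory equivalent of adding "WHERE user_id = ?" to a query.
func KeywordsForUser(all []Keyword, userID uint) []Keyword {
	var out []Keyword
	for _, k := range all {
		if k.UserID == userID {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	all := []Keyword{
		{ID: 1, UserID: 10, Text: "weather forecast"},
		{ID: 2, UserID: 20, Text: "weather forecast"},
		{ID: 3, UserID: 10, Text: "golang"},
	}
	// The same keyword text can exist once per user without clashing.
	for _, k := range KeywordsForUser(all, 10) {
		fmt.Println(k.ID, k.Text)
	}
}
```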

Search results are overwritten if a keyword is uploaded again and is shared by all users. This is not appropriate because each time we search, even for the same keyword, the results may be different. For example, searching "weather forecast" may yield different results each day. Additionally, when the same keyword is searched by different users in different geographic locations (Vietnam and Thailand), the results will differ. Therefore, it would be more logical to keep the search results and create new ones each time a user searches. Furthermore, the search results should be scoped by user.

Search results are actually not overwritten but stored historically in the database, as defined in this model.

https://github.com/tanaponpiti/google-search/blob/07f6fabc905b388f7d3859a950c95d73a4edc70f/api/model/keyword_model.go#L7-L13

I have designed it so that one keyword can have multiple search results. These search results also show up in the API response, sorted from newest to oldest. So, every time a user performs a search, it creates a new record of this

https://github.com/tanaponpiti/google-search/blob/07f6fabc905b388f7d3859a950c95d73a4edc70f/api/model/search_result_model.go#L5-L16

which is linked to the keyword data by the KeywordID field. Sadly, I didn't have enough time to make it visible in the frontend.

Still, I think differently about different users in different geographic locations: even if users use this application from different countries, the results will come from the same origin, which is wherever Puppeteer runs, currently Singapore for Google Cloud Run. The results will differ from time to time but not from location to location.

longnd commented 5 months ago

Search results are actually not overwritten but stored historically in the database, as defined in this model. I have designed it so that one keyword can have multiple search results. These search results also show up in the API response, sorted from newest to oldest

You're right. I re-checked the implementation for that part. I also see multiple search results are included in the response for a keyword, but only the latest one is shown to the user.

The results will differ from time to time but not from location to location.

You're also right on this: given that the server handles the search request, the results are tied to the location of the server.

Given that the search result is different each time we run the search query, even for the same keyword, and the user should be able to see all of their past reports (e.g. they want to keep a history of top results for "Golang" keyword each month), the keyword & their search history should be scoped by user.

I have to adjust the database structure to have a user ID along with each model and then add it as a new query criterion

I understand your approach, so no need to make changes for this :)

tanaponpiti / google-search

Insufficient database design #18
