openpredictionmarkets / socialpredict

Easy to Deploy Prediction Market Platform
https://github.com/openpredictionmarkets/socialpredict
MIT License

Data Repository #343

Open pwdel opened 1 month ago

pwdel commented 1 month ago

From: https://github.com/openpredictionmarkets/socialpredict/pull/299#pullrequestreview-2319522742

Do we need the repository package/database abstraction? If so, what benefits does it give us? Should the repository package be a separate package or just built on top of the models that are stored in our database type?

pwdel commented 1 month ago

Ok, my thoughts:

Performance Bottlenecks and Database I/O

In systems that process large volumes of data, such as bets in a prediction market, the primary bottleneck is often database I/O rather than CPU usage. For example, if we return an entire table of rows, say a million bets, just to calculate a simple statistic in Golang, we put a significant and unnecessary load on I/O. Instead, leveraging the database to perform a GROUP BY operation and return only the aggregated result (e.g., a count) dramatically reduces the data being transferred.
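To make the tradeoff concrete, here is a minimal sketch (the `Bet` struct and field names are hypothetical, not our actual model) of the in-application aggregation we want to avoid. The SQL statement in the comment replaces the entire loop and transfers one row per market instead of every bet:

```go
package main

import "fmt"

// Bet is a hypothetical, minimal stand-in for a bets model.
type Bet struct {
	Username string
	MarketID int64
	Amount   int64
}

// countBetsPerMarket does in Go what the database could do with:
//
//	SELECT market_id, COUNT(*) AS n FROM bets GROUP BY market_id;
//
// The Go version requires transferring every row over the wire first;
// the SQL version transfers only one row per market.
func countBetsPerMarket(bets []Bet) map[int64]int {
	counts := make(map[int64]int)
	for _, b := range bets {
		counts[b.MarketID]++
	}
	return counts
}

func main() {
	bets := []Bet{
		{"alice", 1, 10},
		{"bob", 1, 20},
		{"alice", 2, 5},
	}
	fmt.Println(countBetsPerMarket(bets)) // map[1:2 2:1]
}
```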

This approach lets the database, which is written in C and highly optimized for such operations, handle tasks like counting and grouping, reducing the overhead on both I/O and CPU in the Golang application. By deliberately designing our system to use the database for what it is good at, we minimize inefficient Golang operations such as iterating over large datasets in for loops, enhancing overall performance.

The overall objective of the project is to allow individuals or small organizations to run prediction markets, ideally supporting up to 1,000 users on a hosting budget of roughly $100/year, or as close to that as we can get. @j4qfrost brought up the notion of injecting C math libraries as a way to deal with bottlenecks. What I am proposing is more of a "data engineering" approach: write certain types of queries for particular calculations in ways that we know will reduce the load on the database.

Whereas injecting C math libraries would address CPU performance in the application, it would not address I/O concerns. I/O usage will creep upward over time if we don't control for it at some point.

ORM Raw Concerns

ORM and Raw SQL Queries: While GORM provides a convenient abstraction, GROUP BY does not seem to be well supported by the ORM's methods; for example, Group() only accepts a single string input, whereas we actually need to group by two fields to count the total number of first-time bets across all markets. In cases like this, executing raw SQL is the only way to perform the operation.

While this exposes us to raw SQL, which is undesirable, we could make exceptions where GORM lacks library functionality, while also adopting a policy of never allowing writes via Raw() and keeping the number of raw queries to a minimum: "use the library except in these circumstances."
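As a sketch of what such an exception could look like: the two-field GROUP BY for first-time bets can be written as one read-only raw query (table and column names below are assumptions, not our actual schema), which with GORM would run as something like `db.Raw(firstTimeBetsSQL).Scan(&count)`. The pure-Go function mirrors the query's logic only so the intent is easy to verify:

```go
package main

import "fmt"

// firstTimeBetsSQL counts each user's first bet on each market, i.e.
// the number of distinct (username, market_id) pairs. Table and column
// names are hypothetical. With GORM this would be executed as a
// read-only raw query, e.g. db.Raw(firstTimeBetsSQL).Scan(&count).
const firstTimeBetsSQL = `
SELECT COUNT(*) FROM (
    SELECT username, market_id
    FROM bets
    GROUP BY username, market_id
) AS first_bets;`

// Bet is a hypothetical, minimal stand-in for a bets model.
type Bet struct {
	Username string
	MarketID int64
}

// countFirstTimeBets is the in-Go equivalent of firstTimeBetsSQL,
// shown only to document what the raw query computes.
func countFirstTimeBets(bets []Bet) int {
	type pair struct {
		Username string
		MarketID int64
	}
	seen := make(map[pair]struct{})
	for _, b := range bets {
		seen[pair{b.Username, b.MarketID}] = struct{}{}
	}
	return len(seen)
}

func main() {
	bets := []Bet{
		{"alice", 1}, {"alice", 1}, {"alice", 2}, {"bob", 1},
	}
	fmt.Println(countFirstTimeBets(bets)) // 3
}
```

The point of the policy is that the SQL string lives in one reviewed place, is read-only, and is documented by a plain-Go equivalent, rather than raw queries appearing ad hoc throughout handlers.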

Why Not Couple Queries With Models

My thought is that we need a way of conceptually keeping our queries, and how we perform them, separate from the models themselves. This highlights that queries can be costly, and it restricts the number of queries we put in place.

Now, obviously, having a Raw() method may seem to go against this, but as mentioned in the section above, it is there for exceptions, not as a rule. The intention is not to fully restrict ourselves to what one library supports, but rather to restrict ourselves to a set of functions that we pay attention to and maintain, to try to push the app in a performant direction from the start.
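A minimal sketch of what keeping queries separate from models could look like: a small repository interface naming the handful of queries we deliberately maintain, with implementations free to use GORM, or raw SQL where GORM falls short. The interface and in-memory fake below are hypothetical illustrations, not the actual codebase:

```go
package main

import "fmt"

// BetsRepository is a hypothetical example of the repository idea: a
// small, deliberately curated set of query functions, kept separate
// from the model structs so every new query is visible and reviewable.
type BetsRepository interface {
	CountBetsByMarket(marketID int64) (int64, error)
	CountFirstTimeBets() (int64, error)
}

// inMemoryBetsRepo is a toy implementation for illustration; a real
// implementation would wrap *gorm.DB and push both counts down to the
// database as GROUP BY / COUNT queries.
type inMemoryBetsRepo struct {
	betsByMarket map[int64]int64
	firstTime    int64
}

func (r *inMemoryBetsRepo) CountBetsByMarket(marketID int64) (int64, error) {
	return r.betsByMarket[marketID], nil
}

func (r *inMemoryBetsRepo) CountFirstTimeBets() (int64, error) {
	return r.firstTime, nil
}

func main() {
	var repo BetsRepository = &inMemoryBetsRepo{
		betsByMarket: map[int64]int64{1: 2, 2: 1},
		firstTime:    3,
	}
	n, _ := repo.CountBetsByMarket(1)
	fmt.Println(n) // 2
}
```

A side benefit of the interface is testability: handlers depend on the interface, so tests can use an in-memory fake without a database.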

Maintainability

At this point I don't think there are a huge number of ways we interact with the database. If anything, having a repo model will control the number of new types of queries we write and the number of ways we solve problems. While it may slow down certain buildouts and features, we could hypothetically allow regular GORM usage during the buildout of new features, while requiring that a repo model be put in place after the new feature is built and completed.

pwdel commented 1 month ago

@j4qfrost and @ajlacey

The above conversation is attempting to address only the question of "should we do this," related to my current MR / statsreporting. Everything else, in terms of reducing the code size and all the other comments you both made, I intend to fix per what you suggested. I'm just trying to offer an explanation as to why a data repo vs. coupling functions with the models.

pwdel commented 1 month ago

I set up this MR to help discuss the topic above.

https://github.com/openpredictionmarkets/socialpredict/pull/352
