openpredictionmarkets / socialpredict

Easy to Deploy Prediction Market Platform
https://github.com/openpredictionmarkets/socialpredict
MIT License

Data Repository #343

Open pwdel opened 1 month ago

pwdel commented 1 month ago

From: https://github.com/openpredictionmarkets/socialpredict/pull/299#pullrequestreview-2319522742

Do we need the repository package/database abstraction? If so, what benefits does it give us? Should the repository package be a separate package or just built on top of the models that are stored in our database type?

pwdel commented 1 month ago

Ok, my thoughts:

Performance Bottlenecks and Database I/O

In systems that process large volumes of data, such as bets in a prediction market, the primary bottleneck is often database I/O rather than CPU usage. For example, if we return an entire table of rows, say a million bets, just to calculate a simple statistic in Golang, we put a significant and unnecessary load on I/O. Instead, leveraging the database to perform a GROUP BY operation and return only the aggregated result (e.g., a count) dramatically reduces the data being transferred.
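To make the tradeoff concrete, here is a minimal sketch (the `Bet` struct and field names are hypothetical, not our actual model) of the in-application aggregation we want to avoid. The SQL statement in the comment replaces the entire loop and transfers one row per market instead of every bet:

```go
package main

import "fmt"

// Bet is a hypothetical, minimal stand-in for a bets model.
type Bet struct {
	Username string
	MarketID int64
	Amount   int64
}

// countBetsPerMarket does in Go what the database could do with:
//
//	SELECT market_id, COUNT(*) AS n FROM bets GROUP BY market_id;
//
// The Go version requires transferring every row over the wire first;
// the SQL version transfers only one row per market.
func countBetsPerMarket(bets []Bet) map[int64]int {
	counts := make(map[int64]int)
	for _, b := range bets {
		counts[b.MarketID]++
	}
	return counts
}

func main() {
	bets := []Bet{
		{"alice", 1, 10},
		{"bob", 1, 20},
		{"alice", 2, 5},
	}
	fmt.Println(countBetsPerMarket(bets)) // map[1:2 2:1]
}
```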

This approach lets the database, which is written in C and highly optimized for such operations, handle tasks like counting and grouping, reducing the overhead on both I/O and CPU in the Golang application. By deliberately designing our system to use the database for what it is good at, we minimize inefficient Golang operations such as iterating over large datasets in for loops, enhancing overall performance.

The overall objective of the project is to allow individuals or small organizations to run prediction markets, ideally supporting up to 1,000 users on a hosting budget of roughly $100/year, or as close to that as we can get. @j4qfrost brought up the notion of injecting C math libraries as a way to deal with bottlenecks. What I am proposing is more of a "data engineering" approach: write certain types of queries for particular calculations in ways that we know will reduce the load on the database.

Whereas injecting C math libraries would address CPU performance in the application, it would not address I/O concerns. I/O usage will creep upward over time if we don't control for it at some point.

ORM Raw Concerns

ORM and Raw SQL Queries: While GORM provides a convenient abstraction, GROUP BY does not seem to be well supported by the ORM's methods; for example, Group() only accepts a single string input, whereas we actually need to group by two fields to count the total number of first-time bets across all markets. In cases like this, executing raw SQL is the only way to perform the operation.

While this exposes us to raw SQL, which is undesirable, we could make exceptions where GORM lacks library functionality, while also adopting a policy of never allowing writes via Raw() and keeping the number of raw queries to a minimum: "use the library except in these circumstances."
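As a sketch of what such an exception could look like: the two-field GROUP BY for first-time bets can be written as one read-only raw query (table and column names below are assumptions, not our actual schema), which with GORM would run as something like `db.Raw(firstTimeBetsSQL).Scan(&count)`. The pure-Go function mirrors the query's logic only so the intent is easy to verify:

```go
package main

import "fmt"

// firstTimeBetsSQL counts each user's first bet on each market, i.e.
// the number of distinct (username, market_id) pairs. Table and column
// names are hypothetical. With GORM this would be executed as a
// read-only raw query, e.g. db.Raw(firstTimeBetsSQL).Scan(&count).
const firstTimeBetsSQL = `
SELECT COUNT(*) FROM (
    SELECT username, market_id
    FROM bets
    GROUP BY username, market_id
) AS first_bets;`

// Bet is a hypothetical, minimal stand-in for a bets model.
type Bet struct {
	Username string
	MarketID int64
}

// countFirstTimeBets is the in-Go equivalent of firstTimeBetsSQL,
// shown only to document what the raw query computes.
func countFirstTimeBets(bets []Bet) int {
	type pair struct {
		Username string
		MarketID int64
	}
	seen := make(map[pair]struct{})
	for _, b := range bets {
		seen[pair{b.Username, b.MarketID}] = struct{}{}
	}
	return len(seen)
}

func main() {
	bets := []Bet{
		{"alice", 1}, {"alice", 1}, {"alice", 2}, {"bob", 1},
	}
	fmt.Println(countFirstTimeBets(bets)) // 3
}
```

The point of the policy is that the SQL string lives in one reviewed place, is read-only, and is documented by a plain-Go equivalent, rather than raw queries appearing ad hoc throughout handlers.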

Why Not Couple Queries With Models

My thought is that we need a way of conceptually keeping our queries, and how we perform them, separate from the models themselves. This highlights that queries can be costly, and it restricts the number of queries we put in place.

Now, obviously, having a Raw() method may seem to go against this, but as mentioned in the section above, it is there for exceptions, not as a rule. The intention is not to fully restrict ourselves to what one library supports, but rather to restrict ourselves to a set of functions that we pay attention to and maintain, to try to push the app in a performant direction from the start.
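A minimal sketch of what keeping queries separate from models could look like: a small repository interface naming the handful of queries we deliberately maintain, with implementations free to use GORM, or raw SQL where GORM falls short. The interface and in-memory fake below are hypothetical illustrations, not the actual codebase:

```go
package main

import "fmt"

// BetsRepository is a hypothetical example of the repository idea: a
// small, deliberately curated set of query functions, kept separate
// from the model structs so every new query is visible and reviewable.
type BetsRepository interface {
	CountBetsByMarket(marketID int64) (int64, error)
	CountFirstTimeBets() (int64, error)
}

// inMemoryBetsRepo is a toy implementation for illustration; a real
// implementation would wrap *gorm.DB and push both counts down to the
// database as GROUP BY / COUNT queries.
type inMemoryBetsRepo struct {
	betsByMarket map[int64]int64
	firstTime    int64
}

func (r *inMemoryBetsRepo) CountBetsByMarket(marketID int64) (int64, error) {
	return r.betsByMarket[marketID], nil
}

func (r *inMemoryBetsRepo) CountFirstTimeBets() (int64, error) {
	return r.firstTime, nil
}

func main() {
	var repo BetsRepository = &inMemoryBetsRepo{
		betsByMarket: map[int64]int64{1: 2, 2: 1},
		firstTime:    3,
	}
	n, _ := repo.CountBetsByMarket(1)
	fmt.Println(n) // 2
}
```

A side benefit of the interface is testability: handlers depend on the interface, so tests can use an in-memory fake without a database.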

Maintainability

At this point I don't think there are a huge number of ways we interact with the database. If anything, having a repo model will control the number of new types of queries we write and the number of ways we solve problems. While it may slow down certain buildouts and features, we could hypothetically allow regular GORM usage during the buildout of new features, while requiring that a repo model be put in place after the new feature is built and completed.

pwdel commented 1 month ago

@j4qfrost and @ajlacey

The above conversation is attempting to address only the question of "should we do this," related to my current MR / statsreporting. Everything else, in terms of reducing the code size and all the other comments you both made, I intend to fix per what you suggested. I'm just trying to offer an explanation as to why a data repo vs. coupling functions with the models.

pwdel commented 1 month ago

I set up this MR to help discuss the topic above.

https://github.com/openpredictionmarkets/socialpredict/pull/352
