hnwyllmm / snip

snip code
1 stars 0 forks source link

测试一下中文翻译成英语是否好用 #6

Open hnwyllmm opened 2 months ago

hnwyllmm commented 2 months ago

Question

我看官方文档中存在如下描述:

ob-spec

如果有一个电商 SaaS,【订单表】有 商户ID( int ) 和 订单号( varchar(32) ) 存在唯一约束,这个时候我们也应该用 这两个字段 做联合主键吗?

众所周知,MySQL 采用的是 B+Tree 的结构来存储数据,一般建议 ID是自增的,这样可以避免随机插入,导致 MySQL 数据页频繁分裂,进而导致检索数据时 IO 开销增加。

然而,商户ID + 订单号 也很明显不是自增的,所以我才有此疑问。 对于这种场景,如果 OceanBase 仍然推荐使用它们联合作为主键,是因为 LSM-Tree 能够很好地处理 裂页 问题么 ?

此外,除了主键索引,我们一般还会额外在关键业务表上创建几个二级索引。 在 MySQL 中,因为每个二级索引的叶子节点都会存储相应行的主键值。所以,如果主键较大,那么每个二级索引的叶子节点所需的空间也会更大,因为它们需要存储更大的主键值。

如果主键是 bigint 的自增ID,在每个二级索引上存储,关联主键也就占用 8个字节 的空间。 如果主键是 int + varchar(32) 这种联合索引,是不是就意味着,关联主键也要占用至少 4+32 个字节的空间呢? 如果有多个二级索引(一般2~4个),是不是也要重复多占用不少空间 ?这样的话,存储空间 以及 维护这些二级索引的开销不是也增大了许多么 ?

Issues-translate-bot commented 2 months ago

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


Title: Test whether the translation from Chinese to English is easy to use

Question

I read [Official Document](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000640146#%E6%99%AE%E9%80%9A%E8%A1%A8%E7% The following description exists in BB%93%E6%9E%84%E8%AE%BE%E8%AE%A1%E8%A7%84%E8%8C%83):

ob-spec

If there is an e-commerce SaaS, the [order table] has merchant ID (int) and order number (varchar(32)) with unique constraints. Should we also use these two fields as joint primary keys at this time?

As we all know, MySQL uses the B+Tree structure to store data. It is generally recommended that the ID be auto-incremented. This can avoid random insertion, which will lead to frequent splitting of MySQL data pages, which will lead to increased IO overhead when retrieving data.

However, the merchant ID + order number is obviously not self-increasing, so I have this question. For this scenario, if OceanBase still recommends using their union as the primary key, is it because LSM-Tree can handle the split page problem well?

In addition, in addition to the primary key index, we generally create several additional secondary indexes on key business tables. In MySQL, because the leaf node of each secondary index stores the primary key value of the corresponding row. So, if the primary key is larger, the space required for each secondary index leaf node will also be larger because they need to store larger primary key values.

If the primary key is a bigint auto-increment ID, stored on each secondary index, the associated primary key will also occupy 8 bytes of space. If the primary key is a joint index such as int + varchar(32), does it mean that the associated primary key also takes up at least 4+32 bytes of space? If there are multiple secondary indexes (usually 2 to 4), will they also occupy a lot of space? In this case, wouldn't the storage space and the cost of maintaining these secondary indexes also increase a lot?

hnwyllmm commented 2 months ago

先说“如果主键是 int + varchar(32) 这种联合索引,是不是就意味着,关联主键也要占用至少 4+32 个字节的空间呢?” --> 在OceanBase 中 int也是按照8位来存储的,所以这里应该就是 8+32=40字节,一条数据比bigint自增ID多了32个字节。按照1亿个数据量来算,就多了大概3G的存储。如果有3个索引,也不过10G的存储。先不说你肯定不在乎这10G的存储空间(毕竟业务量都上亿了),OceanBase在存储时还会压缩的,最终的数据,即使是多副本,也通常比MySQL占用的存储空间少了50%,具体可以看看官网上的一些案例描述,或许会有案例可以帮到你们。


再看“自增ID作为主键”的问题,在分布式数据库上这是一个老生常谈的话题,看看gpt的回答:

写入热点:自增主键意味着新插入的行总是拥有比之前高的主键值。这在分布式数据库中可能会造成主键值集中在某个范围内,从而导致所有新的插入操作都集中在特定的分区或节点上。这种情况创建了所谓的“写入热点”,可能会降低系统的性能和扩展性。

分布式事务冲突:在分布式系统中保持主键的唯一自增值需要跨多个节点协调,这可能会导致无法避免的分布式事务冲突和协调开销。同步自增值可能会导致延迟和额外的性能损耗。

扩展性限制:自增主键强制了数据的插入顺序,这可能导致数据在物理存储上的不均匀分布,对于需要水平扩展的分布式数据库来说,这是一个不利因素。扩展操作需要平衡和迁移数据,而自增主键可能会使这一过程更加复杂和低效。


网上也有很多自增主键的博客。大概意思都是自增主键在非分区数据库上都挺和谐,但是分区数据库上很困难。 对此,OceanBase还对自增列做了优化,参考自增 noorder 模式:https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000751258


在来看“对于这种场景,如果 OceanBase 仍然推荐使用它们联合作为主键,是因为 LSM-Tree 能够很好地处理 裂页 问题么 ?” LSM-Tree不像B+树,没有列页的问题。新来的数据更新,不管是新增、更新还是删除,都是先在内存上操作的。在一定的平衡场景下,性能不会受到“数据太分散”相关的影响。而LSM的合并操作,都是在后台进行的。

Issues-translate-bot commented 2 months ago

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


Answering your questions out of order. Let me first say, "If the primary key is a joint index such as int + varchar(32), does it mean that the associated primary key will also occupy at least 4+32 bytes of space?" --> In OceanBase, int is also stored in 8 bits, so it should be 8+32=40 bytes. A piece of data is 32 bytes more than the bigint auto-increment ID. Calculated based on the amount of data of 100 million, there is approximately 3G of additional storage. If there are 3 indexes, it will only take 10G of storage. Putting aside the fact that you definitely don’t care about the 10G storage space (after all, the business volume is hundreds of millions), OceanBase will also compress it during storage. The final data, even with multiple copies, usually takes up 50% less storage space than MySQL. %, for details, you can check out some case descriptions on the official website (https://www.oceanbase.com/customer/home). There may be cases that can help you.


Let’s look at the issue of “self-increasing ID as primary key”. This is a common topic in distributed databases. Take a look at gpt’s answer:

Write hotspots: Auto-incrementing primary keys means that newly inserted rows always have a higher primary key value than the previous one. In a distributed database, this may cause primary key values ​​to be concentrated in a certain range, causing all new insert operations to be concentrated on a specific partition or node. This situation creates so-called "write hotspots" that can reduce system performance and scalability.

Distributed transaction conflicts: Maintaining a unique auto-increment value for a primary key in a distributed system requires coordination across multiple nodes, which may lead to unavoidable distributed transaction conflicts and coordination overhead. Synchronous auto-increment may cause latency and additional performance loss.

Scalability limitations: Auto-incrementing primary keys enforce the insertion order of data, which may lead to uneven distribution of data on physical storage. This is a disadvantage for distributed databases that require horizontal expansion. Scaling operations require balancing and migrating data, and auto-incrementing primary keys can make this process more complex and inefficient.


There are also many blogs on the Internet that add primary keys. The general meaning is that auto-incrementing primary keys are quite harmonious on non-partitioned databases, but are very difficult on partitioned databases. In this regard, OceanBase has also optimized the auto-increment column. Please refer to the auto-increment noorder mode: https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000751258


Let's look at "For this scenario, if OceanBase still recommends using their union as the primary key, is it because LSM-Tree can handle the split page problem well?" Unlike B+ tree, LSM-Tree does not have the problem of column pages. New data updates, whether they are added, updated or deleted, are first operated on the memory. In a certain balanced scenario, performance will not be affected by "data is too scattered". The merge operations of LSM are all performed in the background.