metadb-project / metadb

Metadb extends PostgreSQL with features to support analytics, such as streaming data sources, data model transforms, and historical data.
Apache License 2.0

Add parallel read from kafka #66

Closed nazgaret closed 2 months ago

nazgaret commented 4 months ago

  • Increase consumer numbers
  • Increase connection numbers
  • Tune checkpointSegmentSize default value
  • Change requirements in docs

CLAassistant commented 4 months ago

CLA assistant check
All committers have signed the CLA.

nassibnassar commented 2 months ago

A few thoughts:

  • If you would, please describe and document in detail the changes you are proposing.
  • This pull request appears to cause the server to report prematurely that the resync snapshot is complete, in our case after it processed only 13% of the data.
  • We encountered an error at server startup due to the configured security protocol for Kafka not being used. Please follow the use of configuration values and other conventions in the source code.

Thank you.

MikeTaylor commented 2 months ago

I'll throw in a few lower-level observations, too:

  • cmd/metadb/dbx/dbx.go line 167: It's probably a mistake that the NewPool function sets a hardwired MaxConns value of 20, and that mistake is not really addressed by changing the hardwired value to 40. This should be configurable.
  • cmd/metadb/util/util.go line 221: Similar comments apply to the old and new hardwired values for checkpointSegmentSize. Since this value can already be overridden in configuration, changing the default value seems unnecessary.
  • doc/serveradmin.adoc line 22: There seems to be no reason to bump the required version of Go up to 1.22. The top-level go.mod file is definitive, and it says 1.21.
  • go.mod and go.sum: One new dependency is added (golang.org/x/sync), which is used by the new code, so that is reasonable. But upgrades of several other modules are also included, and it is not clear to me whether they are necessary.

nazgaret commented 2 months ago

> I'll throw in a few lower-level observations, too:
>
>   • cmd/metadb/dbx/dbx.go line 167: It's probably a mistake that the NewPool function sets a hardwired MaxConns value of 20, and that mistake is not really addressed by changing the hardwired value to 40. This should be configurable.
>   • cmd/metadb/util/util.go line 221: Similar comments apply to the old and new hardwired values for checkpointSegmentSize. Since this value can already be overridden in configuration, changing the default value seems unnecessary.
>   • doc/serveradmin.adoc line 22: There seems to be no reason to bump the required version of Go up to 1.22. The top-level go.mod file is definitive, and it says 1.21.
>   • go.mod and go.sum: One new dependency is added (golang.org/x/sync), which is used by the new code, so that is reasonable. But upgrades of several other modules are also included, and it is not clear to me whether they are necessary.

Thanks, Mike!

1. We have already made both of these values configurable (the connection count and checkpointSegmentSize) in the next version of our changes. I'll update this PR tomorrow.
2. The Go version requirement needs to be removed from the doc.
3. IntelliJ IDEA reports that these library versions have security issues, so I updated them.

MikeTaylor commented 2 months ago

Thanks, @nazgaret. Some projects like to preserve a hygiene protocol where each commit does a single thing, so arguably it would be cleaner to make the security-related updates of dependencies in a separate PR. But I will leave it for @nassibnassar to comment on this, since he knows how he's been handling such things up till now.

nazgaret commented 2 months ago

> A few thoughts:
>
>   • If you would, please describe and document in detail the changes you are proposing.
>   • This pull request appears to cause the server to report prematurely that the resync snapshot is complete, in our case after it processed only 13% of the data.
>   • We encountered an error at server startup due to the configured security protocol for Kafka not being used. Please follow the use of configuration values and other conventions in the source code.
>
> Thank you.

Thanks, Nassib! I'll update this PR tomorrow and describe what we have done. I also need to take another look at the resync procedure to fix it. Did I understand you correctly that I should use the security protocol configured for the poll procedure for all work with Kafka?

nazgaret commented 2 months ago

> Thanks, @nazgaret. Some projects like to preserve a hygiene protocol where each commit does a single thing, so arguably it would be cleaner to make the security-related updates of dependencies in a separate PR. But I will leave it for @nassibnassar to comment on this, since he knows how he's been handling such things up till now.

@MikeTaylor, that totally makes sense; I'll remove them from this PR.

nassibnassar commented 2 months ago

Thank you very much for this code submission and your work on it. I have added it to the repository at commit 2c821f6ca4751082e2a157be5e3b7564a4c3c07b (not merged) for reference.

I think increasing the number of Kafka consumers is a good general direction. I see that this code submission distributes topics to individual consumers in a group, which is a resourceful approach. It does not appear to address the creation of new topics, although this could be handled in some way at a later time. Also, we should be careful to check thread safety when moving the stream processor into concurrent threads. For other issues, see my previous comments above.

If possible, I would prefer to take a different approach to scaling consumers. Rather than manually distributing the topics, we should be able to have Kafka assign partitions. This would have the advantages of adapting to new topics created mid-stream and potentially balancing the workload. I have drafted this in commit 54c0c513bed7dc9f3a47813cc9b45020f1a8779f (not yet merged) and will add some comments to outline the approach. Some limited testing has been done; more is needed. I am closing the pull request for now, on the assumption that this approach will work as expected.

Thank you again.