simplesteph / kafka-connect-github-source

Get a stream of issues and pull requests for your chosen GitHub repository
https://links.datacumulus.com/kafka-connect-coupon
MIT License

Changed configuration to support multiple repositories and topic #10

Closed giampaolotrapasso closed 3 years ago

giampaolotrapasso commented 6 years ago

Changed the configuration to support multiple repositories and topics (issue #2). I did not create a single client: my first idea was to create a static singleton, but I'm not sure how it would behave in a distributed connector. I need to look more closely at how a connector works in a distributed environment.

If needed, that can be addressed in a separate issue. Let me know what you think.

giampaolotrapasso commented 6 years ago

Ok, now I see your point. Instead of splitting the configuration into multiple tasks, you want to keep the code as before but allow a task to query multiple repositories.

So I have a couple of questions before updating the PR.

First: backward compatibility is important, so given a configuration like this

topic=github-issues
github.owner=kubernetes
github.repo=kubernetes
github.repositories=scala/scala:lang-topic,apache/kafka:kafka-topic,apache/cassandra:cassandra-topic

there are 3 possible choices:

  1. ignore the old configuration
  2. ignore the new configuration (backward compatible)
  3. merge the repos (backward compatible)

Are you for 2 or 3?
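As a rough sketch of what option 3 (merging) could look like, here is some plain Java with hypothetical helper and config names, not the connector's actual code. It assumes `github.repositories` entries follow the `owner/repo:topic` format shown above, and gives the new config precedence when the same repo appears in both:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RepoConfigSketch {
    // Merge the new github.repositories list with the legacy
    // github.owner/github.repo pair (option 3). Returns repo -> topic.
    // Assumes every entry in `repositories` has an "owner/repo:topic" form.
    static Map<String, String> mergeRepos(String repositories,
                                          String legacyOwner,
                                          String legacyRepo,
                                          String legacyTopic) {
        Map<String, String> repoToTopic = new LinkedHashMap<>();
        if (repositories != null && !repositories.isEmpty()) {
            for (String entry : repositories.split(",")) {
                String[] parts = entry.split(":"); // "owner/repo:topic"
                repoToTopic.put(parts[0], parts[1]);
            }
        }
        if (legacyOwner != null && legacyRepo != null) {
            // New config wins if the same repo is configured twice.
            repoToTopic.putIfAbsent(legacyOwner + "/" + legacyRepo, legacyTopic);
        }
        return repoToTopic;
    }

    public static void main(String[] args) {
        System.out.println(mergeRepos(
            "scala/scala:lang-topic,apache/kafka:kafka-topic",
            "kubernetes", "kubernetes", "github-issues"));
    }
}
```

Option 2 would be the same sketch with the `putIfAbsent` branch short-circuiting whenever `repositories` is non-empty.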

The second question is: if a task is going to read from multiple repos, what kind of polling policy should be adopted? Round robin? Being async on futures? What would you suggest?
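To make the round-robin option concrete, a minimal sketch (illustrative names, not the connector's actual code) where each call advances to the next repo so every repo is visited once per cycle:

```java
import java.util.List;

class RoundRobinSketch {
    private final List<String> repos;
    private int next = 0;

    RoundRobinSketch(List<String> repos) {
        this.repos = repos;
    }

    // A task's poll() could call this to pick the repo to query
    // on each iteration, cycling through the configured repos.
    String nextRepo() {
        String repo = repos.get(next);
        next = (next + 1) % repos.size();
        return repo;
    }

    public static void main(String[] args) {
        RoundRobinSketch rr =
            new RoundRobinSketch(List.of("scala/scala", "apache/kafka"));
        for (int i = 0; i < 4; i++) {
            System.out.println(rr.nextRepo());
        }
    }
}
```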

Let me know, and thank you in the meantime for the review and your time.

simplesteph commented 6 years ago
  1. backwards compatibility: the new config should take precedence over the old config (if both are supplied); we deprecate the old config, but we keep both for now

  2. I looked back at my comment and it may not make sense. I like your approach in some ways. I think there are a few options:

     a. ignore the maxTasks parameter. I sort of like this, if we outline it in the documentation. It simplifies the user config too.
     b. keep your approach, ignoring some repos (may lead to unexpected behaviour).
     c. do what the JDBC connector does: https://github.com/confluentinc/kafka-connect-jdbc/blob/master/src/main/java/io/confluent/connect/jdbc/JdbcSourceConnector.java#L151, where each task can pull from multiple repos (round robin as you said, or multiple futures and awaiting on them).
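For reference, the JDBC-connector approach linked above spreads its work list across at most maxTasks groups, one group per task config. A simplified plain-Java sketch of that splitting (illustrative names; the real code uses Connect's `ConnectorUtils.groupPartitions`, which hands out contiguous chunks rather than the modulo used here):

```java
import java.util.ArrayList;
import java.util.List;

class TaskSplitSketch {
    // Split the configured repos into min(maxTasks, repos.size())
    // groups; the connector would emit one task config per group,
    // and each task polls every repo in its group.
    static List<List<String>> groupRepos(List<String> repos, int maxTasks) {
        int numGroups = Math.min(maxTasks, repos.size());
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < numGroups; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < repos.size(); i++) {
            groups.get(i % numGroups).add(repos.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(groupRepos(
            List.of("scala/scala", "apache/kafka", "apache/cassandra"), 2));
    }
}
```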

To be honest I'm leaning towards 2.a. What do you think?