hortonworks-spark / shc

The Apache Spark - Apache HBase Connector is a library to support Spark accessing HBase table as external data source or sink.
Apache License 2.0
552 stars 281 forks source link

[SHC-330] Adding ability to set first and last splits, so that non-alphabetic splits can be easily supported #331

Closed davidov541 closed 4 years ago

davidov541 commented 4 years ago

What changes were proposed in this pull request?

Adds two properties that allow the first and last split to be set by the user. This allows the use of more complex split preferences, as well as enables scenarios such as numeric-only keys, which were underserved in earlier versions.

How was this patch tested?

Two unit tests were added. One tests that for alphabetic keys, the rows are dispersed between regions correctly. The other uses numeric keys and sets the first and last splits accordingly. This also checks that the rows have been dispersed appropriately across regions.

davidov541 commented 4 years ago

Fixes issue https://github.com/hortonworks-spark/shc/issues/330

peter-toth commented 4 years ago

Looks good to me, @dongjoon-hyun could you pleas take a look as well?

davidov541 commented 4 years ago

@peter-toth : We don't need maven.yml. I suspect that was added automatically when I was trying to set up a maven repository to help the customer, and didn't realize it was included as part of the PR. It's now been removed.

OneCricketeer commented 4 years ago

Might want to add .github/workflows to the gitignore until if/when TravisCI is migrated to Github Actions.

davidov541 commented 4 years ago

@cricket007 Fixed the internal names. Thanks!

davidov541 commented 4 years ago

Thanks, @joshelser. I'm working on a rename of the parameter to match your request. I think that seems reasonable. Not being an HBase expert myself, I wasn't clear on the difference there, but I get it.

As for the README, I agree. In general, the README could use a lot of work. My understanding is that this project is EOL essentially. If so, is it still worth updating the README at this point in time? If the PR will be reviewed and accepted in a reasonable amount of time, I'd be down to update the README at the very least for this portion.

joshelser commented 4 years ago

please review before I merge it.

Looks like the configuration names changed, but not to what I had suggested: minRegionSplitPoint and maxRegionSplitPoint. I suggested minTableSplitPoint and maxTableSplitPoint. We're adding splits to a table which creates more regions. We're not adding split points to Regions. Sorry for being pedantic.

My understanding is that this project is EOL essentially. If so, is it still worth updating the README at this point in time

Great question that I don't have a good answer for :). Usually I follow a rule of "if I come back to something for a second time, I should fix the documentation because there will be a third time." Ultimately it's up to you. Hopefully we have the right folks with the right permissions to make changes going forward.

davidov541 commented 4 years ago

@joshelser : Don't worry about it. It's my fault for not paying enough attention. I'll get on that and update the PR in the next week or two.

I'll go ahead and make the change we've discussed to the README and submit a PR to everyone at some point in the next week.

peter-toth commented 4 years ago

Thanks @davidov541 for your work. I've merged it to master.