/cc @maxluebbe
/cc @dhermes @tseaver : Max might have some spare cycles to work on this, figured you guys should chat.
ISTM that monkey-patching the `happybase` third-party module would be a poor choice. I think we need to think of this as "write a happybase API emulation layer" instead (which may or may not itself even use the real `happybase` code).
Can you elaborate? Why is it a bad idea? (I have yet to look at the `happybase` code/docs.)
I think our goal is that they should be able to port scripts easily (ideally, just changing `import happybase` to `from gcloud.bigtable import happybase` and adjusting the connection parameters). I don't see interfering with the actual `happybase` module's internals as a useful way to approach that goal. For instance, users of `gcloud.bigtable` might need to interact with a real `happybase`/Thrift backend (e.g., to move or compare data between it and the `bigtable` dataset).
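Something like this is the target experience (a sketch only; the `gcloud.bigtable.happybase` module and its connection parameters are hypothetical at this point):

```python
# Before: an existing HappyBase script talking to an HBase Thrift server.
# import happybase
# connection = happybase.Connection(host='my-thrift-host')

# After: the same script against Cloud Bigtable; only the import and the
# connection parameters change (these names are assumptions, not real API).
from gcloud.bigtable import happybase

connection = happybase.Connection(project='my-project', zone='my-zone',
                                  cluster='my-cluster')
table = connection.table('my-table')
print(table.row(b'row-key'))
```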
I agree that there may be times where someone would want to talk to Cloud Bigtable and HBase at the same time, so monkey-patching happybase sounds like we'd screw over those multi-tenant situations.
Maybe there's a way to `from gcloud import bigtable` and then `happybase_compatible_client = bigtable.get_happybase_client()` which uses the real happybase code, but swaps out a connection of sorts at the lower level...
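That could look roughly like this (a sketch; `get_happybase_client` and the shim are made up, though `happybase.Connection(autoconnect=False)` and its `client` attribute are real HappyBase features):

```python
import happybase


class _BigtableClientShim(object):
    """Hypothetical stand-in for HappyBase's Thrift client, implementing
    the same methods (getRow, mutateRow, ...) against Cloud Bigtable."""

    def __init__(self, project, cluster):
        self.project = project
        self.cluster = cluster


def get_happybase_client(project, cluster):
    # Build a real happybase.Connection without opening a Thrift socket,
    # then swap in our shim for the Thrift client it would normally use.
    connection = happybase.Connection(autoconnect=False)
    connection.client = _BigtableClientShim(project, cluster)
    return connection


happybase_compatible_client = get_happybase_client('my-project', 'my-cluster')
```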
The `gcloud.bigtable` docs say:

> Cloud Bigtable works with Hadoop 2.4 and higher and is interface compatible with Apache HBase 1.0.

If that means that it exports the same REST API as HBase / Thrift, then maybe we don't have to do much to enable `happybase`, beyond ensuring that we set the connection up with gcloud-relevant credentials.
This is the interesting piece. It is not wire-protocol compatible. That is -- it doesn't speak Thrift (AFAIK). Instead, they made the Java code compatible, and swapped out the thing that does serialization and sending of requests underneath.
I think @maxluebbe can comment more to confirm.
Yeah, that one is important. I was under the impression from @jgeewax's previous comments that the web APIs were the same / close cousins.
The Java "HBase" client to Bigtable appears to be an emulation layer. A Python client should be built on the same protobuf rpc interface as that: https://github.com/GoogleCloudPlatform/cloud-bigtable-client/tree/master/bigtable-protos/src/main/proto/google
I'd imagine that Python client having a native mode that just handles the proto-rpc; and then there could be a happybase emulation layer on top of that.
Sorry, have been OOO. Some clarifications:

1. An additional work item is modernizing HappyBase to the 1.0 HBase API.
2. Cloud Bigtable does not support Thrift/REST interfaces. Part of the work that needs to be done here is decoupling the Thrift parts of happybase from the interface provided to the user, so that we can drop in gRPC in place of Thrift.
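A minimal sketch of that decoupling (all names here are hypothetical): the user-facing HappyBase-style API would code against a small transport interface, so that a gRPC implementation can be dropped in where the Thrift one sits today.

```python
class Transport(object):
    """Interface the user-facing HappyBase-style API codes against."""

    def get_row(self, table_name, row_key):
        raise NotImplementedError


class ThriftTransport(Transport):
    """Today's happybase path: talks to an HBase Thrift server."""

    def get_row(self, table_name, row_key):
        raise NotImplementedError('would issue a Thrift getRow call')


class GrpcTransport(Transport):
    """The drop-in replacement: talks to Cloud Bigtable over gRPC."""

    def get_row(self, table_name, row_key):
        raise NotImplementedError('would issue a gRPC ReadRows call')
```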
gRPC or Protobuf over HTTP 1.0?
gRPC is not possible right now given the installation process (we need to keep our install to `pip install gcloud` and can't ask people to do a bunch of Makefile madness to use our library).
Is it acceptable for gRPC to be an optional install? (Python bigtable might not work without it, but the rest of gcloud-python would.) Maybe spin off the Python bigtable client into something that's not part of gcloud-python? (Seems fair; the Java bigtable client is its own project/repo.)
You're telling me there's no way to talk to Cloud Bigtable's API without using a gRPC client? We can't use HTTP 1.0 ? That seems.... really really weird. @maxluebbe @carterpage can you confirm?
We're designed to be very low latency and high-throughput, so all of our focus is on gRPC. There is a REST interface, which comes for free with our envelope, but we don't recommend using it for much more than slow admin API calls or ad hoc test queries against the data API. Almost everything in the API is a binary string, including keys and values, so there would be a lot of base64 encoding going on -- a big CPU hit on both sides. We also have not tested whether streaming scans would work in REST.
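To make the base64 overhead concrete (plain Python, just illustrating the roughly 33% size inflation on binary values, before counting the encode/decode CPU on both sides):

```python
import base64
import os

value = os.urandom(1024)           # a binary cell value
encoded = base64.b64encode(value)  # what a REST/JSON envelope would carry
print(len(value), len(encoded))    # 1024 vs. 1368 bytes
```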
Can we help make the gRPC install process easy?
Some references.... (Short answer... not really. The best we have for Linux is `sudo apt-get install linuxbrew && sudo brew install grpc`.)
I believe the issue here is that gRPC requires some crazy stuff in SSL libs and others, which won't make it into mainstream distributions for quite some time, and users can't have two versions of those libraries given the way apt handles packaging (for Linux).
As for @carterpage's comment...

> We're designed to be very low latency and high-throughput, so all of our focus is on gRPC. There is a REST interface, which comes for free with our envelope, but we don't recommend using it for much more than slow admin API calls or ad hoc test queries against the data API.
HBase today doesn't speak HTTP/2, right? So wouldn't the RESTful HTTP API calls be comparable to existing HBase clusters in latency and HTTP overhead?
HBase has their own bespoke RPC system that's actually fairly fast. The issue for us is, using Google's plumbing, gRPC is the only option to provide the sort of latencies we need.
RESTful HTTP API calls would be better than running a Thrift server in front of HBase, since that adds an extra hop, but that's a fairly low bar. I think the performance comparison we want to aim for is Happybase against raw HBase. I believe that uses HBase's binary RPC format, so I don't see how we'd match it without gRPC.
Got it. We won't be able to "depend" on gRPC via PyPI, so that means in our docs we'll basically have to say "If you want to use Cloud Bigtable via gcloud-python, you'll need to follow these steps to install gRPC on your machine".
This will clearly have more drop-off, but I suspect we already have a somewhat small pool of people using Cloud Bigtable (given the lowest cost product we offer people is $1500 / month), so adding an extra "here install this" hurdle wouldn't be the end of the world.
Shouldn't the dev exp be even better for the people paying more? (Sorry if that is an unpopular opinion.)
Haha - yes, it should. The developer experience in this case is blazing fast managed HBase -- which is pretty incredible (and what you're paying for).
As far as "installing the client", it seems that our problem with packaging gRPC is not really in our control. The library itself (I'm told) depends on some pretty serious changes to libraries which are in the pipeline but not aimed for main-stream for quite some time. The best we can do (again, I'm told) is using linux brew (brew install grpc
).
JJ hit the nail on the head. This is a pretty bleeding-edge service, so there are still some rough edges that will get polished as more network stacks are comfortable with HTTP/2, etc. Our typical customer is technically sophisticated and willing to tolerate a little extra effort for the added power. Hopefully we'll have the best of both worlds before long.
So you're extending existing Python libraries to support HTTP/2 or making your own? Why not just ship a custom fork?
Is the issue compiling C extensions for Python, or is the issue making new versions of system socket libraries (e.g. things in `/usr/lib`)? It should still be possible to just point to local paths for things that would typically be found in `/usr/lib`.
Some context for Python 3 support in `happybase` (and deps):
https://github.com/wbolster/happybase/issues/40
https://github.com/wbolster/happybase/pull/78
https://github.com/wbolster/happybase/pull/95
Let's not think about Python 3 at this time. Getting a solid Python 2.x experience is our top priority.
@maxluebbe the rest of `gcloud-python` is already Python 3-compatible. At a minimum, we need all "hard" dependencies to be compatible as well, so that our CI story stays sane. I guess we could keep our code source-compatible, make `happybase` a "soft" dependency, and define the wrappers / tests only if it is importable.
@jgeewax suggested we use `happybase`. I'm trying to get a sense of its popularity after @maxluebbe made me step back and ask "why happybase?".
It only has 17 tagged StackOverflow questions (as of June 10 at noon Pacific).
Checking installs, it's not so small:

```
$ pip install -U vanity
$ vanity happybase
...
happybase has been downloaded 487933 times!
```
This is in comparison to 37,602,602 for `boto` and 10,308,673 for `httplib2`, so I'm not sure how to gauge it.
Well keep in mind that HBase is far less commonly used... The price point alone would be somewhat prohibitive for people here... (cheapest thing Cloud Bigtable sells is $1500 / month).
Think of it a bit like RedShift, costing $13k/TB/year -- if you could download a RedShift-only library I wouldn't expect many installs showing up in vanity...
Also -- about the Python2/3 situation:
We're already effectively saying that you need to manually install gRPC. Would you guys prefer a separate library that is Python 2-only and requires gRPC, or a submodule that we flag as "won't run under Python 3, sorry"?
/cc @maxluebbe
Yeah, the separate library (or component, e.g. `pip install ipython[all]`) is likely a good way to go.
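In `setup.py` terms that would be an extra, something like the sketch below (assuming a pip-installable gRPC distribution eventually exists; the extra's name is made up):

```python
from setuptools import setup

setup(
    name='gcloud',
    version='0.0.0',
    packages=['gcloud'],
    extras_require={
        # `pip install gcloud` stays lightweight;
        # `pip install gcloud[bigtable]` pulls in the gRPC dependency.
        'bigtable': ['grpcio'],
    },
)
```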
@jgeewax Is there some reason you feel like `happybase` is the "one true" library for HBase? Some recommendation from somewhere or any other quantifiable signal?
@dhermes We've had requests from trusted testers to make this library work, and were directed by product management to make this work. Is there an alternative Python HBase client library you have in mind?
None at all. I am a total n00b here, just wanted to know where it came from.
We'd love to hear from the TTs here too.
FWIW `happybase` seems to be a fine library and has many concepts very similar to this one. I just have no basis to understand its prevalence.
Also @tseaver mentioned previously that we may end up / be better off just making `import happybase_gcloud as happybase` work without any other code changes necessary.
@dhermes : Any status updates here? Can we maybe come up with the list of crap we'd need to do?
Sorry was boarding a plane when I saw this comment :) Working on it today
I've been hacking on this, but the `grpc` install is still posing a big issue.
Which protos need to be compiled for the service to work? It doesn't seem like there are any docs (on https://cloud.google.com/bigtable/docs/) for the actual web API, but maybe I'm supposed to use the `.proto` files as documentation. Every doc just assumes you're using an HBase client, though the Python samples (rest and thrift) may also provide some "docs" (a bit cowboy-coding-esque).
The HappyBase API would be the ideal API for Python, based on the HBase API we use for the Java client. HappyBase talks to a Thrift server, but we'd rather it talked directly to a Cloud Bigtable server.
For what you are doing, you might benefit from looking at the Go client, as its use of the gRPC protos for Bigtable might help.
@lesv Sadly there is a custom gRPC Go client, so it's much easier for them to integrate. The install story for the gRPC C/C++ libraries is the biggest hang-up. As for the protos, I already linked to them, and as for HappyBase, that is the current plan.
The protos from the HBase client repo are the ones you need: gRPC protos for Bigtable
@dhermes : Any progress to report ? People getting itchy about this one........
I still haven't gotten what I needed.
Main blockers:

1. `grpc` install doesn't really work (OpenSSL dependency/conflict and upstream proto3). (The protos also moved from `v1approved` to `v1`.)
2. If I want to put `grpc` aside for the moment and just focus on the Cloud Bigtable server (@lesv says "HappyBase talks to a Thrift server, but we'd rather it talked directly to a Cloud Bigtable server."), should I use the `.proto` files as documentation? Cowboy coding may work for Xooglers and really motivated OSS contributors, but do we expect users to do the same? (`rest_client.py` certainly seems to show some REST endpoints, but the README for it says "HBase REST Gateway setup and configuration" so it's probably just Thrift over REST.)

OK @dhermes, can you list out the specific things you need here so we can start getting them for you?
If we want to go directly to the Cloud Bigtable server:

1. Which `.proto` files define the insert/retrieve endpoints? (I.e. typically REST-y endpoints.)
2. What `protoc` command should be run? Do I need the proto3 compiler (needed for `grpc`)? Is this documented anywhere?
3. Can we ship the compiled Python (from the `.proto` files) with the library, or does every user need to compile them on their own? (I hope the former.)

Re 3: I'm being told by @lesv that Proto3 won't compile Python that can run on different systems... I'm astounded at this, but apparently it's true...
Can anyone double-confirm, as this seems like it'd be a showstopper...?
I'd guess the Python files are the same, but the actual native dependencies are not. It could also be possible that the compiled Python files are just `.so` C extensions, which would be platform-dependent.
In looking over the grpc.io docs, it looks like the `.so` files are only required for building the thing and for the HTTP/2 stuff. The protos are in Python. (Sorry for the confusion.)
OK - can someone build us some Python from those protos? :)
Nevermind -- just did it, sent the protos to @dhermes via e-mail for him to include in his PR.
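For reference, the compile step can be scripted; this sketch uses the `grpc_tools.protoc` entry point from the (later) `grpcio-tools` package, and the proto path and file name are assumptions based on the repo linked above:

```python
from grpc_tools import protoc

protoc.main([
    'protoc',
    '--proto_path=bigtable-protos/src/main/proto',
    '--python_out=generated',
    '--grpc_python_out=generated',
    'google/bigtable/v1/bigtable_service.proto',
])
```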
Made https://github.com/dhermes/gcloud-python-bigtable and will ping this thread when there is something working. Eventually we can prune the commit history and move the repo over to GoogleCloudPlatform (or integrate it into this repo).
After a false start using HTTP/1.1 (not supported with the Bigtable API), I've successfully got the `_pb2` modules built with the gRPC extensions.
However, the basic docs don't explain how to call an external service (e.g. how to set the path for a request, how to authenticate, etc.):
http://www.grpc.io/docs/installation/python.html
http://www.grpc.io/docs/tutorials/basic/python.html
Would @nathanielmanistaatgoogle or @tbetbetbe or someone else on the gRPC team be willing to chat?
Right now I'm trying to adapt the sample code via: https://gist.github.com/dhermes/2edb97d9581b5ec471eb and am not having success (can't tell if a request is succeeding or failing).
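For anyone following along, the rough shape of a call with generated stubs looks like this (a sketch using the current `grpc` channel API rather than the early-adopter one; the module, stub, and request names are guesses based on the v1 protos and may not match the generated code exactly):

```python
import grpc
from google.bigtable.v1 import bigtable_service_pb2 as data_pb2

# Auth (OAuth2 call credentials) omitted for brevity.
channel = grpc.secure_channel('bigtable.googleapis.com:443',
                              grpc.ssl_channel_credentials())
stub = data_pb2.BigtableServiceStub(channel)

request = data_pb2.ReadRowsRequest(
    table_name='projects/my-project/zones/my-zone/'
               'clusters/my-cluster/tables/my-table')
for response in stub.ReadRows(request):  # server-streaming RPC
    print(response)
```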
We want existing HappyBase users' code to "just work", so... some ideas:
Option 1: A wrapped import module
Option 2: Some sort of monkey patching?
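A skeleton of Option 1, for concreteness (every name here is hypothetical): a module that re-exports a HappyBase-compatible surface backed by Cloud Bigtable, so user scripts only change their import line to `import happybase_gcloud as happybase`.

```python
class Connection(object):
    """Drop-in for happybase.Connection, backed by Cloud Bigtable."""

    def __init__(self, project, cluster):
        self._project = project
        self._cluster = cluster

    def table(self, name):
        return Table(name, self)


class Table(object):
    """Subset of happybase.Table implemented against the Bigtable API."""

    def __init__(self, name, connection):
        self.name = name
        self._connection = connection

    def row(self, row_key):
        raise NotImplementedError('would issue a ReadRows RPC')
```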