googleapis / google-cloud-python

Google Cloud Client Library for Python
https://googleapis.github.io/google-cloud-python/
Apache License 2.0

Build out some integration with Cloud Bigtable #872

Closed jgeewax closed 9 years ago

jgeewax commented 9 years ago

We want to make users of Happybase "just work", so...some ideas:

Option 1: A wrapped import module

from gcloud.bigtable import happybase

Option 2: Some sort of monkey patching?

from gcloud import bigtable
import happybase
happybase = bigtable.monkey_patch(happybase)

jgeewax commented 9 years ago

/cc @maxluebbe

jgeewax commented 9 years ago

/cc @dhermes @tseaver : Max might have some spare cycles to work on this, figured you guys should chat.

dhermes commented 9 years ago

Relevant docs: https://happybase.readthedocs.org/en/latest/ https://cloud.google.com/bigtable/docs/

tseaver commented 9 years ago

ISTM that monkey-patching the happybase third-party module would be a poor choice. I think we need to think of this as "write a happybase API emulation layer" instead (which may or may not itself even use the real happybase code).

dhermes commented 9 years ago

Can you elaborate? Why is it a bad idea? (I have yet to look at the happybase code/docs.)

tseaver commented 9 years ago

I think our goal is that they should be able to port scripts easily (ideally, just changing import happybase to from gcloud.bigtable import happybase and adjusting the connection parameters). I don't see interfering with the actual happybase module's internals as a useful way to approach that goal. For instance, users of gcloud.bigtable might need to interact with a real happybase/Thrift backend (e.g., to move or compare data between it and the bigtable dataset).

jgeewax commented 9 years ago

I agree that there may be times where someone would want to talk to Cloud Bigtable and HBase at the same time, so monkey-patching happybase sounds like we'd screw over those multi-tenant situations.

Maybe there's a way to from gcloud import bigtable and then happybase_compatible_client = bigtable.get_happybase_client() which uses the real happybase code, but swaps out a connection of sorts at the lower level...

tseaver commented 9 years ago

The gcloud.bigtable docs say:

Cloud Bigtable works with Hadoop 2.4 and higher and is interface compatible with Apache HBase 1.0.

If that means that it exports the same REST API as HBase / Thrift, then maybe we don't have to do much to enable happybase, beyond ensuring that we set the connection up with gcloud-relevant credentials.

jgeewax commented 9 years ago

This is the interesting piece. It is not wire-protocol compatible. That is -- it doesn't speak Thrift (AFAIK). Instead, they made the Java code compatible, and swapped out the thing that does serialization and sending of requests underneath.

I think @maxluebbe can comment more to confirm.

dhermes commented 9 years ago

Yeah that one is important. I was under the impression from @jgeewax's previous comments that the web APIs were the same / close cousins.

brianolson commented 9 years ago

The Java "HBase" client to Bigtable appears to be an emulation layer. A Python client should be built on the same protobuf rpc interface as that: https://github.com/GoogleCloudPlatform/cloud-bigtable-client/tree/master/bigtable-protos/src/main/proto/google

I'd imagine that Python client having a native mode that just handles the proto-rpc; and then there could be a happybase emulation layer on top of that.

maxluebbe commented 9 years ago

Sorry have been OOO. Some clarifications. 1) Additional work item is modernizing Happybase to 1.0 HBase API. 2) Cloud Bigtable does not support Thrift/REST interfaces. Part of the work that needs to be done here is decoupling the Thrift parts of happybase from the interface provided to the user, so that we can drop in gRPC in place of Thrift.
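To make the decoupling idea concrete, here is a minimal sketch of a happybase-style interface whose transport can be swapped out. It is an illustration under assumptions, not happybase's or gcloud's actual code: InMemoryTransport and its mutate_row/read_row methods are hypothetical names, and the Connection constructor here takes a transport object rather than the host/port arguments of the real happybase.Connection. Only the table(), put() and row() method names mirror the happybase API.

```python
class InMemoryTransport(object):
    """Hypothetical stand-in for a Thrift- or gRPC-backed transport."""

    def __init__(self):
        self._rows = {}

    def mutate_row(self, table, row_key, cells):
        # Merge the new cells into whatever is already stored for this row.
        self._rows.setdefault((table, row_key), {}).update(cells)

    def read_row(self, table, row_key):
        # Return a copy so callers cannot mutate the stored state.
        return dict(self._rows.get((table, row_key), {}))


class Table(object):
    """Mirrors the put()/row() subset of the happybase Table interface."""

    def __init__(self, name, transport):
        self.name = name
        self._transport = transport

    def put(self, row, data):
        self._transport.mutate_row(self.name, row, data)

    def row(self, row):
        return self._transport.read_row(self.name, row)


class Connection(object):
    """User-facing entry point; it knows nothing about the wire protocol."""

    def __init__(self, transport):
        self._transport = transport

    def table(self, name):
        return Table(name, self._transport)


connection = Connection(InMemoryTransport())
table = connection.table("my-table")
table.put(b"row-key", {b"cf:col": b"value"})
print(table.row(b"row-key"))
```

With this shape, a gRPC-backed transport would only need to provide the same two methods, and user code written against Connection and Table would not change.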

jgeewax commented 9 years ago

gRPC or Protobuf over HTTP 1.0?

gRPC is not possible right now given the installation process (we need to keep our install to pip install gcloud and can't ask people to do a bunch of Makefile madness to use our library).

brianolson commented 9 years ago

Is it acceptable for gRPC to be an optional install? (and python bigtable might not work without it, but the rest of gcloud-python would) Maybe spin off the python bigtable client into something not part of 'gcloud-python'? (Seems fair, the Java bigtable client is its own project/repo.)

jgeewax commented 9 years ago

You're telling me there's no way to talk to Cloud Bigtable's API without using a gRPC client? We can't use HTTP 1.0 ? That seems.... really really weird. @maxluebbe @carterpage can you confirm?

carterpage commented 9 years ago

We're designed to be very low latency and high-throughput so all of our focus is on gRPC. There is a REST interface, which comes for free with our envelope, but we don't recommend using it for much more than slow admin API calls or ad hoc test queries against the data API. Almost everything in the API is a binary string, including keys and values, so there would be a lot of base64 encoding going on -- a big CPU hit on both sides. We also have not tested whether streaming scans would work in REST.
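To put a rough number on the base64 cost mentioned above, here is a small sketch using only the Python standard library. The base64_overhead helper is a hypothetical name made up for illustration and says nothing about how the Bigtable REST interface actually encodes its payloads; it only shows the generic size inflation of base64 and the fact that both sides have to do the work.

```python
import base64
import os


def base64_overhead(num_bytes):
    """Return (raw_size, encoded_size) for a random binary value."""
    raw = os.urandom(num_bytes)
    encoded = base64.b64encode(raw)
    # The receiving side has to decode again, so the CPU cost is paid twice.
    assert base64.b64decode(encoded) == raw
    return len(raw), len(encoded)


raw_size, encoded_size = base64_overhead(3000)
print(raw_size, encoded_size)  # 3000 4000
```

Base64 emits 4 output characters for every 3 input bytes, so every key and value grows by about a third on the wire, on top of the encode and decode time.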

dhermes commented 9 years ago

Can we help make the gRPC install process easy?

jgeewax commented 9 years ago

Some references.... (Short answer... not really. The best we have for Linux is sudo apt-get install linuxbrew && sudo brew install grpc)

I believe the issue here is that gRPC requires some crazy stuff in SSL libs and others, which won't make it into mainstream world for quite some time, and users can't have two versions of those libraries the way apt handles packaging (for Linux).

As for @carterpage 's comment...

We're designed to be very low latency and high-throughput so all of our focus is on gRPC. There is a REST interface, which comes for free with our envelope, but we don't recommend using it for much more than slow admin API calls or ad hoc test queries against the data API.

HBase today doesn't speak HTTP 2.0 right? So wouldn't the REST-ful HTTP API calls be comparable to existing HBase clusters in latency and HTTP overhead?

carterpage commented 9 years ago

HBase has their own bespoke RPC system that's actually fairly fast. The issue for us is, using Google's plumbing, gRPC is the only option to provide the sort of latencies we need.

RESTful HTTP API calls would be better than running a Thrift server in front of HBase, since that adds an extra hop, but that's a fairly low bar. I think the performance comparison we want to aim for is Happybase against raw HBase. I believe that uses HBase's binary RPC format, so I don't see how we'd match it without gRPC.

jgeewax commented 9 years ago

Got it. We won't be able to "depend" on gRPC via PyPI, so that means in our docs we'll basically have to say "If you want to use Cloud Bigtable via gcloud-python, you'll need to follow these steps to install gRPC on your machine".

This will clearly have more drop-off, but I suspect we already have a somewhat small pool of people using Cloud Bigtable (given the lowest cost product we offer people is $1500 / month), so adding an extra "here install this" hurdle wouldn't be the end of the world.

dhermes commented 9 years ago

Shouldn't the dev exp be even better for the people paying more? (Sorry if that is an unpopular opinion.)

jgeewax commented 9 years ago

Haha - yes, it should. The developer experience in this case is blazing fast managed HBase -- which is pretty incredible (and what you're paying for).

As far as "installing the client", it seems that our problem with packaging gRPC is not really in our control. The library itself (I'm told) depends on some pretty serious changes to libraries which are in the pipeline but not aimed for main-stream for quite some time. The best we can do (again, I'm told) is using linux brew (brew install grpc).

carterpage commented 9 years ago

JJ hit the nail on the head. This is a pretty bleeding edge service, so there are still some rough edges that will get polished as more network stacks are comfortable with HTTP2, etc. Our typical customer is technically sophisticated and willing to tolerate a little extra effort for the added power. Hopefully we'll have the best of both worlds before long.

dhermes commented 9 years ago

So you're extending existing Python libraries to support HTTP2 or making your own? Why not just ship a custom fork?

Is the issue compiling C extensions for Python or is the issue making new versions of system socket libraries (e.g. things in /usr/lib)? It should still be possible to just point to local paths for things that would typically be found in /usr/lib.

dhermes commented 9 years ago

Some context for Python 3 support in happybase (and deps):

https://github.com/wbolster/happybase/issues/40 https://github.com/wbolster/happybase/pull/78 https://github.com/wbolster/happybase/pull/95

maxluebbe commented 9 years ago

Let's not think about Python 3 at this time. Getting a solid Python 2.x experience is our top priority.

tseaver commented 9 years ago

@maxluebbe the rest of gcloud-python is already Python3-compatible. At a minimum, we need all "hard" dependencies to be compatible as well, so that our CI story stays sane. I guess we could keep our code source-compatible, make happybase a "soft" dependency, and define the wrappers / tests only if it is importable.
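A "soft" dependency of that kind could look roughly like the sketch below. optional_import, HAVE_HAPPYBASE and make_connection are hypothetical names chosen for illustration; the only third-party call assumed is happybase.Connection, which is only reached when happybase is actually installed.

```python
import importlib


def optional_import(name):
    """Return the named module, or None if it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


happybase = optional_import("happybase")
HAVE_HAPPYBASE = happybase is not None

if HAVE_HAPPYBASE:
    def make_connection(*args, **kwargs):
        # Thin pass-through to the real library when it is available.
        return happybase.Connection(*args, **kwargs)
else:
    def make_connection(*args, **kwargs):
        # Fail at call time, not import time, so the rest of the package
        # (and its test suite) still works without happybase.
        raise RuntimeError("happybase is not installed")
```

Tests for the wrappers can then be skipped when HAVE_HAPPYBASE is false, which keeps CI green on interpreters where happybase cannot be installed.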

dhermes commented 9 years ago

@jgeewax suggested we use happybase. I'm trying to get a sense of its popularity after @maxluebbe made me step back and ask "why happybase?".

It only has 17 tagged StackOverflow questions (as of June 10 at noon Pacific).

The install count, on the other hand, is not so small:

$ pip install -U vanity
$ vanity happybase
...
happybase has been downloaded 487933 times!

This is in comparison to 37,602,602 for boto and 10,308,673 for httplib2, so I'm not sure how to gauge it.

jgeewax commented 9 years ago

Well keep in mind that HBase is far less commonly used... The price point alone would be somewhat prohibitive for people here... (cheapest thing Cloud Bigtable sells is $1500 / month).

Think of it a bit like RedShift, costing $13k/TB/year -- if you could download a RedShift only library I wouldn't expect many installs showing up in vanity...

jgeewax commented 9 years ago

Also -- about the Python2/3 situation:

We're already effectively saying that you need to manually install gRPC. Would you guys prefer to have a separate library to install that is Python 2 only and requires gRPC? or to make a submodule that we flag as "won't run under Python 3, sorry" ?

/cc @maxluebbe

dhermes commented 9 years ago

Yeah the separate library (or component) (e.g. pip install ipython[all]) is likely a good way to go.

@jgeewax Is there some reason you feel like happybase is the "one true" library for HBase? Some recommendation from somewhere or any other quantifiable signal?

maxluebbe commented 9 years ago

@dhermes We've had requests from trusted testers to make this library work, and were directed by product management to make this work. Is there an alternative Python HBase client library you have in mind?

dhermes commented 9 years ago

None at all. I am a total n00b here, just wanted to know where it came from.

We'd love to hear from the TTs here too.

FWIW happybase seems to be a fine library and has many very similar concepts to this one. I just have no basis to understand its prevalence.


Also @tseaver mentioned previously that we may end up / be better off just making

import happybase_gcloud as happybase

work without any other code changes necessary.

jgeewax commented 9 years ago

@dhermes : Any status updates here? Can we maybe come up with the list of crap we'd need to do?

dhermes commented 9 years ago

Sorry was boarding a plane when I saw this comment :) Working on it today

dhermes commented 9 years ago

I've been hacking on this, but grpc install is still posing a big issue.

Which protos need to be compiled for the service to work? It doesn't seem like there are any docs (on https://cloud.google.com/bigtable/docs/) for the actual web API, but maybe I'm supposed to use the .proto files as documentation. Every doc just assumes you're using an HBase client, though the python samples (rest and thrift) may also provide some "docs" (a bit cowboy-coding-esque).

lesv commented 9 years ago

The HappyBase API would be the ideal API for Python, based on the HBase API we use for the Java client. HappyBase talks to a Thrift server, but we'd rather it talked directly to a Cloud Bigtable server.

For what you are doing, you might benefit by looking at the GoLang Client as its use of the gRPC protos for Bigtable might help.

dhermes commented 9 years ago

@lesv Sadly there is a custom gRPC Go client, so it's much easier for them to integrate. The install story for the gRPC C/C++ libraries is the biggest hang-up. As for the protos, I already linked to them and as for HappyBase, that is the current plan.

lesv commented 9 years ago

The protos from the HBase Client repo are the ones you need: gRPC protos for Bigtable

jgeewax commented 9 years ago

@dhermes : Any progress to report ? People getting itchy about this one........

dhermes commented 9 years ago

I still haven't gotten what I needed.

Main blockers:

  1. grpc install doesn't really work (OpenSSL dependency/conflict and upstream proto3)
  2. Which protos need to be compiled for the service to work? (Note, my last link was broken in the last 12 days because it was renamed from v1approved to v1.)

If I want to put grpc aside for the moment and just focus on the Cloud BigTable server (@lesv says "HappyBase talks to a Thrift server, but we'd rather it talked directly to a Cloud Bigtable server.")

  1. There are no docs (on https://cloud.google.com/bigtable/docs/) for the actual web API
  2. (Relating to above) Am I supposed to use the .proto files as documentation? Cowboy coding may work for Xooglers and really motivated OSS contributors, but do we expect users to do the same?
  3. Should I use the python samples (rest and thrift) as my "docs" for the server? (The rest_client.py certainly seems to show some REST endpoints, but the README for it says "HBase REST Gateway setup and configuration" so it's probably just thrift over REST.)

jgeewax commented 9 years ago

OK @dhermes, can you list out the specific things you need here so we can start getting them for you ?

dhermes commented 9 years ago

If we want to go directly to the Cloud BigTable server:

  1. Which .proto files define the insert/retrieve endpoints? (I.e. typically REST-y endpoints.)
  2. What protoc command should be run? Do I need the proto3 compiler (needed for grpc)? Is this documented anywhere?
  3. Can I ship compiled Python modules (compiled from .proto files) with the library or does every user need to compile them on their own? (I hope the former.)
  4. IIUC, gRPC talks to (mostly) the same endpoints, but over HTTP/2 instead of HTTP/1.1. Where are the docs for this gRPC API? Or am I just supposed to use the protos with nothing else documented?

jgeewax commented 9 years ago

Re 3: I'm being told by @lesv that Proto3 won't compile Python that can run on different systems... I'm astounded at this, but apparently it's true...

Can anyone double confirm as this seems like it'd be a showstopper... ?

dhermes commented 9 years ago

I'd guess the Python files are the same, but the actual native dependencies are not. It could also be possible that the compiled Python files are just .so C extensions, which would be platform dependent.

lesv commented 9 years ago

In looking over the grpc.io docs, it looks like the .so files are only required for building the thing and for the http2 stuff. The protos are in python. (sorry for the confusion)

jgeewax commented 9 years ago

OK - can someone build us some Python from those protos? :)

jgeewax commented 9 years ago

Never mind -- just did it, sent the protos to @dhermes via e-mail for him to include in his PR.

dhermes commented 9 years ago

Made https://github.com/dhermes/gcloud-python-bigtable and will ping this thread when there is something working. Eventually we can prune the commit history and move the repo over to GoogleCloudPlatform (or integrate it into this repo).

dhermes commented 9 years ago

After a false start using HTTP/1.1 (not supported with the BigTable API) I've successfully got the _pb2 modules built with the gRPC extensions.

However the basic docs don't explain how to call an external service: http://www.grpc.io/docs/installation/python.html http://www.grpc.io/docs/tutorials/basic/python.html (e.g. how to set the path for a request, how to authenticate, etc.)

Would @nathanielmanistaatgoogle or @tbetbetbe or someone else on the gRPC team be willing to chat?

Right now I'm trying to adapt the sample code via: https://gist.github.com/dhermes/2edb97d9581b5ec471eb and am not having success (can't tell if a request is succeeding or failing).