googleapis / google-cloud-python

Google Cloud Client Library for Python
https://googleapis.github.io/google-cloud-python/
Apache License 2.0
4.83k stars 1.52k forks source link

Feels like we've added a bit more complexity to the datastore API, comments? #647

Closed jgeewax closed 9 years ago

jgeewax commented 9 years ago

Hi guys,

I was reviewing the "state of the world" of the datastore API, and found a few things that seem (to me at least) a bit more complicated than they should be.

A bit of this comes from the authentication stuff (which we have another open bug for) but I wanted to bring up a few things for discussion... If this issue gets out of control, we can move the discussion elsewhere, but I wanted to kick things off:

Create a new Person entity in the Datastore

What I have to write

from gcloud import datastore
datastore.set_defaults()
entity = datastore.Entity(key=datastore.Key('Person'))
entity.update({'name': 'Jimmy'})
datastore.put([entity])

What I'd like to write

from gcloud import datastore
entity = datastore.Entity('Person')
entity.update({'name': 'Jimmy'})
entity.put()

Key differences

  1. no set_defaults() (there's another issue open for this I believe)
  2. Create an entity of kind Person without lots more typing (or maybe we get the entity with datastore.Kind('Person').get_entity() ?)
  3. .put not requiring a list for a single entity

    Get an existing Person (123) from the Datastore

    What I have to write

from gcloud import datastore
datastore.set_defaults()
entity = datastore.get(datastore.Key('Person', 123))

What I'd like to write

from gcloud import datastore
datastore.set_defaults()
entity = datastore.Kind('Person').get_by_id(123)

Key differences

  1. No set_defaults
  2. Equivalent of "drilling down" into a "table" (kind), and then a "row" (id)

(Not hugely concerned about this use case, but wanted to toss it out)

Create a new Person using a specific set of credentials (/home/creds.json) to talk to a specific dataset ('my-dataset') in the Datastore

(Being looked into in PR https://github.com/GoogleCloudPlatform/gcloud-python/pull/641)

What I have to write

_(This is a best guess... I don't know if this code is even right... And even so, this code won't do the same thing if you ran it on App Engine because of the order of evaluation here: http://googlecloudplatform.github.io/gcloud-python/latest/gcloud-api.html#gcloud.credentials.get_credentials )_

import sys
sys.env['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/creds.json'

from gcloud import datastore
datastore.set_defaults()
entity = datastore.Entity(key=datastore.Key('Person', dataset_id='my-dataset'))
entity.update{'name': 'Jimmy'})
datastore.put(entity)

What I'd like to write

from gcloud import datastore
datastore.set_credentials('/home/creds.json')
dataset = datastore.Dataset('my-dataset')
entity = dataset.Entity('Person')
entity.update({'name': 'Jimmy'})
entity.put()

Key differences

  1. Simple in-code way to set which credentials file to use
  2. High-level concept of a dataset
  3. Datasets can provide Entities just like datastore would

/cc @tseaver @dhermes @pcostell

dhermes commented 9 years ago

@jgeewax We spent a lot of work to get rid of the high-level concept of dataset. The philosophy was that we should make the objects and their constructors easier to use and that "learning" another class was overkill when the dataset was a) just a string and b) something that was a single-valued constant for most users.

jgeewax commented 9 years ago

That's going to be a really big problem when Datastore exposes multiple datasets per project...

elibixby commented 9 years ago
  1. I think that the dataset was a good higher-level concept, but it kinda got bulldozed when the major restructuring happened. You basically already have a dataset module, which is api. I think it's better to construct a dataset from datastore and leave datastore for meta-methods like set_defaults() now.
  2. I really really don't think you should be adding functionality to Entity. It inevitably ends up moving towards ORM which inevitably makes the library more and more heavyweight, and duplicates use-cases from NDB and DB. If anything I think the easiest way forward is to simplify entity. The most lightweight syntax I can think of is:

SETUP

datastore.set_defaults() //does auth and connection stuff
d = datastore.Dataset('dataset-id', **kwargs) //more advanced options like read_options should be put here since it is a one time operation

Put a single entity

key = d.put(Key(*path), Entity(**props))

Put a bunch of entities

d.put({key1: entity1, key2: entity2 ...})

Get an entity

d.get(Key(*path))

A couple comments about "what I'd like to write":

ent = dataset.Entity(key)
ent.update(props)
ent.put()

preferable to

ent = Entity(props)
dataset.put(key, ent)

I am completely boggled by this.

Finally, as an aside, I submitted a PR to remove the list wrapping requirement for single arguments #623 , but if you want to support python 27 and 26 then you need to use **kwargs and a parser rather than explicit named arguments. Dunno if it's worth it. Python 3 supports def func(arg, *args, kw=value, **kwargs) syntax.

elibixby commented 9 years ago

@jgeewax just wanted to add that I agree with the premise of this issue. I'm writing the samples for the landing page as we speak, and it's awkward to say the least.

jgeewax commented 9 years ago

Re "why .put()"

It's similar to the ext.db API. This is a matter of preference. See https://cloud.google.com/appengine/docs/python/datastore/modelclass#Model_put

profile = datastore.Entity('Profile').get_by_id(123)
profile['likes_fruit'] = False
profile.put()

Re "why .get_by_id()"

It's a helper-method for ultimately constructing a key which reads more clearly (to me). It is also similar to ext.db which provides a get_by_id method. Again, this is a matter of preference.

profile = datastore.Entity('Profile').get_by_id(1234)
# is the same as
profile = datastore.get(Key('Profile', 1234))

Re the comment about querying

We shouldn't be encouraging people to think of this as relational database. It isn't.

Nowhere here are we talking about relational data. get_by_id is a shortcut for get(Key(<kind already known>, id)). See https://cloud.google.com/appengine/docs/python/datastore/modelclass#Model_get_by_id

We also shouldn't be encouraging them to make 'kind' queries when they know the id, not only is it confusing, it's inefficient.

I don't see a query here, nor do I see anything related to efficiency. The code I'm talking about would likely look like...

class Entity(object):
  ...
  def get_by_id(self, id):
    return datastore.get(Key(self.kind, id))
elibixby commented 9 years ago

I thought the goal was to move away from the DB implementations, not take queues from their implementation. If the goal is to cover different use cases the syntax is going to be different.

RE: kind(foo).get_by_id(bar) I mistakenly assumed from your comment

Equivalent of "drilling down" into a "table" (kind), and then a "row" (id)

that this was how you meant it, and then it would be implemented by pulling the entities of a certain kind and searching for the right id.

EDIT: If you really wanted to add extra sugar on top of key, you could do something like

dataset.key(parent=parent_key, kind='kind', id='id')
jgeewax commented 9 years ago

I don't recall ever saying or hearing "we're going to avoid everything done in ext.db"...?

I think the .put() style on an Entity is a nice thing. I don't have to know anything about the connection, dataset, or authentication -- and .put() will just "save the entity the way it was retrieved". This might seem obvious now, but when multiple datasets are a thing, and different "connections" are floating around, it might be important. It means an Entity on it's own can be enough to persist data.

I could be wrong here. Either way, I don't think we should take a hard-line stance of "no stuff that was in ext.db"...

elibixby commented 9 years ago

I think that is going to be more of a trouble than a convenience TBH. To quote Zen of Python explicit is better than implicit.

Before the major restructuring Entity had all the necessary information (dataset, key, etc) to have all this functionality on it's own, and usage was much more cumbersome IMO. It was the point where I would rather just have used NDB, since it felt like I was being forced to do all the ORM stuff anyway.

dhermes commented 9 years ago

@jgeewax Can we close this out?

The core issues were

Both of those have been implemented.

Anything remaining can be broken off into a smaller more digestible bug for discussion?

jgeewax commented 9 years ago

SGTM