goldmansachs / gs-quant

Python toolkit for quantitative finance
https://developer.gs.com/discover/products/gs-quant/
Apache License 2.0

Content module #36

Open g12mcgov opened 5 years ago

g12mcgov commented 5 years ago

Describe the problem

Currently there is no programmatic way of accessing Marquee Content through gs_quant. This is a feature proposal to add a content module for interacting with the new Marquee Content API (/v1/content).

Describe the solution you'd like

State of the world:

At the time of writing, there are two primary means of retrieving content via the Marquee Content API:

| Description | Method | Endpoint | Developer Site URL |
| --- | --- | --- | --- |
| Get a content piece | GET | /v1/content/{id} | Link |
| Get many content pieces | GET | /v1/content | Link |

Eventually, the entire suite of endpoints will be implemented, which will allow querying, searching, updating, and creating content.

Proposed Solution:

Get Many Contents:

gs_quant should expose a Content module that supports the above endpoints.

from gs_quant.content import Content

content = Content()
content.get_many(**kwargs)

# Kwargs correspond to supported query params on the API, e.g.:
#
# content.get_many(authorId=<some_author_id>)
# content.get_many(tag=<some_tag>)
# content.get_many(assetId=<some_asset_id>)
# etc...
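
One way the kwargs-to-query-parameter mapping could work is sketched below. This is only an illustration, not the gs_quant implementation: the helper name build_content_query is hypothetical, and repeating a parameter for list values is an assumption about the API. Only the /v1/content path and the authorId, tag, and assetId parameter names come from the proposal.

```python
from urllib.parse import urlencode

CONTENT_PATH = "/v1/content"  # endpoint path taken from the proposal


def build_content_query(**kwargs):
    """Hypothetical helper: turn keyword arguments into a request path.

    Arguments set to None are dropped. List or tuple values are emitted as
    repeated parameters (tag=a&tag=b), which is an assumption about the API.
    """
    params = {key: value for key, value in kwargs.items() if value is not None}
    if not params:
        return CONTENT_PATH
    return CONTENT_PATH + "?" + urlencode(params, doseq=True)


print(build_content_query(authorId="abc123", tag=["fx", "rates"]))
```

A get_many(**kwargs) method could pass the resulting path to whatever HTTP session the module uses.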

Get a Single Content:

gs_quant should expose a Content module that supports the above endpoints.

from gs_quant.content import Content

content = Content()
content.get('<some_content_id>')

All returned content will be of type ContentResponse. This object is documented on the Marquee Developer Site.

By default, all content is Base64-encoded and returned along with its associated MimeType. This allows the content to be transported via JSON, given that we support many different content types (HTML, text, image, PDF, etc.).

A client of gs_quant using that content module might then do:

import base64
from gs_quant.content import Content

content = Content()
response = content.get('<some_content_id>')

# b64decode returns bytes; decode to str (assuming the body is UTF-8 text)
text = base64.b64decode(response.content.body).decode('utf-8')
# <html><h1>blah blah blah</h1></html>
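
Since the body may be HTML, text, an image, or a PDF, a client would probably want to branch on the MIME type after decoding. The following is a rough sketch under stated assumptions: decode_body is a hypothetical helper, and treating every text/* type as UTF-8 is an assumption rather than documented API behavior.

```python
import base64


def decode_body(body_b64, mime_type):
    """Hypothetical helper: decode a Base64 body according to its MIME type."""
    raw = base64.b64decode(body_b64)
    # Assumption: text/* bodies are UTF-8; anything else stays as raw bytes.
    if mime_type.startswith("text/"):
        return raw.decode("utf-8")
    return raw


# Build a sample payload locally instead of calling the API.
encoded = base64.b64encode(b"<html><h1>hello</h1></html>").decode("ascii")

html = decode_body(encoded, "text/html")        # str
blob = decode_body(encoded, "application/pdf")  # bytes, left undecoded
print(html)
print(type(blob).__name__)
```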

Describe alternatives you've considered

Currently bouncing between the following two implementation styles:

1) Instantiate a Content() object (as in the examples above) and call methods on it, like:

content = Content()
content.get()
content.get_many()
...

2) Go the route of the Dataset model, where the code would look like this:

content = Content('<some_content_id>')
contents = Content('<some_content_id_1>', '<some_content_id_2>', ...)

Not really a fan of this approach for content as I think it's a little awkward / doesn't really provide a fluent API for querying/searching.

Are you willing to contribute

Yes!

Additional context

N/A

@andyphillipsgs @francisg77 @bobbyalex83 @ScottWeinstein

andrewphillipsn commented 5 years ago

The Dataset model is intended to provide an abstraction to multiple data sources, i.e. to allow gs_quant to source data from other places than the Marquee API. For content, we should start with access to the underlying Marquee APIs e.g.

Note that in the Dataset example, the ID refers to a data source, not to an individual row. So in your example, it would probably map to a channel, not to a content item. Let's get a couple more opinions as well.

francisg77 commented 5 years ago

An API class is definitely the best starting point. Agree on the stream comments with datasets, although a Content() item just has to map to an actual piece of content with metadata; perhaps ContentChannel() becomes a first-class object. The question is then also where functions like 'Get many content pieces' should go. I would imagine similar questions come up for many other APIs.

Two key options as I see it:

  1. Content.get_many() - the typical API model. Convenient as stored on Content piece, blurs the boundary between data items and querying, similar to datasets.
  2. Separate ContentQueryEngine - similar to general server-side development, and assets with SecurityMaster. Clearer item/query distinction, but extra classes and level of indirection
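
To make the trade-off concrete, here is a minimal sketch of both options side by side. Everything in it is illustrative: the in-memory _STORE stands in for the Marquee Content API, ContentItem and its fields are hypothetical, and neither class reflects actual gs_quant code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContentItem:
    id: str
    channel: str


# Hypothetical in-memory store standing in for the Marquee Content API.
_STORE = [
    ContentItem("c1", "macro"),
    ContentItem("c2", "fx"),
    ContentItem("c3", "macro"),
]


# Option 1: the query lives on the item class itself.
class Content(ContentItem):
    @classmethod
    def get_many(cls, channel: str) -> List["Content"]:
        return [cls(item.id, item.channel) for item in _STORE if item.channel == channel]


# Option 2: items are plain data and a separate engine does the querying.
class ContentQueryEngine:
    def __init__(self, store: List[ContentItem]):
        self._store = store

    def get_many(self, channel: str) -> List[ContentItem]:
        return [item for item in self._store if item.channel == channel]


option_1 = Content.get_many(channel="macro")
option_2 = ContentQueryEngine(_STORE).get_many(channel="macro")
print([c.id for c in option_1], [c.id for c in option_2])
```

Both return the same items; the difference is whether callers need to know about a second class.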

Let's discuss

g12mcgov commented 5 years ago

@francisg77 @andyphillipsgs

Already added the abstraction layer you mentioned, gs_quant.content, which takes in a provider (in this case GsContentAPI).

As for your other points, I prefer the first option, since it's also consistent with how datasets work currently in the API. Something like:

content = Content(channel='<some_channel>')
content = Content(assetId='<some_asset_id>')
...
# Default with no kwargs will just rely on the content-api doing a default lookup based on who you are.
content = Content()

Then, expose out a method(s) like:

content.get_many(offset=0, limit=10)

This way is nice too because helper methods could be added as well, like content.get_text() to extract the raw text and abstract away the need for base64 decoding etc.

Only issue I see with this is that there's no need for a get-single-content method now, right? (i.e. content.get('<some_id>')), but maybe that's not really an issue.
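
A rough sketch of how this shape could fit together, with filters bound at construction, paging on get_many, and a get_text helper. It is only an illustration under assumptions: FakeContentProvider is a hypothetical stand-in for a provider such as GsContentAPI, the provider is passed explicitly for clarity, and the item fields are made up.

```python
import base64


class FakeContentProvider:
    """Hypothetical stand-in for a provider such as GsContentAPI."""

    def __init__(self, items):
        self._items = items

    def get_many(self, offset, limit, **filters):
        # Keep items matching every filter, then apply paging.
        matches = [
            item for item in self._items
            if all(item.get(key) == value for key, value in filters.items())
        ]
        return matches[offset:offset + limit]


class Content:
    def __init__(self, provider, **filters):
        self._provider = provider
        self._filters = filters  # e.g. channel=..., assetId=...

    def get_many(self, offset=0, limit=10):
        return self._provider.get_many(offset, limit, **self._filters)

    def get_text(self, offset=0, limit=10):
        # Decode each Base64 body, assuming it holds UTF-8 text.
        return [
            base64.b64decode(item["body"]).decode("utf-8")
            for item in self.get_many(offset, limit)
        ]


def _b64(text):
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


provider = FakeContentProvider([
    {"id": "c1", "channel": "macro", "body": _b64("first")},
    {"id": "c2", "channel": "fx", "body": _b64("second")},
    {"id": "c3", "channel": "macro", "body": _b64("third")},
])
content = Content(provider, channel="macro")
print(content.get_text(limit=1))
```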

g12mcgov commented 5 years ago

@francisg77 @andyphillipsgs

Made an MR with the changes described above in the GitLab repo. You can find that MR here: https://gitlab.gs.com/marquee/analytics/gs_quant/merge_requests/254

Had to do it internally since I needed to generate the new Content types.