matthewgilbert / blp

Pythonic interface for Bloomberg Open API
Apache License 2.0
112 stars 24 forks source link

BQL query output format #30

Closed rayanbenredjeb closed 7 months ago

rayanbenredjeb commented 8 months ago
from blp import blp
bquery = blp.BlpQuery().start()
df = bquery.bql(expression="get(is_eps(fa_period_offset=1)) for('AAPL US EQUITY') with(dates=range(-1y,0d))",)
df.secondary_name.unique()

Problem description

The current behavior yields a df with this format

image

There are 2 problems:

  1. For queries that receive an response where "secondary_name" contains duplicates, those duplicates are not properly handled. It seems that only the first one is returned. The code up here gives a minimal example.
  2. More generally, I find that the above pictured format does not allow the user to realign with absolute certitude secondary values to primary ones. Actually, it boils down to saying that input date ("as_of_date") cannot be considered as a regular "secondary value", there must be a ("security", "as_of_date") combo that serves as a unique identifier.

Expected Output

  1. Reproduce the Excel behavior where the n-th duplicated secondary named is suffixed with ":n" e.g. PERIOD_END_DATE would be, in excel BQL, derived as PERIOD_END_DATE:1 and PERIOD_END_DATE:2 for the above query.
  2. Again, this should reproduce the Excel output format which repeats the "ID" ("security") and "DATES" ("AS_OF_DATE"'s value) as they are the unique identifiers of the data point, and introduces a new column for each value and secondary value. image

Thank you for considering these changes, I believe it is the only two problems that make it less accurate than its excel counterpart.

matthewgilbert commented 8 months ago

Maybe @avantgardeam has thoughts on this since they are the author of the feature.

Original PR and discussion of the feature is here https://github.com/matthewgilbert/blp/pull/28

pedroteles17 commented 8 months ago

Thank you for raising this issue!

I acknowledge that the current parsing method may not be ideal, as was mentioned in the initial pull request (PR). However, we are dealing with an undocumented endpoint, which means its behavior cannot be fully guaranteed.

In response to your observations, it is essential to recognize that the data frame is structured in a long format. This context explains why the appearance of duplicates in "secondary_names" is not only predictable but also intentional.

Moreover, aligning secondary values with primary ones should not pose any significant challenges. Transforming the data frame into a wide format would further clarify this point.

Attempting to replicate the behavior of the Excel API could introduce complications, such as the potential loss of important information like "REVISION_DATE" in your case, and may also lead to issues with edge cases where the parser might not handle certain queries accurately.

With all this in mind, I am open to reviewing any PR aimed at enhancing the parser's functionality. I would be willing to implement changes myself, provided they do not result in data loss or overlook any edge cases. As of now, I am not aware of any alternative solutions that meet these criteria.

avantgardeam commented 8 months ago

Hello @matthewgilbert,

Upon further review and consideration of @rayanbenredjeb's concerns, I believe our current implementation might benefit from some revisions.

As an example, take a query similar to the one presented by @rayanbenredjeb: get(is_eps(fa_period_offset=1), gics_sector_name()) for('AAPL US EQUITY') with(dates=range(-1D,0d)). In response, Bloomberg provides two distinct dataframes, as depicted in the linked image.

image

For user presentation, we have a couple of options: delivering the data as either a list of dataframes or as a dictionary with field names as keys and dataframes as values. Another method, as suggested by @rayanbenredjeb, is to merge these dataframes based on their ID, akin to Excel's functionality.

Considering the separate nature of these datasets, my initial preference is towards using a list or dictionary format for clarity. However, I am open to suggestions and would greatly appreciate your feedback on this matter.

matthewgilbert commented 7 months ago

What would the data returned from similar query in Excel look like for reference?

My initial thought is that these seem like a chained set of queries which are unrelated so returning as a list of dataframes would make sense, instead of trying to shoe horn them into a larger dataframe.

Also what does the current implementation return for your above query @avantgardeam ?

avantgardeam commented 7 months ago

@matthewgilbert, Image 1 shows the query result from Excel, and Image 2 from blp.

The original idea was to melt various fields into a unified structure, followed by the concatenation of these dataframes. However, if you attempt to pivot the dataframe returned by blp you will get ValueError: Index contains duplicate entries, cannot reshape.

Finally, I agree with you that forcing totally different datasets into a single one is not ideal. Based on that, should we return a list of dataframes (as blpapi does), or use a dict with fields as keys and dataframes as values?

Image 1: image1

Image 2: image2

matthewgilbert commented 7 months ago

I think either is fine, although I would probably lean towards a list.