ru-stat / data-team-ru-stat

Team to tackle dataflow in Russian economic statistics (macro, corporate, banking)

Live session to discuss project requirements and design Sun, June 18 21:30 (GMT+5:30) 19:00 (GMT+3) #6

Closed epogrebnyak closed 7 years ago

epogrebnyak commented 7 years ago

Agenda:

Venue: https://hangouts.google.com/hangouts/_/r7fusbkh3neivjcbabvqgxigsme

Discussion materials:

Suggested reading:

Time zones: 21:30 (GMT+5:30) - India, 19:00 (GMT+3) - Moscow, Dnipro

Please drop a note below if you are able to join.

Rotzke commented 7 years ago

I am in :)

Akrishna91 commented 7 years ago

I am able to join yeah!

Rotzke commented 7 years ago

Also, guys, please join our Slack team (not many people there for some reason): https://join.slack.com/openstat-team/shared_invite/MTk4MTUzODM3NTM4LTE0OTc1MTczODctMThhNTE3Yzc1NA

neotheicebird commented 7 years ago

Am in too :)

MrBorusLee commented 7 years ago

I'm in

rohlin17 commented 7 years ago

I'm in

epogrebnyak commented 7 years ago

Dear all, here is some material for the requirements, to be continued. Please leave some sort of like / dislike once you have read it, but let's discuss it at the session. More parts to follow.

Mission

Enable access to continuously updated macroeconomic, corporate and bank data for Russia.

1. Macroeconomic time series

Use cases

User 1 ('Sasha') is a data analyst. He is proficient in R/pandas and likes CSV files better than Excel. He knows how to demonstrate his work in Jupyter notebooks, is willing to upload it to GitHub, and wants his graphs and models to be continuously updated without having to download data manually. He has worked with Quandl and FRED, and at work he has paid access to CEIC and Bloomberg.

User 2 ('Vera') is also a researcher. She is confident with Excel and can do all of her work without depending on programming. She maintains her universe of data, graphs and models as a collection of local files. Her boss is an Excel user too, so there is little incentive for Vera to move to R/pandas. She would consider downloading the initial data as an Excel file, e.g. exchange rates or interest rates, but only once she trusts the data source. She has tried some Excel add-ons, but did not like them much.

User 3 ('Simon') is a journalist. He covers the Russian economy and once in a while comments on data releases. He calls analysts at banks and research centers to find out their views on the newest inflation or household income figure, and can deliver an economic story in plain words. He has a BA in Medieval Literature, so he is not into Excel at all. Simon would rather click through some graphs that let him question analysts better. He is also keen to post his story on FB or Twitter before other journalists do.

User 4 is a Robot. It learned we have some great open data and wants to borrow from us.

User 5 is a boss's secretary (give name). She wants to hand the boss a printout with "all latest macro", so the boss feels confident at a meeting. Neither is an expert in macroeconomics, but the printout allows them to ask questions and sustain some discussion. See a nice example of "all macro" at the NY Fed.

Data access methods by users:

Principal:

Additional:

Data structure

A macroeconomic time series dataset is a pandas/R dataframe with variable names as column names and timestamps as row heads. At monthly frequency it looks something like this:
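As a rough illustration of this layout (not the project's actual data), here is a minimal sketch in plain Python rather than pandas/R. The labels CPI_rog and USDRUR_rub, the helper names and all values are assumptions made up for the example.

```python
from datetime import date

# Hypothetical labels; the values are placeholders, not real statistics.
COLUMNS = ["CPI_rog", "USDRUR_rub"]

# Row heads are end-of-month timestamps, one row per month.
FRAME = {
    date(2017, 1, 31): [100.1, 60.0],
    date(2017, 2, 28): [100.2, 59.0],
    date(2017, 3, 31): [100.3, 58.0],
}


def column(frame, columns, label):
    """Return one variable as a {timestamp: value} mapping."""
    position = columns.index(label)
    return {timestamp: values[position] for timestamp, values in frame.items()}


def as_text(frame, columns):
    """Render the frame as a plain-text table with timestamps as row heads."""
    lines = ["time_index  " + "  ".join(columns)]
    for timestamp in sorted(frame):
        cells = "  ".join(str(value).rjust(len(name))
                          for name, value in zip(columns, frame[timestamp]))
        lines.append(f"{timestamp.isoformat()}  {cells}")
    return "\n".join(lines)


print(as_text(FRAME, COLUMNS))
```

In pandas the same shape would be a DataFrame with a date index and one column per label.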

<insert example here>

Frequencies: The dataframe may be at annual, quarterly, monthly, weekly or daily frequency. Some variables are annual and quarterly (GDP), some are found at all frequencies (exchange rate). Let's denote frequencies by "aqmwd". Each frequency has its own dataframe; frequencies cannot be mixed in a single dataframe.
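One possible way to enforce the "one dataframe per frequency" rule is sketched below. The record format (freq, label, timestamp, value), the function name and the values are assumptions for illustration only.

```python
FREQUENCIES = "aqmwd"  # annual, quarterly, monthly, weekly, daily


def split_by_frequency(observations):
    """Group (freq, label, timestamp, value) records into one table per frequency.

    Returns {freq: {timestamp: {label: value}}}, so frequencies never mix.
    """
    frames = {freq: {} for freq in FREQUENCIES}
    for freq, label, timestamp, value in observations:
        if freq not in frames:
            raise ValueError(f"unknown frequency: {freq!r}")
        frames[freq].setdefault(timestamp, {})[label] = value
    return frames


# Placeholder records, not real statistics.
observations = [
    ("a", "GDP_rog", "2016", 100.0),
    ("q", "GDP_rog", "2017Q1", 100.0),
    ("m", "USDRUR_rub", "2017-01", 60.0),
    ("d", "USDRUR_rub", "2017-01-09", 60.0),
]
frames = split_by_frequency(observations)
print(sorted(freq for freq, table in frames.items() if table))
```

The same label (GDP_rog here) can appear in several frequency tables, but each table holds only one frequency.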

Files: A dataframe is dumped to a CSV file, which is found at a stable URL or retrieved by API.
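A minimal sketch of the CSV dump and read-back, using only the standard library. The column name time_index, the function names and the values are assumptions; the stable URL and the API are left out because they are not specified here.

```python
import csv
import io


def dump_csv(labels, rows):
    """Serialise {timestamp: {label: value}} to CSV text, one row per timestamp."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["time_index"] + labels)
    for timestamp in sorted(rows):
        # Missing observations are written as empty cells.
        writer.writerow([timestamp] + [rows[timestamp].get(label, "") for label in labels])
    return buffer.getvalue()


def load_csv(text):
    """Parse CSV text back into {timestamp: {label: value}}, skipping empty cells."""
    reader = csv.DictReader(io.StringIO(text))
    rows = {}
    for row in reader:
        timestamp = row.pop("time_index")
        rows[timestamp] = {label: float(cell) for label, cell in row.items() if cell != ""}
    return rows


# Placeholder values, not real statistics.
labels = ["CPI_rog", "USDRUR_rub"]
rows = {
    "2017-01-31": {"CPI_rog": 100.1, "USDRUR_rub": 60.0},
    "2017-02-28": {"CPI_rog": 100.2},
}
text = dump_csv(labels, rows)
print(text)
```

A round trip through load_csv(dump_csv(...)) should give back the same mapping, which makes a cheap consistency check before publishing a file.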

Namespace: a variable label consists of two parts: varname in capital letters (GDP) and unit of measurement in lowercase (rog), joined by a single underscore (GDP_rog). Each part may have its own underscores. Throughout the code I use label for the full variable name, varname for the uppercase part and unit for the lowercase part.
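The naming rule can be checked mechanically. Below is a sketch of one way to split a label into varname and unit; the function names and the second example label (RETAIL_SALES_bln_rub) are hypothetical, only GDP_rog comes from the text above.

```python
def split_label(label):
    """Split 'GDP_rog' into ('GDP', 'rog').

    Leading uppercase tokens form the varname, the remaining lowercase
    tokens form the unit; both parts may contain underscores.
    """
    tokens = label.split("_")
    varname_tokens = []
    for token in tokens:
        if not token.isupper():
            break
        varname_tokens.append(token)
    unit_tokens = tokens[len(varname_tokens):]
    if not varname_tokens or not unit_tokens:
        raise ValueError(f"label needs both varname and unit: {label!r}")
    if not all(token.islower() for token in unit_tokens):
        raise ValueError(f"unit must be lowercase: {label!r}")
    return "_".join(varname_tokens), "_".join(unit_tokens)


def make_label(varname, unit):
    """Join varname and unit back into a full label."""
    return f"{varname}_{unit}"


print(split_label("GDP_rog"))
print(split_label("RETAIL_SALES_bln_rub"))
```

Rejecting labels such as GDP (no unit) or gdp_rog (lowercase varname) early keeps column names consistent across sources.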

Vintage: a dataset as released at some date is called a vintage. Some macroeconomic data gets revised (like GDP), so the accuracy of the first estimate (GDP for 2017Q1 released in May or June 2017) compared to the final figure (GDP for 2017Q1 published by the end of 2017) is a subject of professional discussion. Many end users just want the latest values.
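To make the idea of vintages concrete, here is a sketch that stores each release under its release date, serves the latest one by default, and compares the first estimate with the latest. The storage layout, function names, release dates and values are all assumptions for illustration.

```python
from datetime import date


def latest_values(vintages):
    """Return the dataset from the most recent release date."""
    return vintages[max(vintages)]


def revision(vintages, label, period):
    """Return (first_estimate, latest_estimate) for one data point."""
    dated = sorted(
        (released, dataset[label][period])
        for released, dataset in vintages.items()
        if period in dataset.get(label, {})
    )
    if not dated:
        raise KeyError(f"no vintage contains {label} for {period}")
    return dated[0][1], dated[-1][1]


# Hypothetical release dates and placeholder values, not real statistics.
vintages = {
    date(2017, 6, 1): {"GDP_rog": {"2017Q1": 100.0}},
    date(2017, 12, 1): {"GDP_rog": {"2017Q1": 101.0}},
}
first, latest = revision(vintages, "GDP_rog", "2017Q1")
print(first, latest)
```

Most end users would only ever call latest_values; the revision history matters for those studying the accuracy of first estimates.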

User scenarios

List of data sources

Agencies:

TODO: Add links to sources and repos, ask Marcel to review/comment, make this a table?

Wins

Risks and checks

Extensions

Delete upon review:

Use case for the project, based on:

2. Corporate reports

https://github.com/epogrebnyak/data-rosstat-boo-2013 - machine-readable dataset of 2012-2015 financial reports of Russian enterprises

3. Bank reports

https://github.com/epogrebnyak/cbr-bankform-reader - Reads bank form data from DBF and text files. Emits clean data by row, storable in a database. Supports form 101 and 102.

Rotzke commented 7 years ago

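Well, Vera can use the same CSV as Sasha in Excel :) Would suggest:

  • API for Robot and Sasha;
  • CSV for Sasha and Vera;
  • Google Charts on website with sharing option for Simon.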

epogrebnyak commented 7 years ago

You have to know Vera :) She wants the three frequencies of variables on different sheets, followed by a sheet with varnames. Cannot deny what a woman wants.

But to sort things out: the priority is Sasha and Robot, with Vera and Simon as add-on features.

On 18 June 2017 at 11:44, "Nikita Rotsky" notifications@github.com wrote:

Well, Vera can use the same .CSV as Sasha in Excel :) Would suggest:

  • API for Robot and Sasha;
  • CSV for Sasha and Vera;
  • Google Charts on website with sharing option for Simon;


Rotzke commented 7 years ago

@epogrebnyak Oh, OK, if it is mandatory we could use something like the "xlsxwriter" library to convert CSV into XLSX to please this woman ;)

Agree with priorities.

rohlin17 commented 7 years ago

I spent six years in the Financial and Economic Department of one of the Top 50 Russian banks. Excel is absolutely the reporting standard. Yes, we developed a big data warehouse and quite a complex Business Intelligence system, but Vera will use Excel forever: she downloads Excel reports from the BI system and slices and dices them however she wants on her desktop computer. And here Excel is the best.

epogrebnyak commented 7 years ago

Todo after the live session:

Is this correct?

Rotzke commented 7 years ago

Actually, write 1 pipeline with requirements :)

Everything else is correct.