Comments are open to Intro - part 1. Case for machine-readable data in economic analysis.

epogrebnyak commented 7 years ago

In economic analysis there is a range of computing tasks, from simple spreadsheet models (eg discounted cash flow to value a company) to econometric models (eg exchange rate forecasting), done wholly or partially in Excel.

Excel is deeply rooted in economic analysis, but this is changing. In Excel it is hard to make your graphs or models truly reproducible and transparent. In my mind economic analysis, say in a bank, will soon be 'robotised', like many other jobs. A new-generation analyst will command a collection of apps, not Excel spreadsheets.

For a non-Excel workflow you need the following (see also here):

datasource: a machine-readable datasource like FRED or other open data API
data analysis tools: something to dig into data with and make a bunch of plots, eg Pandas/R + possibly a Jupyter notebook (formerly IPython)
modelling tools: something to fit a model to data and do forecasting, likely R/Gretl for econometrics, something else for machine learning (not too familiar myself). The data analysis and modelling tool can be the same package, eg R.
presentation: some way to produce reports and presentations which avoids or complements Word or PowerPoint: a Jupyter notebook, GitPitch, or other
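
To make the four stages concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the inline CSV stands in for a download from an open data API, the numbers are made up, and the function names (load_series, growth_rates, naive_forecast, report) are hypothetical. The "model" is just an average growth rate, not a real econometric model.

```python
import csv
import io

# Hypothetical annual series standing in for a download from an open data API.
RAW_CSV = """date,GDP
2013,100.0
2014,102.0
2015,105.06
"""


def load_series(text):
    """Datasource step: read a two-column CSV into (dates, values)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return [r["date"] for r in rows], [float(r["GDP"]) for r in rows]


def growth_rates(values):
    """Analysis step: period-on-period growth, in percent."""
    return [100 * (b / a - 1) for a, b in zip(values, values[1:])]


def naive_forecast(values):
    """Modelling step: extend the last value by the average growth rate."""
    rates = growth_rates(values)
    mean_rate = sum(rates) / len(rates)
    return values[-1] * (1 + mean_rate / 100)


def report(dates, values):
    """Presentation step: a plain-text summary instead of a spreadsheet."""
    lines = [f"{d}: {v:.2f}" for d, v in zip(dates, values)]
    lines.append(f"next period (naive forecast): {naive_forecast(values):.2f}")
    return "\n".join(lines)


dates, values = load_series(RAW_CSV)
print(report(dates, values))
```

Each step is a plain function, so the whole chain can be rerun from the raw data whenever the source updates, which is the reproducibility that is hard to get in a spreadsheet.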

Building a new workflow (important disclaimer: still not achieved for me) can go top-down (new tools, old data) or bottom-up (new data source first). Because of the quality of machine-readable data sources I went the 'start from the bottom' way, to be covered in the second part of the introduction.

Any questions here:

Does "datasource - analysis - model - presentation" structure hold for projects you worked on?
What other tools/sources do you find useful at each stage?
Other comments/ideas?

May write in Russian, will summarise later.

Rotzke commented 7 years ago

Projects in which I was involved were much more sophisticated, with two main flows:

ETL (Extract/Transform/Load) part for data engineers, or sometimes data architects or database administrators (DBA).
DAD (Discover/Access/Distill) for data scientists.

I was thinking about the project structure according to the README and the Cookiecutter layout; this task allocation model fits fine.

You didn't clearly say what the final desired form of the project is, @epogrebnyak, but I assume that a web application would be a great choice. We could make a backend part (raw data gathering and storing with our own scraping framework/array, plus MongoDB for keeping an archive so that scientists like you could easily get access to it), plus a frontend on, let's say, Django, which @MrBorusLee is proficient with, and Redux NoSQL for quick access to current, up-to-date data. Also, a REST API would be great for the convenience of both the team and clients.

As a concrete foundation for this, I would propose Amazon AWS, as mentioned before: it has a free tier for a whole year, which lets us perform tests and adjustments before full product deployment.

Also, it would be cool to put this issue on the agenda of the next online meeting, since I think it needs a live discussion.

Thanks for your attention!

epogrebnyak commented 7 years ago

I think the next post in the introduction should really be about an end-user specification and an architecture proposal for open economic datasets, before everyone gets really bored with all talk and no programming.

Just to give an insight, the idea is simple:

there are several raw data sources, ranging from Rosstat publications in Word (yes, Word) files, to html pages, to xls files at the Central Bank and Treasury, to dbf forms for bank reports and csv for corporate reports. There are also some 'open source' portals which do not seem to have an open API, just options to select something in the browser; will provide links later.
each data source is handled by a separate parser/crawler (or a parser-to-be-crawler) which extracts information from the raw immutable source into a 'data/processed' folder with stable CSV output. This CSV holds time series.
these time series are combined under a common namespace for variable names, eg GDP_rog for GDP rate of growth
on top of the individual parsers there should be a master program that invokes parsers when new data arrives to produce new datapoints (from daily to monthly frequency) and collects CSVs from different sources into a larger database.
the larger database should have a simple frontend that allows navigating through a variable list/menu and an API for the user to download data. The simplest form of API would be a stable link to a CSV file, right?
in short, I see the dataset as a collection of individual parsers + collection daemon + a master database + API to database + web frontend
there are some weak links/risks, like: is a parser stable on new data? is new data for previous dates a revision or an error in reading a file? and possibly many others.
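
A minimal sketch of how the pieces above could fit together, just to have something to argue about. It is not a proposed implementation: the inline CSV strings stand in for files a parser would write to 'data/processed', the values are made up, CPI_rog is a made-up second variable name, and the function names (read_csv_series, collect, to_csv) are hypothetical. A plain dict plays the role of the master database.

```python
import csv
import io

# Hypothetical parser outputs: each parser emits a stable CSV of
# (date, value) rows under a variable name from the common namespace.
PARSER_OUTPUTS = {
    "GDP_rog": "date,value\n2016-12-31,99.8\n2017-03-31,100.5\n",
    "CPI_rog": "date,value\n2017-03-31,100.1\n",
}


def read_csv_series(text):
    """Read one parser's CSV output into a {date: value} dict."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["date"]: float(row["value"]) for row in reader}


def collect(database, outputs):
    """Merge parser outputs into the database and report changed datapoints.

    A changed value for an existing (name, date) key may be a revision or a
    parsing error; this sketch only flags it and leaves the decision to a human.
    """
    changed = []
    for name, text in outputs.items():
        for date, value in read_csv_series(text).items():
            key = (name, date)
            if key in database and database[key] != value:
                changed.append((name, date, database[key], value))
            database[key] = value
    return changed


def to_csv(database, name):
    """Simplest possible 'API': one variable dumped as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", name])
    for (var, date), value in sorted(database.items()):
        if var == name:
            writer.writerow([date, value])
    return buffer.getvalue()


database = {}
collect(database, PARSER_OUTPUTS)
# A later run where the source reports a different value for an old date.
flags = collect(database, {"GDP_rog": "date,value\n2016-12-31,99.7\n"})
print(flags)
print(to_csv(database, "GDP_rog"))
```

The second call to collect shows the 'revision or error' risk: the sketch overwrites the old value and returns the conflict, but whether to accept it is exactly the kind of question the requirements should answer.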

Should I do a requirements writeup next?

Rotzke commented 7 years ago

@epogrebnyak don't forget about your devil - first talk and then programming :) You are right, Eugene, I think you could put this structure into the next issue so people could propose their solutions for each section in the comments, and then use this as a starting agenda for the meeting. Right?

epogrebnyak commented 7 years ago

I think it is a good idea to schedule a live session to discuss requirements?

It involves two questions actually:

end user requirements (no programming terms, just what the end user - an analyst or admin - wants)
architecture (solutions to requirements, basically, the building blocks)

(1) is often neglected, for good (the user is happy enough with whatever he gets) or for bad ("why exactly are we building this this way?"), so my role is to emphasise this first part, and probably @Rotzke can organise the discussion on the second part (design).

As for requirements, it would be great to analyse them against at least parts of this great checklist from Code Complete: http://www.matthewjmiller.net/files/cc2e_checklists.pdf, pages 5-6. Perhaps the design checklist is useful as well.

Other links on requirements, quite long reads, but still:

https://qracorp.com/write-clear-requirements-document (well organised in chapters, can glance through headers)
chapter 1 in SWEBOK (very technical)

On top of that I'd say prototyping is very important: let's build a small system first, learn from it and then add to it.

Rotzke commented 7 years ago

It is a great idea @epogrebnyak :) Also:

Agree with meeting agenda as well, Evgeny.
Agree with "small system first", huh.
