epogrebnyak opened 7 years ago
Maybe release a Python and/or R API client so users can access the API easily from Jupyter notebooks?
@neotheicebird Yeah, we discussed it at the meeting - using AWS API.
@Rotzke awesome! Just to keep us on the same page, I mean a python/R client side library apart from the web API development. Thanks
@neotheicebird Up to you, good sir :) Created a new issue on teams.
@neotheicebird - at least some standard code to access the data will be very useful. In pandas we have something like:

```python
import pandas as pd

dfm = pd.read_csv(url_m,
                  converters={'time_index': pd.to_datetime},
                  index_col='time_index')
```
This works to read monthly data from a stable URL, but it is slow to query the internet every time we run the program, so we may want a class to load/update data, similar to the one below (from here):
```python
class LocalDataset:
    def __init__(self, _id):
        self._id = _id
        try:
            self.ts = get_local_data_as_series(_id)
        except FileNotFoundError:
            print("Cannot load from file for id: " + str(_id))
            self.update()

    def update(self):
        self.ts = get_data_as_series(self._id)
        save_local_data(self._id, self.ts)
        return self
```
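To show the caching idea end to end, here is a self-contained sketch of the class above with placeholder helpers filled in. The cache location, file naming, and the dummy `get_data_as_series` fetch are all assumptions for illustration; a real client would query the project API instead:

```python
import os
import tempfile

import pandas as pd

CACHE_DIR = tempfile.mkdtemp()  # where local copies live (assumption)

def cache_path(_id):
    return os.path.join(CACHE_DIR, "{}.csv".format(_id))

def get_data_as_series(_id):
    # Placeholder for the real remote query -- returns dummy monthly data.
    index = pd.date_range("2017-01-01", periods=3, freq="MS")
    return pd.Series([1.0, 2.0, 3.0], index=index, name=_id)

def save_local_data(_id, ts):
    ts.to_csv(cache_path(_id), header=True)

def get_local_data_as_series(_id):
    df = pd.read_csv(cache_path(_id), index_col=0, parse_dates=True)
    return df.iloc[:, 0]

class LocalDataset:
    """Loads a series from a local CSV cache, falling back to a remote fetch."""
    def __init__(self, _id):
        self._id = _id
        try:
            self.ts = get_local_data_as_series(_id)
        except (IOError, OSError):
            print("Cannot load from file: " + cache_path(_id))
            self.update()

    def update(self):
        self.ts = get_data_as_series(self._id)
        save_local_data(self._id, self.ts)
        return self
```

First construction fetches and saves a local copy; later constructions read from disk.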
Maybe this can be a client / small library / pypi package, but for now we can just agree on some preferred code to download and manipulate the data. Updated the pipeline to indicate this.
Awesome, didn't know about pd.read_csv having a URL arg
@epogrebnyak the code example and a simple pypi package to access API sounds good
@neotheicebird @epogrebnyak Guys, we have Slack for chatting! :)
Based on discussion with @Rotzke, updated pipeline:
Some more detail on pipeline, based on mini-kep:
Raw data:
Parsing:
Transformation:
Frontend:
End-user:
This is to discuss the role of an interim database.
My thoughts are still about a minimum working example (MWE) for several parsers that can produce compatible output, and a pipeline that allows them to work together. Here is an example of this kind.
The end user wants to calculate Russian monthly non-oil export and see this figure in roubles. The task is a bit simplistic, but it is still a long way from everyday Excel calculations: we need something that pulls data from different sources.
The formula will be:

```
EXPORT_EX_OIL = FX_USD * (EXPORT_GOODS_TOTAL - NAT_EXPORT_OIL_t * PRICE_BRENT * CONVERSION_FACTOR)
```

where:

- `EXPORT_EX_OIL` - non-oil export, rub
- `FX_USD` - exchange rate, rub/usd
- `EXPORT_GOODS_TOTAL` - total goods export
- `NAT_EXPORT_OIL_t` - oil export volume, mln t
- `PRICE_BRENT` - oil price, usd/barrel
- `CONVERSION_FACTOR` - about 6.3 b/t
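With consistent units the formula is a one-liner. The input values below are made-up illustrations, not actual statistics:

```python
# Illustrative inputs only -- made-up numbers, not real observations
FX_USD = 60.0                 # exchange rate, rub/usd
EXPORT_GOODS_TOTAL = 30000.0  # total goods export, mln usd
NAT_EXPORT_OIL_t = 20.0       # oil export volume, mln t
PRICE_BRENT = 50.0            # oil price, usd/barrel
CONVERSION_FACTOR = 6.3       # barrels per tonne, as stated above

# Non-oil export in roubles (mln rub, given the inputs above)
EXPORT_EX_OIL = FX_USD * (EXPORT_GOODS_TOTAL
                          - NAT_EXPORT_OIL_t * PRICE_BRENT * CONVERSION_FACTOR)
print(EXPORT_EX_OIL)  # 1422000.0
```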
The sources are:
Multiple data sources. Imagine you have working parsers for Rosstat, EIA and Bank of Russia publications. Each parser will produce output as CSV files in its `data/processed` folder. To complete the task, the end user queries the URLs of the `data/processed` folders with `pd.read_csv` and merges the dataframes. The rest is calculation on dataframes.
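The query-and-merge step above can be sketched as follows. The CSV strings stand in for the `data/processed` URLs (which do not exist yet, so they are assumptions here); in practice the URLs would be passed straight to `pd.read_csv`:

```python
import io

import pandas as pd

# Stand-ins for CSVs served from each parser's data/processed folder;
# in practice these would be raw URLs passed directly to pd.read_csv.
csv_cbr = "time_index,FX_USD\n2017-01-31,60.0\n2017-02-28,58.0\n"
csv_eia = "time_index,PRICE_BRENT\n2017-01-31,55.0\n2017-02-28,56.0\n"

def read_processed(source):
    # source may be a URL or a file-like object -- pd.read_csv accepts both
    return pd.read_csv(source, index_col="time_index",
                       parse_dates=["time_index"])

dfs = [read_processed(io.StringIO(csv_cbr)),
       read_processed(io.StringIO(csv_eia))]

# Align the parser outputs on the common time index
merged = pd.concat(dfs, axis=1)
print(merged)
```

From `merged`, the rest is ordinary column arithmetic on the dataframe.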
For this to work well:
This is a parser-to-notebook solution, no database, no API.
Single data source. Imagine someone took the burden to collect the output CSVs into one dataframe for you and told you: this is your reference dataset, go ahead with it. In other words, someone took care of problems #1, #5, #6 and, hopefully, #7. You deal with just one URL, but when needed you can check it at the source. This single data source may be a meta-parser and can probably also be a github repo.
There is still no single database and no API, but this:
Still not convinced where exactly an interim database fits (storing parsing inputs?), but so far @Rotzke says we need one, so I take it for granted.
A kind of little roadmap to keep going, I think, is the following:

- parsers write output to their `data/processed` folders
- collect the `data/processed` outputs to a single CSV

From this skeleton we can quickly do a common database, a database API and much other magnificent stuff (even an interim database), as well as add more parsers.
Hope someone still wants to do this (this way). ;)
After 20.06.17 videochat, brief notes: Our pipeline to work with data is the following:
todo to follow!
Our project is about aggregating data from individual parsers under a common namespace and releasing the data through a final API (correct me if something is missing):

- each parser produces output in its `data/processed` folder.
- `data/processed` CSVs from several parsers are collected in a common database.

Comments welcome.