martin-galajda / ML-soccer-data-project


Convert DB data into better format (CSV?) #1

Closed martin-galajda closed 6 years ago

martin-galajda commented 6 years ago

Right now we have the data in the form of an SQLite database.

It would be nice to convert it into CSV or something similar so we can load it easily into the R environment (and not deal with relations, ids, etc.).

Some preprocessing in a scripting language (JavaScript, Python) would probably be the best option for this.

aiorla commented 6 years ago

Some thoughts about the issue:

martin-galajda commented 6 years ago
  • How much preprocessing do we want in the scripting language?
    • (e.g. do we create the features "home-team last result" or "away-team last season win %" entirely in it?) IMO we do all reasonable work in R (though doing the initial SQL => CSV step there is probably not recommended).

I would say that we just parse all the data from the database ("flatten" the data that is connected in the db by ids) and export it to CSV (maybe multiple CSV files will be needed). Then we can do feature extraction and additional cleaning in R (as they are probably mostly interested in our work with R, if I understand it correctly).
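A minimal sketch of that "flatten and export" step, using only Python's standard `sqlite3` and `csv` modules. The `games` and `teams` tables, their columns, and the demo rows are hypothetical stand-ins and not the project's actual schema; the idea is just that a JOIN replaces the team ids with names before the rows are written out.

```python
import csv
import io
import sqlite3

# Hypothetical schema: games reference teams by id. The JOINs replace
# those ids with readable names so the CSV needs no further lookups.
FLATTEN_GAMES = """
SELECT g.id        AS game_id,
       g.season    AS season,
       home.name   AS home_team,
       away.name   AS away_team,
       g.home_goals AS home_goals,
       g.away_goals AS away_goals
FROM games AS g
JOIN teams AS home ON home.id = g.home_team_id
JOIN teams AS away ON away.id = g.away_team_id
ORDER BY g.id
"""


def write_query_as_csv(conn, query, out):
    """Run `query` and write a header row plus all result rows to `out`."""
    cursor = conn.execute(query)
    header = [column[0] for column in cursor.description]
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(cursor)
    return header


def build_demo_db():
    """Create a tiny in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE games (
            id INTEGER PRIMARY KEY,
            season TEXT,
            home_team_id INTEGER REFERENCES teams(id),
            away_team_id INTEGER REFERENCES teams(id),
            home_goals INTEGER,
            away_goals INTEGER
        );
        INSERT INTO teams VALUES (1, 'Team A'), (2, 'Team B');
        INSERT INTO games VALUES (10, '2015/2016', 1, 2, 2, 1);
        INSERT INTO games VALUES (11, '2015/2016', 2, 1, 0, 0);
        """
    )
    return conn


if __name__ == "__main__":
    buffer = io.StringIO()
    write_query_as_csv(build_demo_db(), FLATTEN_GAMES, buffer)
    print(buffer.getvalue(), end="")
```

To write a real file instead of a buffer, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")`, and point `sqlite3.connect` at the database file.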

What do you think? @aiorla @floooko

aiorla commented 6 years ago

Yeah, sounds :ok_hand:. Probably there is even a way to do this small step in R but it will probably be easier/shorter in Python. And yeah I also got the feeling that we were going to be evaluated by our R code.

floooko commented 6 years ago

I've never worked with Julia before; I've worked a lot with JS and a bit with Python. In the end they are all the same and Google helps to find the correct syntax ;-) If you don't like JS, @aiorla, I'm absolutely fine with Python. And as you said before, we will do most of the work in R anyway. If it's possible to export directly using SQL, that might be the best option; otherwise let's do it in Python.

aiorla commented 6 years ago

I didn't want to sound harsh about JS. I've never coded in it; it's just that I'm afraid of mixing up syntaxes, so I try to learn as few of them as possible... 😅

aiorla commented 6 years ago

I've created a first version of a conversion script. It's in R because I found out that there is an RSQLite library that basically gives you the tables as data.frames with 2 instructions.

There are still some TODOs to close this issue (noted in the code). The main problem that I've found is that the "detailed" info about a game (who scored, how many fouls...) is buried in strings. Someone will have to look into what to do about it (or I'll try to do it next week).

PS: I'm not sure if I should have created a PR before pushing to master and discussed it there. What do you think?
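For comparison, a rough Python take on the same "give me every table" idea, using only the standard library. This is a sketch and not the R script mentioned above: `list_tables` and `dump_table` are hypothetical helper names, and the demo tables are made up.

```python
import csv
import io
import sqlite3


def list_tables(conn):
    """Return the names of all user tables, skipping SQLite's internal ones."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    )
    return [row[0] for row in rows]


def dump_table(conn, table, out):
    """Write one table (header plus rows) as CSV to `out`; return the row count."""
    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    writer = csv.writer(out)
    writer.writerow([column[0] for column in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


def build_demo_db():
    """Tiny in-memory database with made-up tables, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE leagues (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO teams VALUES (1, 'Team A'), (2, 'Team B');
        INSERT INTO leagues VALUES (1, 'League X');
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    for table in list_tables(conn):
        buffer = io.StringIO()
        rows_written = dump_table(conn, table, buffer)
        print(f"{table}: {rows_written} rows")
        print(buffer.getvalue(), end="")
```

In practice each table would go to its own file, e.g. a handle from `open(f"{table}.csv", "w", newline="", encoding="utf-8")`, which R can then load with `read.csv`.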