mcekovic / tennis-crystal-ball

Ultimate Tennis Statistics and Tennis Crystal Ball - Tennis Big Data Analysis and Prediction
Apache License 2.0
252 stars 36 forks

Installation Instructions #232

Open nneha opened 6 years ago

nneha commented 6 years ago

How can I fully populate the DB with all available data?

mcekovic commented 6 years ago

[Updated Instructions 03.02.2021.]

  1. Ensure Java JDK 11 is installed on the machine (JDK 11.0.10+ preferred).

  2. Ensure PostgreSQL 13 (13.1+ preferred) is installed.

  3. Log in to the postgres PostgreSQL database as the admin user (usually postgres) and create the tcb database: https://github.com/mcekovic/tennis-crystal-ball/blob/master/crystal-ball/src/main/db/create-db.sql

  4. Log in to the tcb PostgreSQL database as the admin user (usually postgres) and create the tcb schema and the tcb user with password tcb: https://github.com/mcekovic/tennis-crystal-ball/blob/master/crystal-ball/src/main/db/create-login.sql

  5. Clone the UTS Git repo: git clone https://github.com/mcekovic/tennis-crystal-ball.git

  6. Clone Jeff Sackmann's ATP data repo to a local directory as well: git clone https://github.com/JeffSackmann/tennis_atp.git. UTS is guaranteed to be compatible with this repo as of tennis_atp commit 070571f00420d5480086f75c7019b8518c4b338b (2 Feb 2021).

  7. Build UTS by going to the UTS local source directory and running gradlew.bat assemble (Windows) or sh ./gradlew assemble (Linux).

  8. Unzip the content of data-load/build/distributions/data-load-1.0-SNAPSHOT.zip created by step 7.

  9. Install the PostgreSQL extensions required for the schema: data-load-1.0-SNAPSHOT/bin/data-load.bat -ie -u postgres -p postgres (on Linux, use data-load.sh). If you created the PostgreSQL admin user with a different username/password, change the -u postgres -p postgres arguments in the above command line accordingly.

  10. Install a Selenium WebDriver for Firefox (https://github.com/mozilla/geckodriver/releases) or Chrome (https://chromedriver.chromium.org/downloads) and ensure the browser for the chosen WebDriver is installed. Edit data-load-1.0-SNAPSHOT/bin/data-load.bat (or data-load.sh on Linux) and update the WebDriver location (-Dwebdriver.gecko.driver=...).

  11. Run the database schema creation: data-load-1.0-SNAPSHOT/bin/data-load.bat -cd (on Linux, use data-load.sh). If you created the PostgreSQL login with a different username/password, configure it by adding -u tcb -p tcb (adjusted accordingly) to the above command line.

  12. Run the data load (an Internet connection is required during the load to fetch additional data from the ATP web site): data-load-1.0-SNAPSHOT/bin/data-load.bat -lt -c 10 -bd %full path of tennis_atp data directory by Jeff Sackmann% (on Linux, use data-load.sh). The parameter -c 10 means 10 database connections are used for parallel loading where applicable, a good setting for medium hardware with ~8 logical processors (i.e. a 4-core processor with hyper-threading). Again, if you created the PostgreSQL login with a different username/password, configure it by adding -u tcb -p tcb (adjusted accordingly) to the above command line.

  13. Wait ~2 hours
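For reference, steps 3-12 above can be sketched as a single Linux shell session. This is an illustration, not an official script: it assumes psql, git, unzip and JDK 11 are on the PATH, uses the default usernames/passwords from the instructions, clones into the current directory, and skips the WebDriver setup of step 10 (which needs a manual edit of data-load.sh). Adjust paths and credentials to your setup.

```shell
#!/bin/sh
set -e

# Steps 5-6: clone UTS and Jeff Sackmann's ATP data (done first so the
# SQL scripts used in steps 3-4 are available locally).
git clone https://github.com/mcekovic/tennis-crystal-ball.git
git clone https://github.com/JeffSackmann/tennis_atp.git

# Steps 3-4: create the tcb database, then the tcb schema and login
# (psql may prompt for the admin password).
cd tennis-crystal-ball
psql -U postgres -d postgres -f crystal-ball/src/main/db/create-db.sql
psql -U postgres -d tcb      -f crystal-ball/src/main/db/create-login.sql

# Steps 7-8: build UTS and unpack the data-load distribution.
sh ./gradlew assemble
unzip data-load/build/distributions/data-load-1.0-SNAPSHOT.zip \
    -d data-load/build/distributions

# Steps 9 and 11: install PostgreSQL extensions, then create the schema
# (the -cd run uses the default tcb/tcb login).
DATA_LOAD=data-load/build/distributions/data-load-1.0-SNAPSHOT/bin/data-load.sh
sh "$DATA_LOAD" -ie -u postgres -p postgres
sh "$DATA_LOAD" -cd

# Step 12: run the data load (Internet connection required);
# the tennis_atp path here is an example.
sh "$DATA_LOAD" -lt -c 10 -bd "$HOME/tennis_atp"
```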

tomeydemarcos commented 4 years ago

Hi! Thanks for this. I'm having issues with the data load.

Oct 04, 2020 1:58:32 PM groovy.sql.Sql withBatch
WARNING: Error during batch execution of 'CALL load_ranking(?, ?, ?, ?)' with message:
Batch entry 974 CALL load_ranking('2019-11-18 -03', 208109, 1260, 4) was aborted: ERROR: deadlock detected
  Detail: Process 3250 waits for ShareLock on transaction 60948; blocked by process 3251. Process 3251 waits for ShareLock on transaction 60947; blocked by process 3250.
  Hint: See server log for query details.
  Where: while inserting index tuple (15361,70) in relation "player_ranking"
SQL statement "INSERT INTO player_ranking (rank_date, player_id, rank, rank_points) VALUES (p_rank_date, l_player_id, p_rank, p_rank_points) ON CONFLICT (rank_date, player_id) DO UPDATE SET rank = p_rank, rank_points = p_rank_points"
PL/pgSQL function load_ranking(date,integer,integer,integer) line 9 at SQL statement
Call getNextException to see other errors in the batch.

and then it loads the file:

Rows: 916296 in 2.404 min (6363 row/s)
Loading file '/Users/tomey/tennis_atp/atp_rankings_current.csv' ..................
Rows: 17401 in 2.783 s (8700 row/s)
Total rows: 2857080 in 7.631 min (6251 row/s)

and then a 503 error appears:

Fixing rank points...
Loading pre-ATP rankings...
Loading missing rankings...
Fetching rankings URL 'https://www.atptour.com/en/rankings/singles?rankDate=1977-08-23&rankRange=1-500'
Exception occurred: org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=https://www.atptour.com/en/rankings/singles?rankDate=1977-08-23&rankRange=1-500 [retry 1 follows]

Is it broken?

thanks!!!

mcekovic commented 4 years ago

Re the 1st issue: which PostgreSQL version are you using? Re the 2nd issue: unfortunately ATP has introduced anti-scraping protection (Cloudflare), so it does not work anymore.

tomeydemarcos commented 4 years ago

I'm using PostgreSQL 12.1. Should I use a lower version? Anyway, I have successfully set up an EC2 instance with the Docker image. Do I have all the data in that container, or do I also need Jeff Sackmann's data? I need data from 2019 only.

thanks for the good work!


mcekovic commented 3 years ago

In the Docker image you have the whole 2019 season (up to and including), but not 2020, so for 2019 you do not need anything more. Re the deadlock: it is actually a warning, not an error. When the multi-threaded, batched rankings load is running, two parallel batches can deadlock by updating two players in reverse order; in that case the batches are further divided and retried, so no data is lost.
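The harmlessness of that warning comes down to retrying on a transient failure. A minimal shell sketch of the idea is below; it is only an analogue, not the loader's actual code (which, per the explanation above, also splits a failed batch in half before retrying, inside Groovy's Sql.withBatch):

```shell
# retry MAX CMD [ARGS...]
# Run CMD, retrying up to MAX attempts with a short pause in between.
# Returns 0 on the first success, 1 if every attempt fails.
# Simplified analogue of the loader's behaviour: a real deadlock victim
# would also have its batch split before being retried.
retry() {
  max=$1; shift
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1   # brief back-off; transient deadlocks usually clear quickly
  done
}
```

Because this retry happens inside the loader itself, no user action is needed when the deadlock warning appears in the log.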