Test workflow - Githubissues

hannesdatta commented 2 years ago

Please test-run the workflow, report any issues that occur, and come up with a list of suggestions on how to improve the workflow.

Minimum requirements are:

download raw data
reproduce all results
consistently use make
use versioning

srosh2000 commented 2 years ago

Hi @hannesdatta ,

I am currently stuck: I am unable to read the large all_books.csv file, which is almost 4 GB. Error: Limited virtual memory

SessionInfo output for more insights into my machine specs:

I am currently working on a t2.micro EC2 instance. Would the solution to this issue be to upgrade to another instance type with more RAM?

I found this thread, which deals with the same issue, but the solution is still not clear to me: https://github.com/Rdatatable/data.table/issues/3526

hannesdatta commented 2 years ago

Totally, this is a memory issue caused by your small instance. Try one of the medium or large ones and pause it when you are not working on it. Save the bills from AWS so I can reimburse you eventually. Keep a maximum cloud budget for this month, say 100 EUR, and get in touch should it not suffice. Ok?

Further, you can test the workflow on small datasets first and only then take it to the big machine. Make a sensible decision here.
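One way to try the workflow on a small dataset first is to read only part of the file. The snippet below is a minimal sketch, assuming the data.table package is installed; the temporary file and its columns (id, rating) are made-up stand-ins for all_books.csv, not the project's real data.

```r
library(data.table)

# Hypothetical stand-in for the large all_books.csv: a small temporary file.
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:1000, rating = rep(1:5, times = 200)), tmp)

# Read only the first 100 rows instead of the whole file.
sample_dt <- fread(tmp, nrows = 100)

# Reading only the columns that are needed reduces memory use further.
ids_only <- fread(tmp, select = "id", nrows = 100)
```

Note that nrows returns the first rows of the file, not a random sample, so this is suited to checking that the code runs rather than to checking results.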

Thanks for your work!

srosh2000 commented 2 years ago

I just finished testing the entire workflow. After some trials with different instance types, I used a t2.xlarge instance for a smoother and faster workflow. I have versioned the minor changes I made in my fork of the repo: https://github.com/srosh2000/covid-19-book-consumption.git.

To quickly spell out the minor changes made/suggestions for improvement per src file:

data_download.R: add drive_deauth() to remove authorization requirements that may temporarily interrupt the make workflow.
data_download.R: give the file path directly in the fread command. Calling write.csv2 is redundant.
add_re_scrape.R: the for loop in line 59 can be replaced with a simple merge command, i.e. merge(user_info, info_book_based, by.x = "User.Name", by.y = reader id)... but it kept throwing an error, so I need to look into this a bit more.
Regression.R: for the analysis code, I took a small random sample for testing purposes, as the for loops on the large all_books df were taking ages. Maybe there is a more efficient alternative to the for loop in line 45? The same applies to line 55 in regression_age.R; line 46 in regression_fanatic_country.R; line 55 in regression_gender.R; line 54 in regression_robustness.R.
Regression.R: because the y variable contains zeros, running the regression on the log-transformed y throws errors. It is better to use log(x+1) or to remove the zeros before taking the log.
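The merge and log(x+1) suggestions can be sketched on toy data as follows. This is only an illustration in base R: the column name reader_id, the books_read variable, and all values are assumptions made for the example, not taken from the repo. Note that merge expects by.x and by.y as quoted column names.

```r
# Toy stand-ins for the real tables; all values are made up.
user_info <- data.frame(
  User.Name = c("anna", "ben", "cara"),
  country   = c("NL", "DE", "US"),
  stringsAsFactors = FALSE
)
info_book_based <- data.frame(
  reader_id  = c("anna", "anna", "cara"),  # hypothetical key column name
  books_read = c(3, 0, 12),
  stringsAsFactors = FALSE
)

# One merge call instead of a row-by-row loop. The key columns are passed
# as character strings; all.x = TRUE keeps users without any matching row.
merged <- merge(user_info, info_book_based,
                by.x = "User.Name", by.y = "reader_id", all.x = TRUE)

# log(0) is -Inf, which breaks a regression on the log-transformed variable.
# log1p(x) computes log(1 + x), so zeros map to 0 instead.
merged$log_books <- log1p(merged$books_read)
```

Whether adding 1 or dropping the zeros is the better choice depends on the analysis; the two options estimate different things, so this is a judgment call rather than a drop-in fix.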

hannesdatta commented 2 years ago

Discussed; the workflow has been tested by @srosh2000. Thanks!

hannesdatta / covid-19-book-consumption

Test workflow #1