mikeizbicki / cmc-csci143

big data course materials

What data are people using for the final project? #561

Closed: ains-arch closed 6 months ago

ains-arch commented 6 months ago

I'm currently using the twitter data + schema from the pg_normalized database from twitter_postgres_parallel, but that doesn't have enough rows to meet the requirements for this assignment. I really... do not want to deal with the data from indexes again, but is that the only dataset we've worked with that's big enough? Can I somehow insert the twitter data from twitter_postgres_parallel multiple times, since data integrity doesn't matter for this project afaik? Would appreciate any insights on how people have approached this.
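
For concreteness, the kind of thing I'm imagining is below. The table and column names are just my guess at the pg_normalized schema (column list abbreviated), so treat it as a sketch rather than something I know works: each run would re-insert every tweet with its primary key shifted past the current maximum, so the copies don't collide with the originals.

```python
# hypothetical sketch: duplicate the tweets table in place
# id_tweets (PRIMARY KEY) gets shifted past the current max so copies don't collide;
# id_users (REFERENCES users) can stay the same since those users already exist
import sqlalchemy

# placeholder connection URL; swap in your own container's user/password/port
engine = sqlalchemy.create_engine('postgresql://postgres:pass@localhost:5432/')

duplicate_tweets = sqlalchemy.text('''
    INSERT INTO tweets (id_tweets, id_users, text)
    SELECT id_tweets + (SELECT max(id_tweets) FROM tweets),
           id_users,
           text
    FROM tweets;
    ''')

with engine.begin() as connection:
    connection.execute(duplicate_tweets)
```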

westondcrewe commented 6 months ago

Hi! Frankly, I just didn't want to deal with the existing data that we've been using in past assignments, so within my load_data.py file I generated fake data for each of my tables.

I pass command line arguments to load_data.sh, so my call looks like ./load_data.sh <num_users> <num_tweets> <num_urls> and my Python file knows how many fake rows to generate for each table. I also made sure that the code creating these fake users, tweets, and urls upholds the constraints of my database schema.

Some fields are easier to make up values for (like anything with a PRIMARY KEY constraint), but constraints like REFERENCES and UNIQUE can be more difficult to guarantee.
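
To make that concrete, here's a stripped-down sketch of the generation step. The table and column names are placeholders for whatever your schema uses, and the actual INSERTs into postgres are left out:

```python
#!/usr/bin/python3
'''
Generate fake rows for three example tables: users, tweets, and urls.
Usage: load_data.py <num_users> <num_tweets> <num_urls>
'''
import random
import sys

num_users, num_tweets, num_urls = (int(arg) for arg in sys.argv[1:4])

# PRIMARY KEY columns are easy: a counter is guaranteed to be unique
users = [{'id_users': i, 'name': f'fake user {i}'} for i in range(num_users)]

# REFERENCES columns must point at rows that actually exist,
# so draw id_users only from the users generated above
tweets = [
    {'id_tweets': i, 'id_users': random.randrange(num_users), 'text': f'fake tweet {i}'}
    for i in range(num_tweets)
]

# UNIQUE columns need values that never repeat;
# deriving them from the primary key is one simple way to guarantee that
urls = [{'id_urls': i, 'url': f'https://example.com/{i}'} for i in range(num_urls)]
```

The pattern is the same for any table: counters for primary keys, draws from already-generated ids for foreign keys, and values derived from the counter for anything UNIQUE.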

If you decide to go this route instead of using the existing Twitter data, then hopefully my comment is helpful!

mikeizbicki commented 6 months ago

I personally find random data easier to work with, like @westondcrewe said. But you all are welcome to use the real twitter data if that's easier.