ains-arch closed this issue 6 months ago.
Hi! Frankly, I just didn't want to deal with the existing data that we've been using in past assignments, so within my load_data.py file I generated fake data for each of my tables.
I pass command line arguments to load_data.sh, so the call looks like ./load_data.sh <num_users> <num_tweets> <num_urls>. That way my Python file knows how many fake rows to generate for each table, and I made sure the code that creates the fake users, tweets, and urls upholds the constraints of my database schema.
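In case it helps, here's roughly what the argument-passing side could look like; I'm assuming a load_data.py that takes the three counts as positional arguments, which is just one way to wire it up (names are placeholders, not my exact file):

```python
# load_data.py -- rough sketch of receiving the counts from load_data.sh
import argparse

def parse_args():
    # The three counts come straight from the shell script:
    #   ./load_data.sh <num_users> <num_tweets> <num_urls>
    parser = argparse.ArgumentParser(description='Generate fake rows for each table')
    parser.add_argument('num_users', type=int)
    parser.add_argument('num_tweets', type=int)
    parser.add_argument('num_urls', type=int)
    return parser.parse_args()

if __name__ == '__main__':
    args = parse_args()
    print(f'generating {args.num_users} users, {args.num_tweets} tweets, {args.num_urls} urls')
```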
Some fields are easier to make up values for (anything with a PRIMARY KEY constraint, for example), but constraints like REFERENCES and UNIQUE can be more difficult to guarantee.
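In case it's useful, here's a very rough sketch of what I mean by generating data that respects those constraints. The table and column names below are placeholders rather than my actual schema, so you'd swap in whatever your tables need:

```python
# Sketch of constraint-friendly fake data generation (placeholder schema).
import random
import string

def random_text(n=12):
    return ''.join(random.choices(string.ascii_lowercase, k=n))

def generate_fake_rows(num_users, num_tweets, num_urls):
    # PRIMARY KEY: sequential ids are trivially unique
    users = [{'id_users': i, 'name': random_text()} for i in range(num_users)]

    # REFERENCES: only point foreign keys at ids that actually exist
    user_ids = [u['id_users'] for u in users]
    tweets = [{'id_tweets': i, 'id_users': random.choice(user_ids), 'text': random_text(50)}
              for i in range(num_tweets)]

    # UNIQUE: dedupe candidates with a set before assigning the column
    seen = set()
    urls = []
    while len(urls) < num_urls:
        candidate = 'https://example.com/' + random_text()
        if candidate not in seen:
            seen.add(candidate)
            urls.append({'id_urls': len(urls), 'url': candidate})
    return users, tweets, urls
```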
If you decide to go this route instead of the existing Twitter data, then hopefully my comment is helpful!
I personally find random data easier to work with, like @westondcrewe said. But you all are welcome to use the real twitter data if that's easier.
I'm currently using the twitter data + schema from the pg_normalized database in twitter_postgres_parallel, but that's not enough rows to meet the requirements for this assignment. I really... do not want to deal with the data from indexes again, but is that the only dataset we've worked with that's big enough? Could I instead insert the twitter data from twitter_postgres_parallel multiple times, since data integrity doesn't matter for this project afaik? Would appreciate any insights on how people have approached this.
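Something like this is what I had in mind, assuming I load through Python with psycopg2 and offset the ids on each pass so primary keys don't collide; the table and column names are guesses and I have no idea if this is the sane way to do it:

```python
# Sketch of the "insert the same data N times" idea (placeholder schema).
import psycopg2

EXTRA_COPIES = 4              # how many extra copies of the original rows to insert
ORIGINAL_ID_LIMIT = 10_000_000  # assume original ids are all below this

conn = psycopg2.connect('postgresql://postgres:pass@localhost:5432/postgres')
cur = conn.cursor()

for i in range(1, EXTRA_COPIES + 1):
    # Shift primary keys by a big constant each pass so new rows don't
    # collide with the originals or with earlier copies.
    offset = i * ORIGINAL_ID_LIMIT
    cur.execute('''
        INSERT INTO users (id_users, name)
        SELECT id_users + %s, name
        FROM users
        WHERE id_users < %s;
    ''', (offset, ORIGINAL_ID_LIMIT))
    # Any table that REFERENCES users would need the same offset applied
    # to its foreign key columns as well.

conn.commit()
```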