mikeizbicki / cmc-csci143

big data course materials

loading data error #508

Closed: myngpog closed this issue 7 months ago

myngpog commented 7 months ago

When I run this command: sh load_tweets_parallel.sh

My load pg_denormalized step works fine, and I think load pg_normalized_batch also works fine, inserting all the tweets and such (the output is basically a wall of text), but when it reaches the end, I get this error:

...
protected998': False, 'url998': 'http://www.kyhi.org', 'verified998': False, 'withheld_in_countries998': None, 'statuses_count999': 14190, 'name999': 'D1969\U0001f9e2🍎🦉', 'location999': None, 'listed_count999': 4, 'id_users999': 1163126754057363456, 'updated_at999': 'Wed Jan 06 23:00:20 +0000 2021', 'description999': 'Enjoy working with the public!', 'favourites_count999': 8053, 'screen_name999': 'D196910', 'created_at999': 'Sun Aug 18 16:34:12 +0000 2019', 'friends_count999': 3226, 'protected999': False, 'url999': None, 'verified999': False, 'withheld_in_countries999': None}]
(Background on this error at: https://sqlalche.me/e/14/f405)
Command exited with non-zero status 10
242.34user 31.56system 0:32.90elapsed 832%CPU (0avgtext+0avgdata 3056156maxresident)k
856inputs+20432outputs (4major+17874760minor)pagefaults 0swaps

I transferred the exact same file from my working Postgres parallel assignment that passed on the GitHub issues.
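
(For context on the wall of parameters in the error above: keys like 'url998' and 'url999' appear when one INSERT statement is compiled with a separate bind parameter per row per column. Below is a minimal sketch of that batching pattern; the table name, column names, and connection URL are placeholders, not necessarily the course's exact code.)

import sqlalchemy

# build one multi-row INSERT with numbered bind parameters; this is the
# pattern that produces keys like url998/url999 in the error message
engine = sqlalchemy.create_engine('postgresql://user:pass@localhost/db')  # hypothetical URL
rows = [
    {'id_users': 1, 'url': None},
    {'id_users': 2, 'url': 'http://example.com'},
]
binds = []
params = {}
for i, row in enumerate(rows):
    binds.append(f'(:id_users{i}, :url{i})')
    params[f'id_users{i}'] = row['id_users']
    params[f'url{i}'] = row['url']
sql = sqlalchemy.sql.text('INSERT INTO users (id_users, url) VALUES ' + ','.join(binds))
with engine.begin() as conn:
    conn.execute(sql, params)  # on failure, the full statement and every parameter get printed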

ains-arch commented 7 months ago

What's the wall of text? If it doesn't look like the unzipping counter counting up (i=n), then I don't think it's working.

Based on that one line, you might also look at this:

Then I reran load_tweets_parallel.sh and got different errors, seemingly related to the urls column of the user table. If that's you, I'd double-check that the columns in the schema he gives us for this homework match the columns you used (and therefore reference in load_tweets_batch.py) in the previous homework, and if not, change your .py.
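
A quick way to do that check (a sketch; the connection URL, table name, and the example column set are placeholders, not the assignment's real values):

import sqlalchemy

engine = sqlalchemy.create_engine('postgresql://user:pass@localhost:5432/db')  # hypothetical URL
inspector = sqlalchemy.inspect(engine)

# columns postgres actually has for the users table
db_columns = {c['name'] for c in inspector.get_columns('users')}
# columns your INSERT in load_tweets_batch.py references (example values)
py_columns = {'id_users', 'screen_name', 'url', 'verified'}

print('in schema but not in loader:', db_columns - py_columns)
print('in loader but not in schema:', py_columns - db_columns)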

myngpog commented 7 months ago

[screenshot: Capture]

I know pasting text is the norm, but this is what I was talking about. It goes on for a long time until it hits the error I already pasted, and I assume it's the tweets, but I'm not too sure.

Do you have a screenshot of what it's supposed to output? This is normalized batch, btw; denormalized works as intended.

ains-arch commented 7 months ago

Yeah, so I guess when the load-tweets scripts hit SQL errors while running, they print the entire query to the screen, which is great, except our data is Big, so the errors come out kind of unreadable.
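
One way to keep them readable is to truncate the error before printing it yourself; a sketch, assuming the loader funnels its inserts through a helper like this (the helper name is made up):

import sqlalchemy

def insert_batch(connection, sql, params):
    try:
        connection.execute(sql, params)
    except sqlalchemy.exc.SQLAlchemyError as e:
        # str(e) embeds the full compiled statement plus all ~1000 rows of
        # bind parameters; print only the first few hundred characters
        print('insert failed:', str(e)[:500], '...')
        raise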

here's what normalized batch is supposed to output:

================================================================================
load pg_normalized_batch
================================================================================
2024-04-14 18:21:40.620323 /data/tweets/geoTwitter21-01-05.zip
2024-04-14 18:21:59.757400 insert_tweets i= 0
2024-04-14 18:22:02.224318 insert_tweets i= 1
2024-04-14 18:22:02.801547 insert_tweets i= 2
...
2024-04-14 19:10:42.905619 insert_tweets i= 134
2024-04-14 19:10:43.426641 insert_tweets i= 135
2024-04-14 18:21:40.615824 /data/tweets/geoTwitter21-01-04.zip
2024-04-14 18:21:59.758113 insert_tweets i= 0
2024-04-14 18:22:00.759176 insert_tweets i= 1
...
2024-04-14 19:16:10.368355 insert_tweets i= 140
2024-04-14 19:16:10.873256 insert_tweets i= 141
26382.04user 1072.78system 54:42.40elapsed 836%CPU (0avgtext+0avgdata 4500084maxresident)k
24inputs+90496outputs (0major+517876602minor)pagefaults 0swaps

It's about 31,600 lines in total. I may have fiddled with the print statement at some point so that it prints every tweet instead of every 100 tweets or something; I'm honestly not sure. But broadly, this is what it should look like.
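
For reference, those i= lines are consistent with a progress print like this (a sketch; the helper and the batching are assumptions, not the course's code):

import datetime

def load_batches(connection, batches, insert_tweets):
    for i, batch in enumerate(batches):
        insert_tweets(connection, batch)
        # one line per batch of tweets; printing inside the per-tweet loop
        # instead would produce the every-tweet output described above
        print(datetime.datetime.now(), 'insert_tweets i=', i)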