mikeizbicki / cmc-csci143

big data course materials

port not updated after changing it in both the yml file and the .sh file #505

Closed giffiecode closed 7 months ago

giffiecode commented 7 months ago

I've updated the port number to 23451 for the denormalized database in both the yml and the .sh file. However, when I run sh load_tweets_parallel.sh I still get an error message about connecting to port 54321, which is the port I used for the last hw.

psql: could not connect to server: Connection refused
    Is the server running on host "localhost" (127.0.0.1) and accepting
    TCP/IP connections on port 54321?
Command exited with non-zero status 10

I've run these commands to bring the container down and delete the volumes:

docker-compose down
docker rm -f $(docker ps -aq)
docker volume rm $(docker volume ls -q)
docker-compose build
docker-compose up -d
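
A quick sanity check (the file names are assumptions; adjust to the repo layout) is to grep both files for the old and new port, and then recreate the containers so the new mapping actually takes effect:

grep -nE '23451|54321' docker-compose.yml *.sh
# the ports: mapping only takes effect when the container is recreated,
# which docker-compose down followed by docker-compose up -d does
docker-compose up -d --force-recreate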

ains-arch commented 7 months ago

Is the container showing up when you run docker ps?

giffiecode commented 7 months ago

After bringing it down, no.

ains-arch commented 7 months ago

But like, after you bring it down, delete the volumes, rebuild, and bring it back up: if you then run docker ps, is it there?
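
If it is there, docker ps can also show which host ports it's actually publishing; something like this (the format string is just one option, and <container_name> is a placeholder for whatever name docker ps reports):

docker ps --format 'table {{.Names}}\t{{.Ports}}'
# or, for one specific container:
docker port <container_name>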

giffiecode commented 7 months ago

After build and up, yes.

AvidThinkerArsum commented 7 months ago

I have the same exact error. I'm pretty sure my ports match up between the compose.yml file and the .sh files.

ains-arch commented 7 months ago

@ypei23 Did you change the port in load_denormalized.sh?
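
Whatever command that script runs, the port is the number after localhost: in the connection URL, something like this (the credentials here are a guess based on the other commands in this thread):

# the host port (23451 here) has to match the left side of the ports: mapping in the compose file
psql postgresql://postgres:pass@localhost:23451/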

giffiecode commented 7 months ago

I have updated the port in load_denormalized.sh. Currently my denormalized load seems to work, but my normalized batch is printing out all the tweets in json format.

ains-arch commented 7 months ago

ERROR:  invalid input syntax for type json
DETAIL:  Token "mage_url" is invalid.
CONTEXT:  JSON data, line 1: ...d":false,"retweeted":false,"filter_level"mage_url...
COPY tweets_jsonb, line 1410717, column data: "{"created_at":"Tue Jan 05 12:55:30 +0000 2021","id":1346440186884812803,"id_str":"134644018688481280..."
ERROR:  invalid input syntax for type json
DETAIL:  Token "w" is invalid.
CONTEXT:  JSON data, line 1: ...sKite\/status\/1347836811540889601\/photo\/1",""w...
COPY tweets_jsonb, line 1404426, column data: "{"created_at":"Sat Jan 09 12:12:08 +0000 2021","id":1347878823619141632,"id_str":"134787882361914163..."
ERROR:  invalid input syntax for type json
DETAIL:  Expected ":", but found "}".
CONTEXT:  JSON data, line 1: ...rmal.jpg","profile_image_url_xzWT9v7.mp4?tag=10"}...
COPY tweets_jsonb, line 1422985, column data: "{"created_at":"Mon Jan 04 12:53:39 +0000 2021","id":1346077330704277505,"id_str":"134607733070427750..."

are you getting something like this?

and then after that, this:

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 1457?

(Background on this error at: https://sqlalche.me/e/14/e3q8)
Command exited with non-zero status 10
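
If you want to rule out the raw input data itself, a rough check like this (the file name is just a placeholder) will flag any lines that aren't valid JSON:

# point unzip at one of the actual input archives
unzip -p tweets_input.zip | python3 -c '
import json, sys
for i, line in enumerate(sys.stdin, 1):
    try:
        json.loads(line)
    except ValueError:
        print("not valid json on line", i)
'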

ains-arch commented 7 months ago

Update: I got those errors, and checked docker ps, and sure enough the container was not on the port I thought it was. I brought everything down and ran the rm -rf stuff he gave us, then rebuilt and brought them back up. When I checked docker ps again, the containers were listening on the ports I expected them to be.

Then I reran load_tweets_parallel.sh and got different errors, seemingly related to the urls column of the user table. If that's you, I'd double-check that the columns in the schema he gives us for this homework match the columns you used (and therefore reference in load_tweets_batch.py) in the previous homework and, if not, change your .py.
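
One way to compare is to look at the live schema directly; the port, credentials, and table name here are guesses, so swap in whatever your compose file and schema actually use:

psql postgresql://postgres:pass@localhost:7272/ -c '\d users'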

I think the error is gonna look like a huge mess regardless, because we're loading so much in at once and I think it'll show us all the SQL queries for the errors, so there's a lot to sift through. Personally I've been running load_tweets_parallel.sh in the background with nohup so that if I close my laptop it doesn't explode.

nohup ./load_tweets_parallel.sh > output.log 2>&1 &

I had to fiddle with the permissions a little to get it to work but it seems worth it. This has the added benefit of putting all the errors in an output.log file that I can then look at with vim rather than trying to read it all in terminal.
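
If you hit the same permissions problem, the usual fix is just making the script executable before the nohup line:

chmod +x load_tweets_parallel.sh
nohup ./load_tweets_parallel.sh > output.log 2>&1 &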

I hope something here is helpful; I'm also working through the homework right now, so idk if what I'm doing here is actually gonna work lol

giffiecode commented 7 months ago

I added a sizes constraint to my pg_normalized_batch command:

echo "$files" | time parallel --jobs 1 sizes=1 python3 -u load_tweets_batch.py --db=postgresql://postgres:pass@localhost:7272/ --inputs

but it's still printing a ton of tweets in json format.

ains-arch commented 7 months ago

Can I see a little of the json you said is printing? I think that's just what happens when the SQL commands throw errors but I'm not sure if we're looking at the same things.

I think that maybe these files are just too big, such that doing them one at a time is still gonna output errors that are hard to work with. Have you tried redirecting the output to a file and then looking at it?

Or if the error isn't huge you could post the whole thing here?