mikeizbicki / cmc-csci143

big data course materials
40 stars 76 forks source link

pg_normalized_batch failing test cases #518

Closed luisgomez214 closed 7 months ago

luisgomez214 commented 7 months ago

Hello,

I have made sure to follow the directions regarding bringing down and removing containers/volumes and bringing them back up. I have also made sure my ports match. When I run ' docker-compose exec pg_normalized_batch sh -c 'du -hd0 $PGDATA' ' I get 101 M. When I run 'time docker-compose exec pg_normalized_batch ./run_tests.sh sql.normalized_batch' I fail the test cases. Also, when I run 'sh load_tweets_parallel.sh' I get the output below, so I know there is an issue with my normalized_batch.

2024-04-21 18:31:10.148811 /data/tweets/geoTwitter21-01-07.zip
2024-04-21 18:31:30.993103 insert_tweets i= 0
Traceback (most recent call last):
  File "/home/Luis.Gomez.25/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1911, in _execute_context
    cursor, statement, parameters, context
  File "/home/Luis.Gomez.25/.local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UndefinedColumn: column "url" of relation "users" does not exist
LINE 1: ...NSERT INTO users (id_users,created_at,screen_name,url,listed...
                                                             ^

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "load_tweets_batch.py", line 458, in <module>
    insert_tweets(connection,tweets,args.batch_size)
  File "load_tweets_batch.py", line 175, in insert_tweets
    _insert_tweets(connection, tweet_batch)
  File "load_tweets_batch.py", line 397, in _insert_tweets
    bulk_insert(connection, 'users', users)
  File "load_tweets_batch.py", line 156, in bulk_insert
    res = connection.execute(sqlalchemy.sql.text(sql), binds)
  File "/home/Luis.Gomez.25/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1385, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/home/Luis.Gomez.25/.local/lib/python3.6/site-packages/sqlalchemy/sql/elements.py", line 335, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/home/Luis.Gomez.25/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1587, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/home/Luis.Gomez.25/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1954, in _execute_context
    e, statement, parameters, cursor, context
  File "/home/Luis.Gomez.25/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2135, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/home/Luis.Gomez.25/.local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/home/Luis.Gomez.25/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1911, in _execute_context
    cursor, statement, parameters, context
  File "/home/Luis.Gomez.25/.local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedColumn) column "url" of relation "users" does not exist
LINE 1: ...NSERT INTO users (id_users,created_at,screen_name,url,listed...

Has anyone faced this issue?

mikeizbicki commented 7 months ago

The important line of the error message is

psycopg2.errors.UndefinedColumn: column "url" of relation "users" does not exist
LINE 1: ...NSERT INTO users (id_users,created_at,screen_name,url,listed...

Here, we can see that your python code is inserting into a url column of the users table, but the sql schema I provided you does not contain a url column. I believe when I completed the twitter_postgres assignment I renamed the column to urls (with an s) instead of url. (In retrospect, I'm not sure why. url is a better name for the column than urls because there is only one url being stored in the column. But software projects have a way of building up lots of small inconsistencies like this, and part of being a good engineer is learning how to live with these minor bugs.)

To fix your problem, you could either:

  1. Modify your python code to use the urls column to be compatible with the schema, or
  2. Modify the schema to use the url column to be compatible with the python.

It won't matter which method you pick.