mikeizbicki / cmc-csci143

big data course materials
40 stars 76 forks

UndefinedColumn Error #495

Closed danzhechen closed 7 months ago

danzhechen commented 7 months ago

Hi there,

I am working on the Normalized Data (batched) part. I finished all the required edits, and I even deleted all the FOREIGN KEY constraints. But I keep hitting errors when I run this command to test:

time echo "$files" | parallel python3 -u load_tweets_batch.py --db "postgresql://postgres:pass@localhost:8399" --inputs $files

I have already assigned a value to $files in my environment, but I still keep getting two error messages.

NOTICE:  identifier "load_tweets.py --inputs data/geoTwitter21-01-01.zip data/geoTwitter21-01-02.zip data/geoTwitter21-01-03.zip data/geoTwitter21-01-04.zip data/geoTwitter21-01-05.zip data/geoTwitter21-01-06.zip data/geoTwitter21-01-07.zip data/geoTwitter21-01-08.zip data/geoTwitter21-01-09.zip data/geoTwitter21-01-10.zip data/geoTwitter21-01-07.zip" will be truncated to "load_tweets.py --inputs data/geoTwitter21-01-01.zip data/geoTwi"
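That NOTICE suggests each job is receiving the entire expanded `$files` list as arguments rather than one file per job. This is only a guess at the cause, but here is a self-contained sketch of the per-file fan-out that `parallel ... --inputs {}` would give (`echo` stands in for `python3`, and the file names are illustrative):

```shell
# Each line of $files should become the argument of one separate job,
# which is what parallel's {} placeholder does.
files='data/geoTwitter21-01-01.zip
data/geoTwitter21-01-02.zip'
printf '%s\n' "$files" | while read -r f; do
    echo "job: load_tweets_batch.py --inputs $f"
done
```

With GNU parallel, the equivalent would be `echo "$files" | parallel python3 -u load_tweets_batch.py --db "..." --inputs {}` instead of expanding `$files` a second time inside the command line.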

I think the bigger problem is this one:

Traceback (most recent call last):
  File "/home/Danzhe.Chen.24/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1911, in _execute_context
    cursor, statement, parameters, context
  File "/home/Danzhe.Chen.24/.local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UndefinedColumn: column "url" of relation "users" does not exist
LINE 1: ...count,created_at,description,screen_name,location,url,protec...
                                                             ^

Does anyone have an idea of what might be causing the problem?

abizermamnoon commented 7 months ago

In load_tweets_batch.py, you have not replaced all instances of id_urls with url. To find them in vim, I would suggest pressing ESC and then typing /id_urls.

In this case, I think you have not replaced id_urls with url in the users table portion of load_tweets_batch.py.
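As a self-contained illustration (the table layout below is made up, not the actual schema), grep pinpoints exactly where a missed rename hides:

```shell
# Simulate a schema with one missed rename, then locate it the same
# way you would in the real schema.sql or load_tweets_batch.py.
cat > /tmp/schema_demo.sql <<'EOF'
CREATE TABLE users (
    id_users BIGINT PRIMARY KEY,
    id_urls BIGINT,
    screen_name TEXT
);
EOF
grep -n 'id_urls' /tmp/schema_demo.sql
```

On the real files it would be `grep -n 'id_urls' load_tweets_batch.py schema.sql`; no output means nothing was missed.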

danzhechen commented 7 months ago

Hi Abizer @abizermamnoon ,

I tried your method, and I believe I made all the changes for id_urls. I also brought Docker down and up and cleaned the volume, but I still have the same issue.

mmendiratta27 commented 7 months ago

Hi @danzhechen, please let me know if you find a solution to this error, as I am encountering the same thing :(

gibsonfriedman commented 7 months ago

@danzhechen @mmendiratta27 I was getting this same error from a missed change in the load_tweets_batch.py file. I would definitely recommend going over that file as well as the schema.sql file as a small missed edit or typo could be causing that issue.

mikeizbicki commented 7 months ago

One possible (and common) cause of errors like this that hasn't been mentioned yet is that you've modified your schema.sql file, but you haven't deleted the volume and rebuilt the image properly. I would also double check those two steps.
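For concreteness, a typical rebuild sequence looks something like this (the volume name is an assumption and depends on your docker-compose.yml):

```
docker-compose down                      # stop and remove the containers
docker volume ls                         # find the volume backing postgres
docker volume rm <project>_postgres_data # delete it so schema.sql runs again
docker-compose build                     # rebuild the image with your edits
docker-compose up -d                     # start fresh with the new schema
```

The key point is that postgres only runs schema.sql on a fresh volume, so editing the file without deleting the volume and rebuilding has no effect on the running database.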

danzhechen commented 7 months ago

Thanks for everyone's help. I still have the same issue on the lambda server, but somehow the tests pass on GitHub. I checked everything: I stopped all the Docker containers, deleted the volume, and rebuilt the image. I do not know why it still fails; I think I will come to office hours about this issue.

mmendiratta27 commented 7 months ago

I have the same experience as @danzhechen. How should we fill out the time section, @mikeizbicki? I have run times for everything except the parallel run of pg_normalized_batch.

mikeizbicki commented 7 months ago

@mmendiratta27 I'm fairly confident this error is due to not correctly rebuilding your image. (It works on github because there is nothing to rebuild.) In order to record the timings, you will have to get it working on the lambda server. And if it's not working, you won't get credit for that part. I would be happy to help you figure out the problem after class/in office hours tomorrow.

danzhechen commented 7 months ago

Thanks for the help, @abizermamnoon. And Mike is right: my problem was that I had not deleted my images. Be sure to delete your images, delete the volume, and build the whole thing up again. I ran into a strange error saying conflict: unable to delete, so I force-deleted the image, and now it works.

vitorvavolizza commented 7 months ago

I ran `docker volume ls`, got all the volume names, and ran `docker volume rm volume_name1 volume_name2 volume_name3 ...` for all of them. Then I ran `docker-compose down`, `docker-compose up -d`, and `./load_tweets_parallel.sh`. That cleaned everything up for me.