Scrape data from Google Maps. Extracts data such as the name, address, phone number, website URL, rating, review count, latitude and longitude, reviews, email, and more for each place.
Running this repo against a PostgreSQL server on AWS RDS hangs after about 5 jobs have completed. Even before it hangs, the process is significantly slower (much more than expected) than with a PostgreSQL server on localhost.
While I can work around this limitation, it would be nice if the code could directly export to the remotely hosted database server.
========
Update: The original errors were due to a simple mistake in usage.
When working with a database, the -email flag should only be used when executing the jobs already populated in the gmaps_jobs table. Having the flag on when creating the jobs results in errors.
The correct usage is:
#Add the jobs to the queue in the database table: gmaps_jobs
go run main.go \
-dsn $DSN \
-produce \
-input example-queries.txt \
-lang en
#execute the jobs in the queue
go run main.go \
-c 3 \
-depth 3 \
-dsn $DSN \
-email
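The rule above (no -email together with -produce) can be checked before the scraper starts. The following is a minimal sketch, assuming bash; `check_flags` is a hypothetical helper and not part of the repo, and it only looks at the two flag names mentioned in this issue.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not part of the repo): reject an argument list that
# contains both -produce and -email, since that combination caused the
# errors described in this issue.
check_flags() {
  local produce=0 email=0 arg
  for arg in "$@"; do
    case "$arg" in
      -produce) produce=1 ;;
      -email)   email=1 ;;
    esac
  done
  if [ "$produce" -eq 1 ] && [ "$email" -eq 1 ]; then
    echo "error: -email must not be combined with -produce" >&2
    return 1
  fi
  return 0
}

# Intended use inside a wrapper script (assumption):
#   check_flags "$@" && go run main.go "$@"
```

With this guard, the two correct invocations above pass unchanged, while a produce run that also carries -email is stopped before any jobs are written to gmaps_jobs.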
Everything below this is now irrelevant to the issue.
When trying to use this repo in conjunction with AWS RDS to host a PostgreSQL server, I encounter some errors.
To set up the RDS database, the "gmaps_jobs" table was created using the create_tables.up.sql script,
and I manually created the "results" table with 2 columns:
id : integer : primary_key & not_null
data : jsonb : not_null
~~
Running the following code queues the jobs and fills the gmaps_jobs table as expected.
export DSN="postgres://postgres:postgres@[aws-endpoint]:5432/postgres"
#Add the jobs to the queue in the database table: gmaps_jobs
go run main.go \
-dsn $DSN \
-produce \
-input example-queries.txt \
-email
However, when running the 2nd part,
#execute the jobs in the queue
go run main.go \
-c 3 \
-depth 3 \
-dsn $DSN
there are a lot of lines in the logging which state:
{"level":"error","component":"scrapemate","error":"invalid job type: while pushing jobs","time":"2024-02-12T01:05:50.189031649Z","message":"error while finishing job"}
Then the script exits with one of two errors:
Either
ERROR: null value in column "id" of relation "results" violates not-null constraint (SQLSTATE 23502)
(as a 3rd case, sometimes the script just hangs after one of the above "invalid job type" errors)
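A note on the not-null error: in PostgreSQL, a plain integer column declared NOT NULL has no default, so any INSERT that omits it fails with SQLSTATE 23502. Whether the scraper's insert omits "id" is an assumption here, not something confirmed from the code. If it does, a manually created "results" table would need an auto-generated id. A minimal sketch, where `results_table_sql` is a hypothetical helper that only prints the statement:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): print a CREATE TABLE statement
# for "results" in which "id" is auto-generated (SERIAL) instead of a plain
# integer, so inserts that omit "id" do not hit the not-null constraint.
results_table_sql() {
  cat <<'SQL'
CREATE TABLE IF NOT EXISTS results (
    id SERIAL PRIMARY KEY,
    data JSONB NOT NULL
);
SQL
}

# Intended use (assumption: psql is installed and $DSN is set as above):
#   results_table_sql | psql "$DSN"
```

This keeps the same two columns as the manually created table; only the id definition changes.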
Do you have any suggestions as to what is causing this?
========================
Edit:
When running the code locally against a PostgreSQL server on localhost, the code completes successfully, but the logs show a LOT of the gmaps jobs failing.