
Geoexplorer Data

This repository includes all logic around the data needed for the GeoExplorer (https://github.com/technologiestiftung/odis-geoexplorer) - an AI-driven search application for Berlin's geo data. It contains:

Scraper

The scraper (located in the scraper folder) collects all WFS- and WMS-related metadata from Berlin's Open Data Portal and Berlin's Geo Data Portal (FIS-Broker) and writes a markdown file (.mdx) for each dataset. The scraper runs in multiple steps, which you can enable or disable in index.js by commenting them in or out (see the sketch below).
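
For illustration, index.js might look roughly like this (a minimal sketch; the actual step names and function signatures in this repo may differ):

```js
// index.js - orchestrates the scraping steps.
// Comment a step out to skip it on the next run.
// NOTE: the step names below are illustrative, not this repo's actual exports.
import { fetchDatasetList } from './steps/fetchDatasetList.js'
import { fetchWfsWmsMetadata } from './steps/fetchWfsWmsMetadata.js'
import { writeMdxFiles } from './steps/writeMdxFiles.js'

async function main() {
  const datasets = await fetchDatasetList() // query the Open Data Portal API
  const enriched = await fetchWfsWmsMetadata(datasets) // add WFS/WMS capabilities from FIS-Broker
  await writeMdxFiles(enriched) // write one .mdx file per dataset
}

main().catch(console.error)
```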

Before running the scraper you will need Node.js (which includes npm). Install the dependencies:

npm i

Run the scraper like so:

npm run scrape

Or if you want to update the data:

npm run scrape:update
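
Each generated .mdx file holds the metadata for one dataset. A hypothetical example (the field names and values are illustrative, not the scraper's exact output):

```mdx
---
title: "Radverkehrsanlagen Berlin"
portalUrl: "https://daten.berlin.de/..."
serviceType: "WFS"
---

Description of the dataset as scraped from the portal ...
```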

Setting up a Supabase DB and creating embeddings

1. Set up a local Supabase DB (optional)

The initialization of the database, including the setup of the pgvector extension, is stored in the supabase/migrations folder. These migrations are applied automatically to your local Postgres instance when you run npx supabase start.

Make sure you have Docker installed and running locally. Then run

npx supabase start

This will set up a local Supabase DB for you.
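
For reference, a migration enabling pgvector might look like this (an illustrative sketch; the actual migration in supabase/migrations, including the table name and vector size, may differ):

```sql
-- Enable pgvector and create a table for the embedded sections.
create extension if not exists vector;

create table nods_page_section (
  id bigserial primary key,
  content text,
  embedding vector(1536) -- text-embedding-ada-002 produces 1536-dimensional vectors
);
```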

2. Provide connection details

Duplicate the .env.example file and rename it to .env. Then provide either your local connection details or those of your hosted Supabase project, depending on where you want to save your data. You can print the local connection details with:

npx supabase status

You will also need to provide an API key for the OpenAI API.
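
A filled-in .env might then look like this (the variable names are assumptions based on typical Supabase/OpenAI setups; check .env.example for the names this repo actually expects):

```sh
# Supabase connection (values from `npx supabase status` for a local DB)
SUPABASE_URL=http://localhost:54321
SUPABASE_SERVICE_ROLE_KEY=your-service-role-key

# OpenAI API key used to create the embeddings
OPENAI_API_KEY=sk-...
```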

3. Generate embeddings

This script requests an embedding for each markdown file created earlier. The embedding will then be written to your Supabase DB. To run the script:

npm run embeddings

Note: Make sure Supabase is running. To check, run npx supabase status. If it is not running, run npx supabase start.
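
Conceptually, the embeddings script does something like the following (a minimal sketch assuming the openai and @supabase/supabase-js packages; the file paths, model, and table name are assumptions, not this repo's exact code):

```js
import fs from 'node:fs/promises'
import OpenAI from 'openai'
import { createClient } from '@supabase/supabase-js'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_SERVICE_ROLE_KEY)

// Read every .mdx file, request an embedding for it, and store the vector.
const files = (await fs.readdir('./data')).filter((f) => f.endsWith('.mdx'))
for (const file of files) {
  const content = await fs.readFile(`./data/${file}`, 'utf8')
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: content,
  })
  await supabase.from('nods_page_section').insert({
    content,
    embedding: data[0].embedding, // stored in a pgvector column
  })
}
```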

4. Link your local development project to a hosted Supabase project (optional)

You can do this like so (your data will not be uploaded):

npx supabase login
npx supabase link --project-ref <your-project-ref>
npx supabase db push

The link command will prompt you for your database password (or read it from the SUPABASE_DB_PASSWORD environment variable).

Running a Jupyter notebook to analyze and export the embeddings

Go to the graphical interface of your Supabase DB (e.g., http://localhost:54323/project/default/editor) and export the rows of the nods_page_section table as a .csv file. Save the file in the createGraph folder. Then install Jupyter Notebook via pip if you haven't installed it yet:

pip install notebook

Run the notebook like so:

npm run embedgraph

This will open a new window in your browser.

You can also access the notebook directly via http://localhost:8888/notebooks/embeds.ipynb.

Run the notebook. It reduces the high-dimensional embedding vectors to two dimensions (via t-SNE) and shows them as a scatterplot.

At the bottom of the notebook, you will find a link called tsne_data.csv. This allows you to download the 2D coordinates together with the dataset titles. The data is used to update the scatterplot displayed in the GeoExplorer.

The notebook script is based on OpenAI's guides.

Contributing

Before you create a pull request, please open an issue so we can discuss your changes.

Contributors

Thanks goes to these wonderful people (emoji key):

Hans Hack: 💻 🖋 🔣 📖 📆
alsino: 💻

This project follows the all-contributors specification. Contributions of any kind welcome!

Content Licensing

Texts and content are available under a CC BY license.

Credits

A project by Technologiestiftung Berlin.