data identification/consolidation storyboard

This issue proposes a storyboard to help attendees understand where they might find data and what they need to consider before using it for their GenAI MVP. They'll need to consider data sources, how to acquire and transform the data in a way that supports training goals, and which kinds of ML models are required to obtain their desired outcomes.

Assumption: data lives in different databases across the firm and will need to be identified, consolidated, cleaned, and reshaped to suit training purposes.
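To make the consolidate/clean step concrete, here is a minimal sketch in plain Python. The row shape (`person_id`, `skill`), the `SkillRecord` type, and the sample rows are all hypothetical stand-ins for whatever the firm's databases actually hold; a real process would need rules agreed with the data owners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillRecord:
    """One cleaned (person, skill) pair. Hypothetical shape for illustration."""
    person_id: str
    skill: str


def normalize_skill(raw: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(raw.strip().lower().split())


def consolidate(*sources: list[dict]) -> list[SkillRecord]:
    """Merge rows from several sources, dropping incomplete rows and duplicates."""
    seen: set[SkillRecord] = set()
    merged: list[SkillRecord] = []
    for rows in sources:
        for row in rows:
            skill = normalize_skill(row.get("skill") or "")
            person = str(row.get("person_id") or "").strip()
            if not skill or not person:
                continue  # skip incomplete rows rather than guess values
            record = SkillRecord(person, skill)
            if record not in seen:
                seen.add(record)
                merged.append(record)
    return merged


# Hypothetical rows standing in for two separate firm databases.
hr_rows = [
    {"person_id": "p1", "skill": " Python "},
    {"person_id": "p2", "skill": "SQL"},
]
project_rows = [
    {"person_id": "p1", "skill": "python"},  # duplicate once normalized
    {"person_id": "p3", "skill": None},      # incomplete, dropped
]

records = consolidate(hr_rows, project_rows)
for record in records:
    print(record.person_id, record.skill)
```

Keeping normalization in one small function makes it easier to replace the quick hack with a production-quality rule set later.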

Talking points:

identify sources of data in the firm and build partnerships with project/data owners
get safe (e.g. read-only) access to the data (follow compliance/PII rules, avoid impacting operations)
make sure you have enough data: some for training, some held back for testing
cleaning is the hard part: it can be hacked initially, but expect to spend time developing a production-quality process later
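As a rough illustration of the training/testing talking point, the sketch below holds back a fraction of records for testing. The 80/20 ratio, the fixed seed, and the `train_test_split` helper name are assumptions made for this example, not requirements from the storyboard.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split repeatable
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical records; in practice these would be cleaned rows.
rows = [f"record-{i}" for i in range(10)]
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

A fixed seed is a convenience for demos: the same rows land in the test set on every run, which makes results easier to compare.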

Proposed problem statement: As a team member building an MVP for the job recommender application, I need to identify a high-level workflow that compares a person's set of skills to the sets of skills attached to open roles in HR and identifies the 5 closest candidates, so I can ensure the best alignment of skills and roles inside the firm.
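A simple baseline for "5 closest candidates" could be set overlap between skills, before any embeddings are involved. The sketch below uses Jaccard similarity, which is one possible measure chosen for illustration; the candidate names and skills are made up.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_candidates(role_skills: set[str], candidates: dict[str, set[str]], k: int = 5):
    """Return up to k (name, score) pairs, best match first, ties broken by name."""
    scored = [(jaccard(role_skills, skills), name) for name, skills in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:k]]


# Hypothetical open role and candidate pool.
role = {"python", "sql", "ml"}
pool = {
    "ana": {"python", "sql", "ml"},
    "ben": {"python", "sql"},
    "cy": {"java"},
    "dee": {"python", "ml", "docker"},
    "eli": {"sql"},
    "fay": {"print layout"},
}

top5 = closest_candidates(role, pool)
for name, score in top5:
    print(f"{name}: {score:.2f}")
```

Exact set overlap only finds literal matches, so it will miss the "hidden" or adjacent skills the storyboard is interested in. That gap is the motivation for the embedding-based approach.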

This can take several forms:

initiated by the candidate looking for their next career challenge in the firm
triggered by reorg activities shifting valuable resources from one part of the firm to another part
triggered by innovation cycles looking to pull hidden valuable skills to participate in new projects that require "new" skills (e.g. web v.01 drew from print/layout talent pool)

One possible approach:

data is pulled from external sample database(s) and consolidated in a staging database (e.g. move sample data from GCP Postgres / CockroachDB Cloud to BigQuery)
feed data from BigQuery to Vertex AI to create embeddings and store them in a vector database
query the vector database to identify unexpected matches
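The embed, store, and query steps above can be sketched locally without any cloud services. Everything here is a stand-in: `toy_embed` is a hypothetical hashed bag-of-words function used in place of a real embedding model, and `InMemoryVectorStore` is a hypothetical class used in place of a real vector database. Neither reflects the BigQuery or Vertex APIs.

```python
import hashlib
import math

DIM = 64  # arbitrary toy dimension; real embedding models use far more


def toy_embed(text: str) -> list[float]:
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # md5 gives a hash that is stable across runs, unlike built-in hash()
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Stand-in for a vector database: brute-force cosine similarity search."""

    def __init__(self):
        self._items: list[tuple[str, list[float]]] = []

    def add(self, key: str, text: str) -> None:
        self._items.append((key, toy_embed(text)))

    def query(self, text: str, k: int = 5) -> list[tuple[str, float]]:
        q = toy_embed(text)
        # vectors are unit length, so the dot product is the cosine similarity
        scored = [(sum(a * b for a, b in zip(q, v)), key) for key, v in self._items]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(key, score) for score, key in scored[:k]]


# Hypothetical skill profiles, as they might look after staging and cleaning.
store = InMemoryVectorStore()
store.add("p1", "python sql machine learning")
store.add("p2", "print layout typography")

for key, score in store.query("python machine learning"):
    print(key, round(score, 3))
```

The toy embedding only matches shared tokens, so it cannot surface the unexpected matches the approach is after. A learned embedding model is what would place related skills (for example print layout and early web design) near each other.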

Resources:

irvnet / job-rec-app

data identification/consolidation storyboard #9