josephlei / ca-jobs-schema

Exploring ALL California Job Classification Schemas

Interesting project #2

Open correlator opened 8 years ago

correlator commented 8 years ago

Hi Joseph, Vyki Englert sent me over here to check this project out and it seems pretty awesome/ambitious. I had a bit of trouble understanding the state of affairs and where you are and where you're going.

If I understand correctly, all jobs are currently posted online in various formats across the web. You aim to scrape all of that data and put it into a system that can be searched / analyzed.

I would be happy to help a bit with this but could use a little clarification on the tasks. If I had to guess, I would say:

  1. Find all sources of state jobs online.
  2. Figure out how to access that data and classify as structured, semi-structured, or unstructured.
  3. Build a schema that is the target of how all jobs should be represented.
  4. Implement schema in Neo4j or some other NoSQL db.
  5. Build ETLs for all the different job data sources by classification.
  6. Migrate data over.
  7. Build API for data for search and exposing the data to people who want to analyze the data.

Is this similar to what you have in mind? Where in the process are we currently? I would be interested in helping with 3, 4, 5, and 7. Happy to work, rubber ducky or help in any small way I can.

josephlei commented 8 years ago

Hi, thank you for peeking in. @vykster is an awesome ally and colleague, and I look forward to working with you on this as well.

To clarify, we are working with job classification specifications, which are all posted online in one place, in the same or similar formats, as html/aspx documents (http://calhr.ca.gov/state-hr-professionals/Pages/job-descriptions.aspx). There is one page for each class specification.

The end goal is to use xpath/css selectors or another method to pull the relevant values out of these documents and store them in a data structure that can then be further analyzed, linked and API-ified.
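A minimal sketch of that extraction step, using only the Python standard library. The sample markup and its element ids are hypothetical stand-ins, not the real CalHR page structure, and `extract_fields` and `FIELD_XPATHS` are illustrative names. `xml.etree.ElementTree` only handles well-formed markup and a limited XPath subset, so the real cached pages would probably need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one class specification page.
# The element ids are assumptions, not the real CalHR markup.
SAMPLE_PAGE = """
<html>
  <body>
    <h1 id="class-title">Associate Governmental Program Analyst</h1>
    <span id="class-code">5393</span>
    <div id="definition">Example definition text.</div>
  </body>
</html>
"""

# Field name -> XPath expression (ElementTree supports a limited XPath subset).
FIELD_XPATHS = {
    "title": ".//h1[@id='class-title']",
    "class_code": ".//span[@id='class-code']",
    "definition": ".//div[@id='definition']",
}


def extract_fields(page_source, field_xpaths=FIELD_XPATHS):
    """Return a dict of field name -> stripped text, or None when not found."""
    root = ET.fromstring(page_source)
    record = {}
    for name, xpath in field_xpaths.items():
        node = root.find(xpath)
        if node is None:
            record[name] = None
        else:
            # itertext() also collects text nested inside child elements.
            record[name] = "".join(node.itertext()).strip()
    return record


record = extract_fields(SAMPLE_PAGE)
print(record)
```

Keeping the selectors in one mapping means that adding a field is a one-line change, and a missing element shows up as `None` rather than an exception.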

Examples of what we want to pull from one class such as Associate Governmental Program Analyst (AGPA) class code 5393 (http://calhr.ca.gov/state-hr-professionals/pages/5393.aspx) might be:

The tricky part is when we get to sections like "Minimum Qualifications", because there are multiple ways to meet the MQs for a class. In this example it is:

If not in a DB, I imagine this lends itself to a hierarchical structure like json/xml, but I trust others know better than I do what would be appropriate.
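One possible shape for such a hierarchical record, sketched as a Python dict that round-trips through JSON. The key names (`any_of`, `all_of`, `path`) are assumptions for illustration, and the requirement strings are placeholders rather than the actual AGPA qualifications. The idea is that a class is met by satisfying any one path, and a path is met by satisfying all of its requirements.

```python
import json

# Hypothetical record layout; requirement strings are placeholders only.
class_spec = {
    "class_code": "5393",
    "title": "Associate Governmental Program Analyst",
    "minimum_qualifications": {
        # A candidate qualifies by meeting ANY one of these paths...
        "any_of": [
            {
                "path": "path_1",
                # ...and a path is met only when ALL of its requirements are met.
                "all_of": ["<placeholder: experience requirement>"],
            },
            {
                "path": "path_2",
                "all_of": [
                    "<placeholder: experience requirement>",
                    "<placeholder: education requirement>",
                ],
            },
        ]
    },
}


def count_qualifying_paths(spec):
    """Number of alternative ways to meet the minimum qualifications."""
    return len(spec["minimum_qualifications"]["any_of"])


# Round-trip through JSON to confirm the structure serializes cleanly.
serialized = json.dumps(class_spec, indent=2)
restored = json.loads(serialized)
print(serialized)
```

The same nesting would map onto a graph or document database later, since each path and requirement is already its own object.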

The great news is, I've already done items 1 and 2 (taking the master list of class codes, retrieving and caching all the data), and items 3, 4, 5 and 7 are exactly what we need assistance with!

The cached files are located here. They have html extensions because that's what I specified when I used urllib to fetch them, but they originally live at .aspx endpoints. It doesn't really affect the content at all; just an FYI I noticed after the fact.
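For anyone wanting to reproduce the cache, a rough sketch of a fetch-and-cache step with urllib follows. It is not the script that was actually used: the URL template is inferred from the single AGPA example in this thread and is assumed to hold for other class codes, and `spec_url`, `download` and `fetch_and_cache` are hypothetical names. The demo uses a stand-in fetcher so it runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Pattern inferred from the AGPA example URL; assumed to hold for other codes.
SPEC_URL = "http://calhr.ca.gov/state-hr-professionals/pages/{code}.aspx"


def spec_url(class_code):
    """Build the .aspx URL for one class code."""
    return SPEC_URL.format(code=class_code)


def download(url):
    """Fetch raw bytes over HTTP (not called in the offline demo below)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def fetch_and_cache(class_code, cache_dir, fetch=download):
    """Return page bytes, fetching and saving as <code>.html on first use."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{class_code}.html"
    if not target.exists():
        target.write_bytes(fetch(spec_url(class_code)))
    return target.read_bytes()


# Offline demo: a stand-in fetcher records which URLs were requested.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"<html>stub</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_and_cache("5393", tmp, fetch=fake_fetch)
    second = fetch_and_cache("5393", tmp, fetch=fake_fetch)  # read from cache
```

Passing the fetcher in as a parameter keeps the caching logic testable and makes it easy to add delays or retries around the real download later.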

I'll be out of the country in Asia for the next month but will check in periodically. A huge thank you for your interest. I look forward to what we can do together to demonstrate how powerful and effective civic citizens can be.