Interesting project - GitHub issues

Hi Joseph, Vyki Englert sent me over here to check this project out, and it seems pretty awesome and ambitious. I had a bit of trouble understanding the current state of the project: where you are now and where you're going.

If I understand correctly, all jobs are currently posted online in various formats across the web. You aim to scrape all of that data and put it into a system that can be searched and analyzed.

I would be happy to help a bit with this but could use a little clarification on the tasks. If I had to guess, I would say:

1. Find all sources of state jobs online.
2. Figure out how to access that data and classify it as structured, semi-structured, or unstructured.
3. Build a schema that is the target for how all jobs should be represented.
4. Implement the schema in Neo4j or some other NoSQL DB.
5. Build ETLs for all the different job data sources by classification.
6. Migrate the data over.
7. Build an API for search and for exposing the data to people who want to analyze it.

Is this similar to what you have in mind? Where in the process are we currently? I would be interested in helping with 3, 4, 5, and 7. Happy to work, rubber duck, or help in any small way I can.

Hi, thank you for peeking in. @vykster is an awesome ally and colleague, and I look forward to working with you on this as well.

To clarify, we are working with job classification specifications, which are all posted online in one place, in the same or similar format, as html/aspx documents (http://calhr.ca.gov/state-hr-professionals/Pages/job-descriptions.aspx). There is one page for each class specification.

The end goal is to use XPath/CSS selectors or another method to pull the relevant values out of these documents and store them in a data structure that can then be further analyzed, linked, and API-ified.

Examples of what we want to pull from one class, such as Associate Governmental Program Analyst (AGPA), class code 5393 (http://calhr.ca.gov/state-hr-professionals/pages/5393.aspx), might be:

Schematic code (KEY) with a (VALUE) of JY35
Definition (KEY) with a (VALUE) of the string:
- Under direction, incumbents perform the more responsible, varied, and complex technical analytical staff services assignments such as program evaluation and planning; policy analysis and formulation; systems development; budgeting, planning, management, and personnel analysis; and continually provide consultative services to management or others. This is the full journey level analyst class. Incumbents are typically subject-matter generalists who have demonstrated possession of intellectual abilities, the management tools, and the personal qualifications to succeed in a variety of general staff services settings.
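As a rough illustration of the key/value extraction above, here is a minimal sketch using only the Python standard library. The sample HTML, the "Schematic Code:" label format, and the names `TextExtractor` and `extract_fields` are assumptions made for the example. The real class specification pages may be laid out differently, so the matching rules would need adjusting against the cached files.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty visible text chunks of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_fields(html):
    """Return a dict of the fields found in one class specification page.

    Assumed layout (hypothetical): the schematic code appears as
    "Schematic Code: XXXX", and the definition is the text chunk that
    immediately follows a "Definition" heading.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()

    fields = {}
    text = "\n".join(parser.chunks)
    match = re.search(r"Schematic Code:\s*(\w+)", text, re.IGNORECASE)
    if match:
        fields["schematic_code"] = match.group(1)

    for i, chunk in enumerate(parser.chunks):
        if chunk.lower() == "definition" and i + 1 < len(parser.chunks):
            fields["definition"] = parser.chunks[i + 1]
            break
    return fields


# Hypothetical sample; not a copy of the real 5393.aspx markup.
SAMPLE_HTML = """
<html><body>
<p>Schematic Code: JY35</p>
<h2>Definition</h2>
<p>This is the full journey level analyst class.</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE_HTML))
```

A dedicated parser with XPath or CSS selector support would likely be more robust than text matching once the actual page structure is known.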

The tricky part is when we get to sections like "Minimum Qualifications", because there are multiple ways to meet the MQs for a class. In this example it is:

EDUCATION AND
- EXPERIENCE PATH 1 OR
- EXPERIENCE PATH 2

If not in a DB, I imagine this lends itself to a hierarchical structure like JSON/XML, but I trust others know better than I do what would be appropriate.
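To make the hierarchical idea concrete, one possible shape is a nested AND/OR tree that serializes directly to JSON. This is only a sketch: the keys `all_of`, `any_of`, and `requirement`, and the helper `is_met`, are made-up names for illustration, not a proposed final schema.

```python
import json

# EDUCATION AND (EXPERIENCE PATH 1 OR EXPERIENCE PATH 2)
minimum_qualifications = {
    "all_of": [
        {"requirement": "EDUCATION"},
        {
            "any_of": [
                {"requirement": "EXPERIENCE PATH 1"},
                {"requirement": "EXPERIENCE PATH 2"},
            ]
        },
    ]
}


def is_met(node, satisfied):
    """Return True if the set of satisfied requirement names meets the node."""
    if "all_of" in node:
        return all(is_met(child, satisfied) for child in node["all_of"])
    if "any_of" in node:
        return any(is_met(child, satisfied) for child in node["any_of"])
    return node["requirement"] in satisfied


if __name__ == "__main__":
    # The structure is plain dicts and lists, so it serializes as-is.
    print(json.dumps(minimum_qualifications, indent=2))
    print(is_met(minimum_qualifications, {"EDUCATION", "EXPERIENCE PATH 2"}))
```

The same tree would also map onto a graph database, with each requirement as a node and the AND/OR groupings as intermediate nodes.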

The great news is, I've already done items 1 and 2 (taking the master list of class codes, then retrieving and caching all the data), and items 3, 4, 5, and 7 are exactly what we need assistance with!

The cached files are located here. They have .html extensions because that's what I specified when I used urllib to fetch them, but they are originally at .aspx endpoints. This doesn't really affect the content at all; it's just an FYI that I noticed after the fact.
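For anyone who wants to refresh the cache, a fetch-and-cache step along these lines could look roughly like the sketch below. It is not necessarily the script that produced the cached files. It assumes every class code follows the same URL pattern as the 5393 example, and the names `BASE_URL`, `fetch_and_cache`, and the optional `fetch` parameter are hypothetical.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumption: every class code uses the same pattern as the 5393 example.
BASE_URL = "http://calhr.ca.gov/state-hr-professionals/pages/{code}.aspx"


def fetch_and_cache(class_code, cache_dir, fetch=None):
    """Download one class specification page unless it is already cached.

    `fetch` is an optional callable taking a URL and returning bytes; it
    exists so the function can be exercised without network access.
    Returns the path of the cached file.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Saved with an .html extension, matching the existing cached files.
    target = cache_dir / f"{class_code}.html"
    if target.exists():
        return target

    url = BASE_URL.format(code=class_code)
    if fetch is None:
        with urlopen(url) as response:
            body = response.read()
    else:
        body = fetch(url)

    target.write_bytes(body)
    return target
```

Calling `fetch_and_cache("5393", "cache")` would write `cache/5393.html` on the first call and reuse it afterwards.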

I'll be out of the country, in Asia, for the next month but will check in periodically. A huge thank you for your interest. I look forward to what we can do together to demonstrate how powerful and effective civic citizens can be.

josephlei / ca-jobs-schema

Interesting project #2