Open correlator opened 8 years ago
Hi, thank you for peeking in. @vykster is an awesome ally and colleague, and I look forward to working with you on this as well.
To clarify, we are working with job classification specifications, which are all posted online in one place, in similar or identical formats, as HTML/ASPX documents (http://calhr.ca.gov/state-hr-professionals/Pages/job-descriptions.aspx). There is one page for each class specification.
The end goal is to use XPath/CSS selectors or another method to pull the relevant values out of these documents and store them in a data structure that can then be further analyzed, linked, and API-ified.
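To make the extraction step concrete, here is a minimal sketch of pulling heading/paragraph pairs out of a page. It uses only Python's standard-library `html.parser` rather than XPath/CSS selectors (those would need a third-party library), and the sample markup, the `SectionParser` class, and the `extract_sections` helper are all assumptions for illustration. The real class specification pages may be laid out quite differently.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real class specification pages may differ.
SAMPLE_HTML = """
<div id="content">
  <h2>Definition</h2>
  <p>Placeholder definition text.</p>
  <h2>Minimum Qualifications</h2>
  <p>Placeholder for the first qualifying pattern.</p>
  <p>Placeholder for the second qualifying pattern.</p>
</div>
"""


class SectionParser(HTMLParser):
    """Group paragraph text under the most recent <h2> heading."""

    def __init__(self):
        super().__init__()
        self.sections = {}     # heading text -> list of paragraph strings
        self._tag = None       # tag currently being captured ("h2" or "p")
        self._heading = None   # most recent heading seen
        self._buffer = []      # text fragments of the element being captured

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        # Collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._buffer).split())
        if tag == "h2":
            self._heading = text
            self.sections[text] = []
        elif self._heading is not None and text:
            self.sections[self._heading].append(text)
        self._tag = None


def extract_sections(html):
    """Return a dict mapping each heading to its list of paragraphs."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections


sections = extract_sections(SAMPLE_HTML)
```

Keeping each section as a list of paragraphs, instead of one joined string, leaves room for sections that contain several alternatives.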
Examples of what we want to pull from one class such as Associate Governmental Program Analyst (AGPA) class code 5393 (http://calhr.ca.gov/state-hr-professionals/pages/5393.aspx) might be:
The tricky part comes with sections like "Minimum Qualifications" (MQs), because there are multiple ways to meet the MQs for a class. In this example it is:
If not in a DB, I imagine this lends itself to a hierarchical structure like JSON/XML, but I trust others know better than I do what would be appropriate.
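As one possible shape for that hierarchy, here is a rough sketch that models a class specification with nested dataclasses and serializes it to JSON. The field names, the `QualificationPattern` and `ClassSpec` types, and the "Pattern 1"/"Pattern 2" labels are hypothetical, and the requirement strings are placeholders, not the actual AGPA qualifications.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class QualificationPattern:
    """One of several alternative ways to meet the minimum qualifications."""
    label: str                                   # hypothetical label
    requirements: List[str] = field(default_factory=list)


@dataclass
class ClassSpec:
    class_code: str
    title: str
    url: str
    minimum_qualifications: List[QualificationPattern] = field(default_factory=list)


spec = ClassSpec(
    class_code="5393",
    title="Associate Governmental Program Analyst",
    url="http://calhr.ca.gov/state-hr-professionals/pages/5393.aspx",
    minimum_qualifications=[
        # Placeholder text only; the real requirements are not reproduced here.
        QualificationPattern("Pattern 1", ["<placeholder requirement>"]),
        QualificationPattern(
            "Pattern 2",
            ["<placeholder requirement>", "<placeholder requirement>"],
        ),
    ],
)

# asdict() recurses into the nested dataclasses, so the result is plain JSON.
spec_json = json.dumps(asdict(spec), indent=2)
print(spec_json)
```

Modeling the MQs as a list of alternatives, each with its own list of requirements, keeps the "either/or" structure explicit, and the same nesting would map onto XML or onto a pair of related database tables.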
The great news is, I've already done items 1 and 2 (taking the master list of class codes, then retrieving and caching all the data), and items 3, 4, 5, and 7 are exactly what we need assistance with!
The cached files are located here. They have .html extensions because that's what I specified when I used urllib to fetch them, but they originally come from .aspx endpoints. This doesn't affect the content at all; it's just an FYI that I noticed after the fact.
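For anyone who wants to reproduce the caching step, a minimal sketch might look like the following. The URL pattern is inferred from the single 5393 example above and is an assumption, as are the `cache` directory name and the helper names `spec_url`, `cache_path`, and `fetch_spec`. This is not necessarily how the existing cache was built.

```python
import urllib.request
from pathlib import Path

# Assumed URL pattern, inferred from the class code 5393 example.
BASE_URL = "http://calhr.ca.gov/state-hr-professionals/pages/{code}.aspx"


def spec_url(class_code):
    """Build the page URL for a class code."""
    return BASE_URL.format(code=class_code)


def cache_path(class_code, cache_dir="cache"):
    """Local file for a class code, saved with an .html extension."""
    return Path(cache_dir) / f"{class_code}.html"


def fetch_spec(class_code, cache_dir="cache", opener=urllib.request.urlopen):
    """Return the page bytes, downloading only if not already cached."""
    path = cache_path(class_code, cache_dir)
    if path.exists():
        return path.read_bytes()
    with opener(spec_url(class_code)) as response:
        body = response.read()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return body
```

Calling `fetch_spec("5393")` would hit the network once and read from disk afterwards. The `opener` parameter exists only so the function can be exercised without a network connection.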
I'll be out of the country, in Asia, for the next month but will check in periodically. A huge thank you for your interest. I look forward to what we can do together and to demonstrating how powerful and effective civic citizens can be.
Hi Joseph, Vyki Englert sent me over here to check this project out and it seems pretty awesome/ambitious. I had a bit of trouble understanding the state of affairs and where you are and where you're going.
If I understand correctly, all jobs are currently posted online in various formats across the web. You aim to scrape all of that data and put it into a system that can be searched and analyzed.
I would be happy to help a bit with this, but I could use a little clarification on the tasks. If I had to guess, I would say:
Is this similar to what you have in mind? Where in the process are we currently? I would be interested in helping with 3, 4, 5, and 7. Happy to work, rubber ducky or help in any small way I can.