konklone / jss

JSON Simple Syndication -- RSS rethought for JSON
https://github.com/konklone/jss/issues/10

Code and Laws #3

Open vdavez opened 10 years ago

vdavez commented 10 years ago

Taking Scout, CourtListener, and OpenLIMS seriously... are there any changes we'd envision to the spec?

adelevie commented 10 years ago

Let's lay out the specs here. Can you dig up the guide for CourtListener scrapers? (or just tell me what to google :) )

vdavez commented 10 years ago

Here's the template for the CourtListener scraper

vdavez commented 10 years ago

And here's the engine that makes it tick

adelevie commented 10 years ago

And here's the page of conforming scrapers: https://bitbucket.org/mlissner/juriscraper/src/c7faed035749a5428583535d61d64129ef49806c/opinions/united_states/state/?at=default (yours: https://bitbucket.org/mlissner/juriscraper/src/c7faed035749a5428583535d61d64129ef49806c/opinions/united_states/state/dc.py?at=default)

adelevie commented 10 years ago

"""Scraper for the D.C. Court of Appeals
CourtID: dc
Court Short Name: D.C.
Author: V. David Zvenyach
Date created: 2014-02-21
"""

import time
from datetime import date
from datetime import datetime
from lxml import html

from juriscraper.GenericSite import GenericSite
# from juriscraper.lib.string_utils import titlecase

class Site(GenericSite):
    def __init__(self):
        super(Site, self).__init__()
        self.court_id = self.__module__
        self.url = 'http://www.dccourts.gov/internet/opinionlocator.jsf'

    # tr.class="opinionTR1" or "opinionTR2"
    # <tr class="opinionTR1"><td class="opinionC1"><a href="/internet/documents/13-BG-955.pdf" target="13-BG-955.pdf">13-BG-955</a></td><td class="opinionC2">In re: Scott B. Gilly</td><td class="opinionC3">Nov 7, 2013</td><td class="opinionC4"></td><td class="opinionC5"></td></tr>

    #.opinionC1/a[@href]
    def _get_download_urls(self):
        return list(self.html.xpath('//table//tr/td[1]/a/@href'))

    #.opinionC2
    def _get_case_names(self):
        return list(self.html.xpath('//table//tr/td[2]/text()'))

    #.opinionC3 (e.g., Nov 7, 2013)
    def _get_case_dates(self):
        # Parse strings such as "Nov 7, 2013" directly into date objects.
        return [datetime.strptime(date_string, '%b %d, %Y').date()
                for date_string in self.html.xpath('//table//tr/td[3]/text()')]

    def _get_precedential_statuses(self):
        return ["Published"] * len(self.case_names)

    #.opinionC1 value (e.g., 13-BG-955)
    def _get_docket_numbers(self):
        return list(self.html.xpath('//table//tr/td[1]/a/text()'))
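
To make it easier to see which fields this scraper actually yields per opinion, here is a minimal standalone sketch that runs the same column-position selectors against the sample row quoted in the comments above. It is not part of juriscraper: it swaps lxml for the standard library's `xml.etree.ElementTree` so it runs without dependencies, and `SAMPLE`, `parse_rows`, and the dictionary keys are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# The sample row from the scraper's comments, wrapped in a <table> element.
SAMPLE = (
    '<table>'
    '<tr class="opinionTR1">'
    '<td class="opinionC1"><a href="/internet/documents/13-BG-955.pdf" '
    'target="13-BG-955.pdf">13-BG-955</a></td>'
    '<td class="opinionC2">In re: Scott B. Gilly</td>'
    '<td class="opinionC3">Nov 7, 2013</td>'
    '<td class="opinionC4"></td><td class="opinionC5"></td>'
    '</tr>'
    '</table>'
)


def parse_rows(markup):
    """Return one dict per table row that contains an opinion link.

    Hypothetical helper: the key names are illustrative and are not
    taken from juriscraper or from the JSS spec.
    """
    root = ET.fromstring(markup)
    rows = []
    for tr in root.findall('.//tr'):
        link = tr.find('td[1]/a')  # first cell holds the PDF link
        if link is None:
            continue  # skip header or spacer rows without a link
        rows.append({
            'download_url': link.get('href'),
            'docket_number': link.text,
            'case_name': tr.find('td[2]').text,
            # "Nov 7, 2013" -> datetime.date
            'case_date': datetime.strptime(
                tr.find('td[3]').text, '%b %d, %Y').date(),
        })
    return rows


if __name__ == '__main__':
    for row in parse_rows(SAMPLE):
        print(row)
```

One row becomes one record with a download URL, docket number, case name, and date, which is roughly the per-item shape a feed spec would need to carry for court opinions. ElementTree only accepts well-formed markup, so real court pages would likely still need lxml's HTML parser as in the scraper itself.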