
Lexcaliber

Lexcaliber is an ongoing project to develop novel algorithms and analysis techniques for legal research.

Our current efforts center on recommendation and discovery over the federal case-law citation network (see Results below).

See Eksombatchai et al. (2017), Huang et al. (2021), and Sun et al. (2016) for the literature that informs our current approaches.

Our current work is focused on the federal appellate corpus (all circuit courts as well as the Supreme Court), with the aim of building systems that generalize to other jurisdictions.

This repository contains the bulk of the logic and infrastructure powering this project, as well as command-line and REST interfaces. See lexcaliber/explorer for more information about the prototype web interface we're building to demonstrate the technology.

Results

The main thrust of our efforts so far has been recommendation and discovery. Given some information from the user (a relevant case or two, key words or phrases, or a document in progress), we would like to examine the roughly 1,000,000-opinion federal appellate corpus and recommend relevant opinions that aid the user's research or argument.
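
For intuition only, the following is a toy random-walk recommender over a citation graph, in the spirit of Eksombatchai et al. (2017). It is not the project's actual algorithm; citation_graph and the parameter values are purely illustrative.

    import random
    from collections import Counter

    def random_walk_recommend(citation_graph, seed_cases, k=20,
                              num_walks=1000, walk_length=5):
        """Toy recommender: cases visited most often by short random walks
        started from the user's seed cases are returned as recommendations."""
        visits = Counter()
        for _ in range(num_walks):
            node = random.choice(seed_cases)
            for _ in range(walk_length):
                neighbors = citation_graph.get(node, [])
                if not neighbors:
                    break
                node = random.choice(neighbors)
                visits[node] += 1
        for seed in seed_cases:
            visits.pop(seed, None)  # don't recommend the inputs themselves
        return [case for case, _ in visits.most_common(k)]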

Methodology

Our initial results have been very promising. The primary metric we currently use is recall, the percentage of documents defined as relevant that we are able to recommend successfully. We adopt the measurement approach of Huang et al. (2021); a code sketch of the procedure follows the list below.

  1. We select a random opinion in the federal corpus and remove it from our network (as if the opinion never existed).
  2. We input all but one of the opinion’s neighbors into the recommendation software.
  3. We measure whether the omitted neighbor was the top recommendation, in the top 5 recommendations, or in the top 20 recommendations.
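
Here is a minimal sketch of this leave-one-out measurement, assuming a simple citation_graph mapping (opinion ID to neighboring opinion IDs) and a recommend(seeds, k) placeholder standing in for the project's recommendation routine; neither is the actual lexcaliber API. The min_neighbors argument corresponds to the neighbor-count restriction used for the second set of results below.

    import random

    def leave_one_out_recall(citation_graph, recommend, num_cases=20, trials=5,
                             min_neighbors=2, ks=(1, 5, 20)):
        """Estimate top-k recall by hiding one neighbor of each sampled opinion."""
        hits = {k: 0 for k in ks}
        total = 0
        eligible = [c for c in citation_graph
                    if len(citation_graph[c]) >= min_neighbors]
        for case in random.sample(eligible, num_cases):
            neighbors = list(citation_graph[case])
            for _ in range(trials):
                held_out = random.choice(neighbors)
                seeds = [n for n in neighbors if n != held_out]
                # The sampled case itself is excluded from the results,
                # as if the opinion never existed.
                recs = [r for r in recommend(seeds, k=max(ks)) if r != case]
                total += 1
                for k in ks:
                    if held_out in recs[:k]:
                        hits[k] += 1
        return {f"top{k}": 100.0 * hits[k] / total for k in ks}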

Our initial results are as follows:

For 20 cases after 5 trials each:
        top1: 10.0%
        top5: 21.0%
        top20: 35.0%
Majority vote control for 20 cases after 5 trials each:
        top1: 0.0%
        top5: 0.0%
        top20: 0.0%

If we restrict the sampled cases to those with at least five neighbors (a reasonable restriction, given the many orders and slip opinions with few or no citations), our results are even better:

For 20 cases after 5 trials each:
        top1: 18.0%
        top5: 30.0%
        top20: 47.0%
Majority vote control for 20 cases after 5 trials each:
        top1: 0.0%
        top5: 0.0%
        top20: 0.0%

These results are comparable to those of Huang et al. (2021), which is notable in light of the much larger federal appellate corpus, and, in our view, they suggest significantly more potential to generalize to other jurisdictions. We further expect these results to improve once we incorporate textual citation context into our recommendation computation.

Getting set up

  1. Set PROJECT_PATH to the repository directory, using .env or your standard bashrc.
  2. Set the PostgreSQL server hostname, port, and, if necessary, username and password in .env (making sure you have created the empty database).
  3. To set up the database schema, run alembic upgrade head. Make sure you have a username in .env.
  4. To install the CLI, run pip install --editable . in the main project directory. Run lxc --help for a list of all commands.
  5. To populate your database with data from CourtListener, run lxc data download with your desired jurisdictions.
  6. To run the API server: lxc server run

Bonus: Run git config blame.ignoreRevsFile .git-blame-ignore-revs so your git blame doesn't catch our reformatting commits.
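
Putting steps 3 through 6 together, a first-time setup from the project root might look like the following (flags for lxc data download are omitted here; run lxc --help for the available commands and options):

    alembic upgrade head       # set up the database schema (step 3)
    pip install --editable .   # install the lxc CLI (step 4)
    lxc data download          # populate from CourtListener with your
                               # desired jurisdictions (step 5)
    lxc server run             # run the API server (step 6)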

Migrations
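
Database schema changes are managed with Alembic (the tool used in step 3 of the setup instructions). After modifying the models, a typical workflow might look like this; these are standard Alembic commands rather than anything project-specific:

    alembic revision --autogenerate -m "describe your schema change"
    alembic upgrade head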

[Figure: A visualization of the most important cases in Roe v. Wade's egonet]

[Figure: 4000 important SCOTUS cases by RolX classification]