niklaswais / gesp

https://nwais.de/gesp
MIT License

ECLI? #4

Closed zeuner closed 1 year ago

zeuner commented 1 year ago

How hard would it be to map the current decision identifiers to standardized ECLIs? This might be valuable for interoperability with other projects operating on court decisions, including international ones.

niklaswais commented 1 year ago

@galvusdamor has been working on ECLIs. I also think this is a useful extension and will take care of the implementation. Thanks!

niklaswais commented 1 year ago

I have created a new #feature-ECLI branch and implemented the changes. Currently the ECLIs only play a role when post-processing (or pre-processing, depending on your view) is enabled (flag: -pp). Going forward, I will align the way files are named and organized.

galvusdamor commented 1 year ago

@zeuner @niklaswais As Niklas wrote, I added a feature that "postprocesses" the decisions. This includes three steps:

  1. Extracting the meta-data of the decision that is contained in the website itself (court, date, case id ["Aktenzeichen"], senate, ECLI [if provided], ...).
  2. Inferring the "correct" ECLI from this information. There are two problems: (a) Not all decisions provide an ECLI (this is different in, e.g., the Netherlands, where rechtspraak.nl labels all decisions with an ECLI itself). For these decisions we use the ECLI we infer. (b) The ECLI given by the official sources is not always correct. For example, the ECLI is supposed to contain only upper-case letters, but it does not always. Further, there are specific rules for adding separators and for handling characters outside [a-zA-Z0-9]. Some courts seem to ignore these rules or interpret them very, very loosely. In that case we note that the official ECLI is incorrect, but still keep it.
  3. Extracting the raw text of the decision from the website. As gesp downloads websites directly and we have to use various sources (states, federal, judicialis), the content we get is not very uniform. This last step extracts the plain text and removes everything that is website clutter.
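
To make step 2 a bit more concrete, here is a rough sketch of the check, not the actual gesp code. The function names are hypothetical, and the regex assumes the general ECLI layout (ECLI:country:court:year:ordinal, upper case only); the length limits in it are an assumption for illustration.

```python
import re

# Assumed layout: ECLI:<country>:<court>:<year>:<ordinal>, upper case only.
# The length limits are an assumption for illustration, not taken from gesp.
ECLI_PATTERN = re.compile(
    r"ECLI:[A-Z]{2}:[A-Z][A-Z0-9]{0,6}:[0-9]{4}:[A-Z0-9.]{1,25}"
)


def is_well_formed(ecli):
    """Return True if the string matches the assumed ECLI layout."""
    return ECLI_PATTERN.fullmatch(ecli) is not None


def choose_ecli(official, inferred):
    """Pick the ECLI to use and flag a malformed official one.

    A provided ECLI is kept even if it is malformed (it is only flagged);
    the inferred ECLI is used when the source provides none.
    Returns a tuple (ecli, official_is_malformed).
    """
    if official:
        return official, not is_well_formed(official)
    return inferred, False
```

With this, an official ECLI containing lower-case letters would be returned unchanged together with a flag saying it is malformed, while a decision without an official ECLI simply gets the inferred one.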

Postprocessing then stores the meta-data and raw text in a file named after the case's ECLI (either provided or self-computed).
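
As an illustration only, storing a decision under its ECLI could look roughly like the following. The function name `store_decision`, the JSON layout, and the replacement of ":" in the file name are all assumptions made for this sketch; the actual file format and naming in gesp may differ.

```python
import json
from pathlib import Path


def store_decision(out_dir, ecli, meta, raw_text):
    """Write meta-data and raw text to one JSON file and return its path."""
    # Assumption: ':' is replaced because it is not a valid file-name
    # character on every platform; gesp may name files differently.
    file_name = ecli.replace(":", "_") + ".json"
    path = Path(out_dir) / file_name
    path.parent.mkdir(parents=True, exist_ok=True)
    # The original ECLI is kept inside the record, so nothing is lost
    # by the file-name substitution.
    record = {"ecli": ecli, "meta": meta, "text": raw_text}
    path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return path
```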

@niklaswais I've had to add one new court to the ECLI inference CSV. Also, the website of Schleswig-Holstein has changed; I've added a commit to my fork that fixes this. Can you cherry-pick both changes? Otherwise the #feature-ECLI branch looks clean. Do you want to merge it?

Cheers, Gregor

niklaswais commented 1 year ago

I will cherry-pick the changes. Before I merge both branches, I want to align the folder structures that result from a "normal" and a "-pp" call. That will be done quickly.