openlegaldata / oldp

Open Legal Data Platform
https://openlegaldata.io
MIT License

Misaligned reference links in full text #35

Open dennlinger opened 5 years ago

dennlinger commented 5 years ago

For some of the decisions (e.g., this one), the references are not aligned at all with the corresponding occurrences in the text.

Is there any way to work with the data prior to the annotation (as it is available through the JSON), to potentially help with investigating this?

malteos commented 5 years ago

Hi @dennlinger,

thanks for your bug report. We are already aware of this bug but couldn't fix it until now (see https://github.com/openlegaldata/legal-reference-extraction/issues/1 ).

If the original text without any annotation would help you, we could provide it as an additional field in the API response.

Best, Malte

dennlinger commented 5 years ago

Hi Malte, unfortunately I didn't see the bug report before. I was mostly wondering whether you could provide some of the actual samples used in the dataset for the live webpage (raw HTML before processing, maybe from the case referenced in the bug report) to help with the debugging.

The test cases provided in legal-reference-extraction seem simple enough at first glance, and I assume you are checking for correctness on those anyways. I'm aware of legal-datasets, but that one is unfortunately empty as well.

I think the feature is extremely helpful if working properly, and could potentially be extended, if you are willing to accept contributions on this issue.

Best, Dennis

malteos commented 5 years ago

Contributions are always welcome!

I'll try to update the API accordingly within the next week.

malteos commented 5 years ago

The decision content which is currently available via the API does not contain any annotations. Thus, it should not be affected by the reference extraction bug. The API serializer returns the content field that holds the HTML as we obtained it from the source.

For the UI, all annotations are added later (See https://github.com/openlegaldata/oldp/blob/master/oldp/apps/cases/models.py#L186-L209 )

fchrubasik commented 4 years ago

After running some tests (for example on this document), it seems that the references are misaligned because of an HTML offset, i.e. special characters like "ö" being replaced with HTML entities such as "&amp;ouml;". The references are placed as if they were applied to plain text, without taking these longer entity strings into account, which results in the misalignment. I am currently working on a bugfix for this issue together with @dennlinger.
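To illustrate the offset problem described above, here is a minimal Python sketch. It is not the actual patch, and the function name `build_offset_map` and the sample sentence are made up for illustration. It assumes that reference positions are computed on the decoded plain text while the markers are inserted into a string that still contains HTML entities.

```python
import html


def build_offset_map(escaped: str) -> list[int]:
    """Map each plain-text character index to its index in the escaped string.

    Hypothetical helper: a "&...;" run counts as one plain-text character
    whenever html.unescape() decodes it to exactly one character.
    """
    mapping = []
    i = 0
    while i < len(escaped):
        mapping.append(i)
        if escaped[i] == "&":
            end = escaped.find(";", i)
            if end != -1 and len(html.unescape(escaped[i:end + 1])) == 1:
                i = end + 1  # skip the whole entity
                continue
        i += 1
    mapping.append(len(escaped))  # sentinel so that end offsets can be mapped too
    return mapping


# Made-up sample: the stored content uses entities, the extractor sees plain text.
escaped = "gem&auml;&szlig; &sect; 823 BGB haftet"
plain = html.unescape(escaped)

# Offsets of the reference as found in the plain text.
start = plain.index("§ 823 BGB")
end = start + len("§ 823 BGB")

# Applying plain-text offsets directly to the escaped string hits the wrong span.
naive_span = escaped[start:end]

# Translating the offsets first recovers the intended span.
offsets = build_offset_map(escaped)
mapped_span = escaped[offsets[start]:offsets[end]]

print(repr(naive_span))
print(repr(mapped_span))
```

Each entity before a reference shifts the marker further from its target, which would match the observation that documents with many umlauts or "§" signs are affected the most.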

malteos commented 4 years ago

Hi @fchrubasik & @dennlinger

thanks again for your contribution! The last months have been really busy over here so I only today managed to finally deploy your changes to production. I'm really sorry for that!

I'm currently reprocessing all our documents with the changes (that might take 10hrs or so).

Did you end up doing anything with the citation data?

Best, Malte

dennlinger commented 4 years ago

Hi, thanks for incorporating the changes! So far we haven't directly used the citations from openlegaldata, but we had a thesis project by another student working on Bafin data and European Directives. As for this patch, let me know if any problems come up. I think there is a chance that, depending on your input format, some files are still processed incorrectly, but I'll happily check a bunch of documents once the changes are live. ;-)

Cheers, Dennis

dennlinger commented 4 years ago

Not sure where to follow up with this, but it seems the references are still misaligned on the live server. Did we miss anything in the original bugfix that might cause this?

malteos commented 4 years ago

The case mentioned in the issue seems to have all references correct ( https://de.openlegaldata.io/case/bag-2019-07-11-6-azr-4017 ). Do you have an example of references that are still misaligned?

dennlinger commented 4 years ago

I was specifically looking at the most recent "Urteil" at the time of writing (https://de.openlegaldata.io/case/bverwg-2020-08-06-6-b-1120). Great to see that the original issue is fixed, though!

malteos commented 4 years ago

OK. Then let's reopen this one.