s-brekke / IUROPA-uploader

Upload data to the IUROPA server

Defendant wrongly defined when reading from pdf #3

Closed s-brekke closed 2 years ago

s-brekke commented 2 years ago

In case ECLI:EU:C:1988:457, "Synetairistiki Eteria Viomichanikis Anaptixeos Thrakis Sevath ABE" is listed as a defendant rather than an applicant. This seems to follow from a failure to observe the "v" between the parties: the script instead reads the "and" before the last applicant as the separator between the parties.
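A minimal sketch of the intended behaviour, assuming the title is available as a single string: split on the first " v " to separate the two sides, and only then split each side on commas and "and". The function name `split_parties` is hypothetical and not taken from the uploader code, and the sketch would wrongly split a party whose own name contains "and".

```python
import re


def split_parties(title):
    """Split a case title into (applicants, defendants) on the first ' v '.

    Hypothetical helper for illustration; not part of IUROPA-uploader.
    """
    sides = re.split(r"\s+v\s+", title, maxsplit=1)
    if len(sides) != 2:
        # No ' v ' found: return None instead of falling back on 'and'.
        return None

    def names(side):
        # Split one side of the title on commas and on the word 'and'.
        parts = re.split(r",\s*|\s+and\s+", side)
        return [p.strip() for p in parts if p.strip()]

    return names(sides[0]), names(sides[1])
```

Returning `None` when no " v " is present makes the failure visible to the caller, rather than silently treating "and" as the party separator.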

s-brekke commented 2 years ago

In theory, issues such as this could be resolved by matching against the applicants in the case names of joined cases. The company is the applicant in C-120/87, so the issue could be resolved when that case is scraped.

However, this is difficult here because case C-120/87 is listed simply as Sevath. The script would have to guess that "Sevath" is the same as "Synetairistiki Eteria Viomichanikis Anaptixeos Thrakis Sevath ABE", at which point it begins to get messy.

It might be possible to draw on the order in which the applicants are listed, or simply on the fact that they are numbered.
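The guess described above, that a short case name such as "Sevath" refers to one of the longer party names, could be sketched as a whole-word match that only accepts an unambiguous hit. Both function names are hypothetical, and this is an illustration of the idea rather than the approach used in the script.

```python
def matches_short_name(short_name, full_name):
    """True if every word of short_name occurs as a whole word in full_name."""
    short_words = short_name.casefold().split()
    full_words = set(full_name.casefold().split())
    return bool(short_words) and all(w in full_words for w in short_words)


def find_party(short_name, parties):
    """Return the single party matching short_name, or None if zero or several match."""
    hits = [p for p in parties if matches_short_name(short_name, p)]
    return hits[0] if len(hits) == 1 else None
```

Requiring exactly one match keeps the guess conservative: if two parties share the word, the function declines to choose.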

s-brekke commented 2 years ago

Status: the pdfclean function appears to fail to remove all case titles listed in page headings, making " v "s appear where they shouldn't be.

These are identified in the headline3 element, but were not previously removed for some reason. Instead, the problem was worked around with a hack in readpdf.
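As a rough illustration of removing such headings before the parties are parsed, the sketch below drops a page's first line when it contains " v " and repeats across pages. It assumes each page is a list of text lines; it does not use the headline3 element, whose structure is not shown here, and `strip_page_headings` is a hypothetical name.

```python
from collections import Counter


def strip_page_headings(pages, min_repeats=2):
    """Drop first lines that contain ' v ' and repeat across pages.

    `pages` is assumed to be a list of pages, each a list of text lines.
    """
    # Count how often each first line occurs across all non-empty pages.
    first_lines = Counter(page[0].strip() for page in pages if page)
    headings = {
        line
        for line, count in first_lines.items()
        if count >= min_repeats and " v " in line
    }
    # Remove only the first line of a page, and only if it is a known heading.
    return [
        [
            line
            for i, line in enumerate(page)
            if not (i == 0 and line.strip() in headings)
        ]
        for page in pages
    ]
```

Body lines that happen to contain " v " are left alone, since only the first line of each page is ever considered for removal.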