sebimo / LegalSum

Codebase for the summarization of German court rulings

Questions on how to load the dataset without database #1

Open dennlinger opened 2 years ago

dennlinger commented 2 years ago

Hi, first of all, I'm very grateful to see more German-centric datasets, especially for the legal domain! I'm currently trying to have a look at the dataset; however, I'm struggling to follow some of the implicit assumptions about the data and how to interpret the .json files in the downloadable Dropbox file.

To be precise, the main questions are the following:

  1. In load_verdict(), it is indicated that there are (at most) two paragraphs in the guiding principle. Is there any explanation for this particular processing step that I'm missing?
  2. ./data/filter_files implies that there are additional filtering steps in place. Are these filters already applied to the data available in the Dropbox download, or would I have to apply them myself?
  3. Related to the previous point, the files in ./model/ contain fewer than the 100k samples mentioned in the paper. In fact, they contain 79937 (train), 9992 (validation) and 9993 (test) references, which sums to 99922 files. The downloaded Dropbox file contains 100018 samples, so there is a discrepancy of roughly a hundred files. I haven't (yet) checked which ones exactly are missing (a sketch of how I would check follows this list), but it would be great to hear your take on this.
  4. From a "traditional summarization perspective", I'm missing clear references to what constitutes a "reference text" in a single sample. From my understanding, the summary is basically the "guiding principle", however, I could not find any reference to what fields you used for the reference. Is it simply the concatenation of facts & reasoning (again, based on the values returned in load_verdict)?
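For reference, this is roughly how I would check which files are missing. It is only a sketch under assumptions that may well be wrong: the split file names are hypothetical, the split files are assumed to be plain-text lists with one judgment file name per line, and the Dropbox download is assumed to be unpacked into a folder of .json files.

```python
from pathlib import Path

# Hypothetical file names/layout: split files in ./model/ as plain-text lists
# with one judgment file name per line, Dropbox download unpacked to ./data/dataset/.
split_ids = set()
for split in ["train_files.txt", "val_files.txt", "test_files.txt"]:
    with open(Path("model") / split, encoding="utf-8") as f:
        split_ids.update(line.strip() for line in f if line.strip())

download_ids = {p.name for p in Path("data/dataset").glob("*.json")}

print(len(split_ids), "files in the splits vs.", len(download_ids), "files in the download")
print("Examples only present in the download:", sorted(download_ids - split_ids)[:10])
```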

Many thanks in advance for taking the time to answer these questions! Best, Dennis

sebimo commented 2 years ago

Hi Dennis,

Thank you for your interest in our research work! I will add some information about the JSON format so that its schema becomes clearer. For data access you can directly use the load_verdict function, which loads a list of sentences for each of the different judgment sections. We worked directly with the JSON files in our research.
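As an illustration, working with a single judgment file directly might look roughly like the sketch below. The key names follow the ones discussed in this thread ("guiding_principle", "facts", "reasoning", "tenor") and each is assumed to map to a list of sentences; the actual schema may differ slightly.

```python
import json

# Rough sketch of direct JSON access; key names and structure are assumptions
# based on this thread, not a definitive description of the schema.
with open("Y-300-Z-BECKRS-B-2016-N-67530.json", encoding="utf-8") as f:
    verdict = json.load(f)

guiding_principle = verdict["guiding_principle"]  # target summary (up to two entries)
facts = verdict["facts"]                          # sentences of the facts section
reasoning = verdict["reasoning"]                  # sentences of the reasoning section
tenor = verdict["tenor"]                          # legal consequences, mostly templated

print(len(facts), "fact sentences,", len(reasoning), "reasoning sentences")
```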

Considering your questions:

  1. There are two types of guiding principles: those from the official judgment and those from third parties. In this case we did not differentiate between them as generation targets. Each verdict only has one of the two.
  2. Some judgments were discarded during the preprocessing steps (e.g. because the judgment or summary has no content). The filter files were used as a log to keep track of the judgments that were used. The dataset only contains the final collections, and you do not need to apply any further processing steps.
  3. I assume that this discrepancy happened during the final steps of manual processing: for some reason the train/validation/test split was created on a slightly older version of the dataset (without the final ~100 judgments). I will investigate this further.
  4. The summary in this case is the "guiding principle", which in many cases is used to describe the main legal reasoning. In this work we always used the entire text (facts + reasoning) as the reference text. From a legal standpoint, using only the reasoning part would probably be sufficient, but in this case we did not want to restrict the models too much. In some cases we also encoded the facts and reasoning parts separately to see whether a different aggregation strategy is beneficial.
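To make point 4 concrete, a (reference text, summary) pair could be assembled roughly as in the following sketch; it assumes that each section is a list of sentences and that the guiding principle is stored as (up to) two lists of sentences, which may not match the repository's exact representation.

```python
def build_pair(verdict: dict) -> tuple[str, str]:
    """Hypothetical helper: turn one loaded judgment into a (reference, summary) pair."""
    # Reference text: the full judgment, i.e. the facts followed by the reasoning.
    reference = " ".join(verdict["facts"] + verdict["reasoning"])
    # Target summary: the guiding principle, assumed here to be stored as (up to)
    # two lists of sentences that are simply concatenated.
    summary = " ".join(sentence for part in verdict["guiding_principle"] for sentence in part)
    return reference, summary
```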

I hope this answers your questions. If you have any further questions, please do not hesitate to ask.

Best regards, Sebastian

dennlinger commented 2 years ago

Hi Sebastian, thanks for your detailed response! I have two minor follow-up questions, but I think for the most part my questions about the processing have been fully answered.

One clarifying question on the guiding principle: did you investigate the distribution of official vs. third-party writings? It would be interesting to see whether they differ in textual structure or other aspects that might matter for summary generation, but I'm not familiar enough with the legal context to judge the differences here.

You also mention that you used different aggregation methods (separate encodings vs. concatenation) for the facts & reasoning texts. I assume you did not find any meaningful differences? This is particularly interesting because many other domains in summarization datasets exhibit strong positional biases (e.g., news articles), which get obfuscated for concatenated texts.

Thanks again! Best, Dennis

sebimo commented 2 years ago

Hi Dennis,

No, we did not further investigate the distribution of official/third-party summaries. I would assume that there is roughly a 50/50 split between them. I have to note, though, that the level of detail might differ between the two author types. Some official summaries tend to be rather short and more high-level, but it is hard to quantify this observation and generalize it to the entire dataset. But as I said, we did not look into this too much.

We primarily used the separate encoding strategy for abstractive summarization, to reduce the amount of information a model needs to keep track of in one embedding. For extractive summarization we only used the concatenation, so we did not directly compare the two. One thing I want to note here: the language of the facts and the reasoning differs, and there are often certain language cues which indicate the beginning of the reasoning part. As a consequence, a model should be able to differentiate between them.
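A rough illustration of the two aggregation strategies described above; the encode() function is a placeholder for the actual encoder used in the repository, and mean-pooling is just one example of an aggregation.

```python
import numpy as np

def encode(sentences: list[str]) -> np.ndarray:
    """Placeholder for the actual encoder (one vector per sentence)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 768))

facts = ["First sentence of the facts.", "Second sentence of the facts."]
reasoning = ["First sentence of the reasoning."]

# Separate encoding: encode facts and reasoning independently and keep two
# section embeddings (here simply mean-pooled per section).
separate = np.stack([encode(facts).mean(axis=0), encode(reasoning).mean(axis=0)])

# Concatenation: treat the judgment as one sequence of sentences and pool once.
concatenated = encode(facts + reasoning).mean(axis=0)
```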

Considering biases: there does not seem to be an obvious bias towards selecting sentences from the beginning of either text part, as seen in the comparison between the random sentence selection and the lead baselines. We also could not find any clear indicator of positional bias when investigating the labels used for extractive summarization. But this could also be due to the level of abstractiveness/novelty of the summaries.
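For what it's worth, a simple positional check of the extractive labels could look like the following sketch; the label format here (selected sentence indices plus document length per sample) is an assumption, not the repository's actual format.

```python
from collections import Counter

def position_histogram(samples: list[tuple[list[int], int]], bins: int = 10) -> Counter:
    """Bucket the relative positions of selected sentences into equal-width bins.

    Each sample is assumed to be (selected_sentence_indices, number_of_sentences);
    this label format is an assumption for illustration only.
    """
    hist = Counter()
    for selected, n_sentences in samples:
        for idx in selected:
            hist[min(int(idx / n_sentences * bins), bins - 1)] += 1
    return hist

# A roughly flat histogram indicates no strong lead (or tail) bias.
print(position_histogram([([0, 5, 9], 10), ([3, 7], 8)]))
```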

Best regards, Sebastian

dennlinger commented 2 years ago

Perfect, thanks a lot again for the detailed explanations! I'll close this for now, since all of my points have been addressed.

dennlinger commented 2 years ago

Hi Sebastian, sorry to reopen the issue, but when going through the samples, I noticed an inconsistency with the statement above:

  1. There are two types of guiding principles: those from the official judgment and those from third parties. In this case we did not differentiate between them as generation targets. Each verdict only has one of the two.

I was assuming that this refers to the first and second elements, respectively, of the "guiding_principle" segment of the JSON documents. When looking at samples, however, I noticed that some documents have content in both segments, which seems confusing to me. Again, load_verdict is not helping here, since it naively concatenates both fields, irrespective of the actual content. As an example of the encountered issue, I looked at the file Y-300-Z-BECKRS-B-2016-N-67530.json.
EDIT: FWIW, I have just checked: having content in both segments at the same time affects about 1200-1300 samples in total (~1.25%).
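The check I ran was roughly the following sketch; it assumes the unpacked Dropbox download is a folder of .json files and that "guiding_principle" holds exactly two lists of sentences.

```python
import json
from pathlib import Path

# Assumed layout: unpacked Dropbox download as a folder of .json files,
# "guiding_principle" holding exactly two lists of sentences.
both = total = 0
for path in Path("data/dataset").glob("*.json"):
    total += 1
    first, second = json.loads(path.read_text(encoding="utf-8"))["guiding_principle"]
    if first and second:
        both += 1

print(f"{both} of {total} samples have content in both segments ({both / total:.2%})")
```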

Also, some further questions:

  1. Do you know where the third-party guiding principles come from, i.e. who writes them?
  2. What exactly does the "tenor" segment contain, and is it of interest as a summarization target?

Best, Dennis

sebimo commented 2 years ago

Hi Dennis,

Sorry, I misremembered. The splitting of the guiding principle was based on whether the entries were annotated as "redaktionell" (editorial) or "amtlich" (official), and they were assigned accordingly. So "redaktionell" would be the third-party statements in this case. As both always come from the same judgment, we concatenated them when using them as a target summary. Considering the source of the "redaktionell"/third-party summaries: I cannot make a statement for all of them, but many are provided by a legal publisher.

Considering the Tenor: it can be seen as the summary of the legal consequences, i.e. who is in the right, who has to pay, etc. This segment is not so interesting to study for summarization, as it can be generated from a template in most cases.

Additionally, I added some information about the JSON files.

Best regards, Sebastian