Closed: thetaaaaa closed this issue 3 months ago
INCEpTION does not add any Line
annotations, so dkpro-cassis cannot find any.
You can enhance your Python script to obtain the sofa text from the CAS, split it into lines, calculate the begin/end of each line, then select
the annotations you are interested in and filter out those not in a particular line segment.
ChatGPT suggests something like this:
sofa_text = """This is the first line.
This is the second line.
And this is the third line."""
# Function to split text into lines and remember their positions
def get_line_positions(text):
lines = text.splitlines()
positions = []
current_pos = 0
for line in lines:
start_pos = current_pos
end_pos = start_pos + len(line)
positions.append((start_pos, end_pos))
# Increment current_pos by length of line + 1 (for the newline character)
current_pos = end_pos + 1
return positions
# Function to filter annotations within given line boundaries
def filter_annotations_within_line_boundaries(annotations, start, end):
filtered_annotations = [annotation for annotation in annotations if start <= annotation.begin and annotation.end <= end]
return filtered_annotations
# Split the text into lines and get their positions
line_positions = get_line_positions(sofa_text)
annotations = cas.select('my.Annotation')
# Iterate over each line's begin/end positions
for start, end in line_positions:
# Filter annotations within the current line boundaries
filtered_annotations = filter_annotations_within_line_boundaries(annotations, start, end)
# Print or process the filtered annotations
print(f"Line from position {start} to {end} has annotations:")
for annotation in filtered_annotations:
print(f" Annotation from {annotation.begin} to {annotation.end}")
Btw. why use a window at all? Maybe it would be easier if you just did cas.select('webanno.custom.Trigger')
directly?
Thank you. Using cas.select('webanno.custom.Trigger')
I can only get the sofa_text of the entire document, but not the sofa_text of the exact line where the event is embedded.
Assuming you first build yourself an index of the begin/end positions of each line (cf. the code suggested above), you could then find the line based on the offsets of your trigger. But really, a line is a bit arbitrary, no? You could also just generate a fixed window before the begin of a trigger and after the end of a trigger.
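The fixed-window idea could be sketched like this (a minimal sketch, not tied to any particular library; the window size and the commented-out usage with a dkpro-cassis CAS are assumptions you would adapt to your project):

```python
def context_window(text, begin, end, window=100):
    """Return the text from `window` characters before `begin` to `window`
    characters after `end`, clamped to the document boundaries."""
    start = max(0, begin - window)
    stop = min(len(text), end + window)
    return text[start:stop]

# With a loaded CAS, this might be used as follows (type name as in this thread):
# for trigger in cas.select('webanno.custom.Trigger'):
#     snippet = context_window(cas.sofa_string, trigger.begin, trigger.end)
```

Clamping with max/min keeps the slice valid for triggers near the start or end of the document.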
That is a good idea. I will try it, thank you!
Is your feature request related to a problem? Please describe. I uploaded and labeled a single large text file with many lines, using the brat (line-oriented) mode. When I exported the annotations, there was only one .xmi file, which means there is only one sofa_text. The problem is, when training a deep learning model, I need to supply the sofa_text together with its labeled span positions, and the entire sofa_text is too long for the model to process. How can I export the annotation results and their .xmi files in units of lines?
Describe the solution you'd like I labeled a large text file in brat (line-oriented) mode. I want to export the annotation results per line, consistent with the line-oriented mode. For example, if a text file has 10 lines, I would like the exported annotation result to contain 10 .xmi files, which also means I would get 10 small sofa_texts.
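Short of a per-line export feature, the single exported CAS could be post-processed into per-line records by re-basing each annotation's offsets to its line (a hedged sketch; the record format and the assumption that annotations carry `begin`/`end` attributes, as dkpro-cassis annotations do, are mine, not an INCEpTION feature):

```python
def per_line_records(sofa_text, annotations):
    """Group annotations by line, re-basing their offsets to be line-relative.

    Annotations spanning a line break are dropped, matching the filtering
    logic suggested earlier in this thread.
    """
    records = []
    pos = 0
    for line in sofa_text.splitlines():
        start, end = pos, pos + len(line)
        spans = [(a.begin - start, a.end - start)
                 for a in annotations
                 if start <= a.begin and a.end <= end]
        records.append({'text': line, 'spans': spans})
        pos = end + 1  # +1 for the newline character
    return records
```

Each record then holds one short text plus line-relative span positions, which is the shape a training pipeline typically needs.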
Describe alternatives you've considered I tried to use dkpro-cassis to parse the event annotations in each line; however, dkpro-cassis can only parse annotations in units of sentences (which are auto-segmented from the lines), rather than lines. Below is my code, in which only segmentation.type.Sentence works.
Hence, as an alternative, I would like dkpro-cassis to be able to parse the .xmi in units of lines rather than sentences, because for a single event, the event roles are often distributed across several sentences.
Additional context I think the main problem is the exported .xmi file: it is mandatorily organized in units of sentences rather than lines. My raw text file has just 203 lines, but the .xmi file has more than 2000 sentences.