Usage issues for Cutter

scottkleinman commented 2 years ago

The most commonly-used Cutter methods will likely be:

split(): Splits into segments containing a fixed number of tokens
splitn(): Splits into a fixed number of segments
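To make the difference between the two strategies concrete, here is a minimal pure-Python sketch on a plain list of tokens. The helper names split_by_size and split_into_n are hypothetical and are not the Lexos implementation; in particular, how Lexos distributes leftover tokens is not stated here, so the remainder handling below is an assumption.

```python
def split_by_size(tokens, n):
    """Cut a token list into consecutive segments of n tokens (the last may be shorter)."""
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]


def split_into_n(tokens, n):
    """Cut a token list into n consecutive segments of near-equal length."""
    size, extra = divmod(len(tokens), n)
    segments, start = [], 0
    for i in range(n):
        # Assumption: the first `extra` segments each take one leftover token.
        end = start + size + (1 if i < extra else 0)
        segments.append(tokens[start:end])
        start = end
    return segments


tokens = "It is a truth universally acknowledged".split()
by_size = split_by_size(tokens, 4)   # segments of up to 4 tokens each
by_count = split_into_n(tokens, 4)   # exactly 4 segments
print(by_size)
print(by_count)
```

With the same argument of 4, the first call yields two segments and the second yields four, which is the distinction between the two methods.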

The versions of these methods in the Ginsu class take a spaCy doc or a list of spaCy docs as input and return lists of spaCy docs as output. In the Machete class, the input and output are strings.

The output looks like this, where each segment is either a spaCy doc or a string, depending on which class you use:

output = [
    [segment1, segment2, ...],
    [segment1, segment2, ...],
    ...
]

Each internal list of segments corresponds to an original text or doc. This is logical, but it can be confusing, especially if your starting point is a single document, where you now have to reference its segments as output[0]. Depending on what you intend to do with the segments, you may want to flatten the list anyway.
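For the single-document case, a small sketch of the indexing involved may help. Plain strings stand in for segments here, so nothing below depends on Lexos or spaCy.

```python
# Stand-in for cutter output: one source document cut into three segments.
output = [["seg A1", "seg A2", "seg A3"]]

# With a single source document, its segments live at index 0.
segments = output[0]

# Tuple unpacking does the same, and raises ValueError if the outer list
# unexpectedly holds more than one document.
(only_doc_segments,) = output

print(segments)
```

The unpacking form is a design choice: it fails loudly when the assumption of exactly one document does not hold.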

So my question is whether we should have some kind of argument to return the data in different formats, or possibly a helper method to convert this data structure into something more intuitive and easy to manipulate without multiple for loops.

A second concern is that the list-of-lists structure preserves the relationship between segments and their source texts/documents only by way of list indices. Would it be better to use a dict format (or even a dataclass)?
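As a rough illustration of the dataclass idea, the sketch below wraps each inner list together with a label for its source. SegmentedDoc and label_output are hypothetical names invented for this example; nothing like them is confirmed to exist in Lexos.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class SegmentedDoc:
    """Hypothetical container pairing a source label with its segments."""
    source: str
    segments: List[Any]


def label_output(output, sources):
    """Attach one source label to each inner list of segments, in order."""
    if len(output) != len(sources):
        raise ValueError("Need exactly one source label per segmented doc")
    return [SegmentedDoc(source=s, segments=segs) for s, segs in zip(sources, output)]


# Strings stand in for segments; the labels are illustrative.
output = [["a1", "a2"], ["b1", "b2", "b3"]]
labeled = label_output(output, ["Austen_Pride", "Austen_Sense"])
for doc in labeled:
    print(doc.source, len(doc.segments))
```

The relationship between a segment and its source is then carried by the object itself and no longer only by list position.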

jackMurray20 commented 2 years ago

The Ginsu class also seems to be cutting spaCy docs into strings (as the Machete class does), rather than into new spaCy docs as it should.

scottkleinman commented 2 years ago

I've just run it locally, and I am having trouble reproducing this. Try pulling the latest code, just to ensure that's what you are running. Then try the following script.

# Python imports
from lexos import tokenizer
from lexos.io.smart import Loader
from lexos.cutter import Ginsu

# You may have to change the paths
data = [
    "lexos/tests/test_data/txt/Austen_Pride.txt",
    "lexos/tests/test_data/txt/Austen_Sense.txt",
]
loader = Loader()
loader.load(data)

# Convert the texts to spaCy docs
docs = tokenizer.make_docs(loader.texts)

# Cut the docs
cutter = Ginsu()
list_of_segmented_docs = cutter.splitn(docs, n=3)

"""
This should be a list with the format

[[segment1, segment2, ...], [segment1, segment2, ...], ...]

where each segment is a spaCy doc.
"""

# Print extracts from the doc segments
print(f"Number of docs: {len(list_of_segmented_docs)}\n")
for i, segmented_doc in enumerate(list_of_segmented_docs):
    print(f"Doc {i+1}:\n")
    for j, segment in enumerate(segmented_doc):
        print(f"Segment {j+1}:\n")
        print(segment.text[0:25])
        print()

Each segment should be a spaCy doc, and to get the text, you have to reference its text attribute.
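One quick way to check what the segments actually are is to count their type names. The helper segment_types below is a hypothetical name for this sketch, and the small Doc class is only a stand-in so that the example runs without spaCy installed. On real Ginsu output, the expectation described above would be a count containing only Doc.

```python
from collections import Counter


def segment_types(output):
    """Count the type names of all segments in a list-of-lists cutter output."""
    return Counter(type(seg).__name__ for segs in output for seg in segs)


class Doc:
    """Stand-in for a spaCy doc (assumption: only the text attribute matters here)."""

    def __init__(self, text):
        self.text = text


# Deliberately mixed output, to show what a doc/string mix-up would look like.
mixed_output = [
    [Doc("It is a truth"), "universally acknowledged"],
    [Doc("The family of Dashwood")],
]
type_counts = segment_types(mixed_output)
print(type_counts)
```

Any str entry in the count would point to segments being returned as strings.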

scottkleinman commented 1 year ago

Thinking about this further, if a flat list is required on output, this can be achieved with

flat_list = [segment for segments in output for segment in segments]
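If the source of each segment still matters after flattening, the index can be carried along in the same kind of comprehension. This is a small sketch using strings as stand-in segments; itertools.chain.from_iterable is shown as an equivalent standard-library way to flatten.

```python
from itertools import chain

# Stand-in for cutter output: two source documents.
output = [["a1", "a2"], ["b1"]]

# Equivalent to the nested comprehension, using the standard library.
flat_list = list(chain.from_iterable(output))

# Flat list of (doc_index, segment) pairs, so the source is not lost.
flat_with_index = [(i, seg) for i, segs in enumerate(output) for seg in segs]

print(flat_list)
print(flat_with_index)
```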

Also, the Cutter module does not itself handle any project metadata, so the only context available to it is the order of the docs passed to its methods. If the user needs the output in dict format, this is as simple as

mydict = {i: segs for i, segs in enumerate(output)}
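Since the caller knows the order of the inputs, the same one-liner can key the dict by something more descriptive than an index. The sketch below uses file stems taken from the paths in the test script; the stand-in strings replace real segments, and the length check is an added safeguard for this example only.

```python
from pathlib import Path

paths = [
    "lexos/tests/test_data/txt/Austen_Pride.txt",
    "lexos/tests/test_data/txt/Austen_Sense.txt",
]
# Stand-in for cutter output, in the same order as the paths.
output = [["p1", "p2", "p3"], ["s1", "s2", "s3"]]

# zip() silently truncates on a length mismatch, so check first.
if len(paths) != len(output):
    raise ValueError("Expected one path per segmented doc")

# Key each list of segments by its file stem. Note that files sharing
# a stem would overwrite one another in the dict.
segments_by_name = {Path(p).stem: segs for p, segs in zip(paths, output)}

print(list(segments_by_name))
```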

There does not seem to be value in adding helper functions for tasks the user can handle with one line of code.

scottkleinman / lexos

Usage issues for Cutter #5