Closed scottkleinman closed 1 year ago
The Ginsu class also seems to be cutting spaCy docs into strings (similar to the Machete method), rather than into new spaCy docs as it should be.
I've just run it locally, and I am having trouble reproducing this. Try pulling the latest code, just to ensure that's what you are running. Then try the following script.
# Python imports
from lexos import tokenizer
from lexos.io.smart import Loader
from lexos.cutter import Ginsu
# You may have to change the paths
data = ["lexos/tests/test_data/txt/Austen_Pride.txt", "lexos/tests/test_data/txt/Austen_Sense.txt"]
loader = Loader()
loader.load(data)
# Convert the texts to spaCy docs
docs = tokenizer.make_docs(loader.texts)
# Cut the docs
cutter = Ginsu()
list_of_segmented_docs = cutter.splitn(docs, n=3)
"""
This should be a list with the format
[[segment1, segment2, ...], [segment1, segment2, ...], ...]
where each segment is a spaCy doc.
"""
# Print extracts from the doc segments
print(f"Number of docs: {len(list_of_segmented_docs)}\n")
for i, segmented_doc in enumerate(list_of_segmented_docs):
    print(f"Doc {i+1}:\n")
    for j, segment in enumerate(segmented_doc):
        print(f"Segment {j+1}:\n")
        print(segment.text[0:25])
        print()
Each segment should be a spaCy doc, and to get the text, you have to reference its text attribute.
Thinking about this further, if a flat list is required on output, this can be achieved with
flat_list = [segment for segments in output for segment in segments]
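An equivalent way to flatten, using only the standard library, is itertools.chain.from_iterable. The sketch below uses short strings as stand-ins for segments, and output is assumed to hold the nested list returned by the cutter.

```python
from itertools import chain

# Stand-in for the nested cutter output: one inner list per source doc
output = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]

# chain.from_iterable walks the inner lists in order without a nested loop
flat_list = list(chain.from_iterable(output))

# Pairing each segment with its source index keeps the link that flattening loses
indexed = [(i, seg) for i, segs in enumerate(output) for seg in segs]
```

The indexed variant is worth considering whenever the flat list still needs to be traced back to the original docs.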
Also, the Cutter module does not itself handle any project metadata, so the only context available to it is the order of the docs passed to its methods. If the user needs it in dict format, this is as simple as
mydict = {i: segs for i, segs in enumerate(output)}
There does not seem to be value in adding helper functions which the user can handle with one line of code.
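If the keys should be something more meaningful than list positions, the same kind of one-liner works with labels supplied by the caller. In this sketch the labels list and the string segments are made-up placeholders; the labels must be given in the same order as the docs passed to the cutter.

```python
# Hypothetical labels, in the same order as the docs passed to the cutter
labels = ["Austen_Pride", "Austen_Sense"]

# Stand-in for the nested cutter output
output = [["p1", "p2", "p3"], ["s1", "s2", "s3"]]

if len(labels) != len(output):
    raise ValueError("Need exactly one label per source doc")

# Map each label to that doc's list of segments
segments_by_label = dict(zip(labels, output))
```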
The most commonly-used Cutter methods will likely be:
split(): Splits based on a fixed number of tokens
splitn(): Splits based on a fixed number of segments
The versions of these methods in the Ginsu class take a spaCy doc or list of spaCy docs as input and return lists of spaCy docs as output. In the Machete class, the input and output are strings.
The output looks like this, where each segment is either a spaCy doc or a string, depending on which class you use:
[[segment1, segment2, ...], [segment1, segment2, ...], ...]
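To make the two splitting strategies and the nested output shape concrete, here is a minimal pure-Python sketch that works on plain lists of tokens. The function names split_by_size and split_into_n are hypothetical stand-ins, not the Lexos API, and the way a short or uneven final segment is handled here is an assumption rather than a description of what Ginsu or Machete actually do.

```python
from typing import List, Sequence


def split_by_size(tokens: Sequence[str], n: int) -> List[List[str]]:
    """Cut tokens into consecutive segments of at most n tokens each."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [list(tokens[i:i + n]) for i in range(0, len(tokens), n)]


def split_into_n(tokens: Sequence[str], n: int) -> List[List[str]]:
    """Cut tokens into exactly n consecutive segments of near-equal size."""
    if n < 1:
        raise ValueError("n must be at least 1")
    base, extra = divmod(len(tokens), n)
    segments, start = [], 0
    for k in range(n):
        # The first `extra` segments take one additional token (an assumption)
        end = start + base + (1 if k < extra else 0)
        segments.append(list(tokens[start:end]))
        start = end
    return segments


# Two toy "documents", already tokenized
docs = [
    "it is a truth universally acknowledged".split(),
    "the family of dashwood had long been settled in sussex".split(),
]

# One inner list of segments per input document
by_size = [split_by_size(doc, 4) for doc in docs]
by_count = [split_into_n(doc, 3) for doc in docs]
```

In both cases the outer list has one entry per input document, so by_size[0] and by_count[0] hold the segments of the first document.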
Each internal list of segments corresponds to an original text or doc. This is logical, but it can be confusing, especially if your starting point is a single document, where you now have to reference its segments as output[0]. Depending on what you intend to do with the segments, you may want to flatten the list anyway.
So my question is whether we should have some kind of argument to return the data in different formats, or possibly a helper method to convert this data structure into something more intuitive and easy to manipulate without multiple for loops.
A second concern is that the list-of-lists structure only preserves the relationship between segments and their source texts/documents by way of list indices. Would it be better to use a dict format (or even a dataclass)?
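For discussion, here is one way the dataclass idea could look. This is only a sketch: the Segment class, its field names, and the to_records helper are hypothetical and not part of Lexos; content could hold either a spaCy doc or a string.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Segment:
    """One cut segment plus the position of its source doc."""
    doc_index: int      # position of the source doc in the input list
    segment_index: int  # position of this segment within that doc
    content: Any        # a spaCy doc or a string, depending on the cutter


def to_records(output: List[List[Any]]) -> List[Segment]:
    """Flatten nested cutter output while keeping source positions."""
    return [
        Segment(i, j, seg)
        for i, segs in enumerate(output)
        for j, seg in enumerate(segs)
    ]


# Stand-in for the nested cutter output
records = to_records([["a1", "a2"], ["b1"]])

# Segments of a single source doc can be recovered with a simple filter
first_doc = [r.content for r in records if r.doc_index == 0]
```

A flat list of records like this avoids nested loops while still carrying the segment-to-source relationship explicitly, at the cost of one extra type for users to learn.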