navigating-stories / orange-story-navigator

Add-on to the Orange3 data mining toolkit with text processing widgets from the project Navigating Stories
https://research-software-directory.org/projects/navigating-stories

Flavio/issue 15 #28

Closed f-hafner closed 5 months ago

f-hafner commented 5 months ago
Issue

Closes #15

Description of changes

Open questions

The segments are computed with np.array_split. Because the number of segments is now defined globally (for all stories), segment sizes can differ widely between stories when story length varies a lot. Statistical conclusions from comparing segments within a story will therefore be more or less accurate depending on the segment size. One way to deal with this is to show this uncertainty to the user and document it clearly, perhaps with a hint that users should inspect the segment lengths in their stories. Another option is to let the user define the minimum segment size they want, instead of (or in addition to?) the number of segments.
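
To make the concern concrete, here is a minimal sketch of how np.array_split distributes tokens when one segment count is applied to stories of different lengths. The helper names `segment_sizes` and `segments_for_min_size` and the example story lengths are illustrative assumptions, not code from this PR. The second helper sketches one possible reading of the minimum-segment-size idea.

```python
import numpy as np

def segment_sizes(n_tokens, n_segments):
    """Sizes of the chunks np.array_split yields for a story of n_tokens."""
    return [len(chunk) for chunk in np.array_split(np.arange(n_tokens), n_segments)]

def segments_for_min_size(n_tokens, min_size, max_segments):
    """Hypothetical rule: the largest segment count (capped at max_segments)
    for which every segment has at least min_size tokens, provided the
    story itself has at least min_size tokens."""
    return max(1, min(max_segments, n_tokens // min_size))

print(segment_sizes(103, 5))  # [21, 21, 21, 20, 20]
print(segment_sizes(7, 5))    # [2, 2, 1, 1, 1]
print(segment_sizes(3, 5))    # [1, 1, 1, 0, 0]  (two empty segments)

n = segments_for_min_size(7, min_size=3, max_segments=5)
print(n, segment_sizes(7, n))  # 2 [4, 3]
```

With a global count of 5, a 7-token story gets segments of one or two tokens, and a 3-token story gets empty segments, which is the kind of case a hint in the documentation would need to cover.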

Includes
f-hafner commented 5 months ago

Instead of a new dataframe, store the segment_id in a new column in the dataframe with the tags.
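
A rough sketch of what this could look like, assuming the tag dataframe has one row per token and a story identifier column. The column names `story_id` and `token` and the function `add_segment_id` are hypothetical and not taken from the repository.

```python
import numpy as np
import pandas as pd

def add_segment_id(tagged, n_segments, story_col="story_id"):
    """Return a copy of `tagged` with a `segment_id` column, computed per story."""
    out = tagged.copy()
    segment_id = np.empty(len(out), dtype=int)
    # .indices maps each story to the integer row positions of its tokens
    for positions in out.groupby(story_col, sort=False).indices.values():
        for i, chunk in enumerate(np.array_split(positions, n_segments)):
            segment_id[chunk] = i
    out["segment_id"] = segment_id
    return out

tags = pd.DataFrame({
    "story_id": ["a"] * 5 + ["b"] * 3,
    "token": ["t%d" % i for i in range(8)],
})
result = add_segment_id(tags, n_segments=2)
print(result["segment_id"].tolist())  # [0, 0, 0, 1, 1, 0, 0, 1]
```

Because the column is filled by row position, the rows stay in their original order and the output remains a single dataframe.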

f-hafner commented 5 months ago

The output of the tagger now differs from the output without story segmentation: the order of rows is different. @kodymoodley, if this is an issue, let me know and I can try to fix it.

kodymoodley commented 5 months ago

> The output of the tagger now differs from the output without story segmentation: the order of rows is different. @kodymoodley, if this is an issue, let me know and I can try to fix it.

@f-hafner By 'output' do you mean the dataframe? And by 'differs' do you mean that the only difference is the additional column indicating the story segment number? If so, there is no issue. Just to be clear, the intention (as per our offline discussion yesterday) is still to retain a single dataframe as the output of the tagger, right?