Sentence index when splitting long sentences into non-overlapping chunks

Hi @mandarjoshi90, thanks much for this awesome library.

Quick question: I am attempting coreference resolution on a corpus where many (tokenized) sentences contain more tokens than max_segment_len (for example, max_segment_len = 384 for spanbert_base). I am handling this by splitting each such sentence into multiple non-overlapping segments.

My questions:

1. Is this a valid approach? (It seems in line with your response to another question here: https://github.com/mandarjoshi90/coref/issues/33)
2. Let’s say the sentence index of a sample long sentence is X. When the tokens of this sentence are split across two segments (S1 and S2), should the sentence index for the tokens in both S1 and S2 be X, or does this need to be handled differently?
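
To make the second question concrete, here is a minimal sketch of the chunking I have in mind. The helper `split_long_sentence` is hypothetical and not part of this repository. It assumes every chunk keeps the original sentence index X, which is exactly the part I am unsure about, and it ignores any special tokens the model may add to each segment.

```python
from typing import List, Tuple


def split_long_sentence(
    tokens: List[str], sent_idx: int, max_segment_len: int
) -> List[Tuple[List[str], List[int]]]:
    """Split one tokenized sentence into non-overlapping chunks.

    Hypothetical helper, not the repository's API. Each chunk is paired
    with a per-token list of sentence indices, and every token keeps the
    index of the sentence it came from (the assumption being asked about).
    """
    if max_segment_len <= 0:
        raise ValueError("max_segment_len must be positive")

    chunks = []
    # Step through the sentence in strides of max_segment_len, so chunks
    # never overlap and only the last one may be shorter.
    for start in range(0, len(tokens), max_segment_len):
        chunk = tokens[start:start + max_segment_len]
        chunks.append((chunk, [sent_idx] * len(chunk)))
    return chunks


# Toy example: a 10-token sentence with index X = 3 and a limit of 4 tokens.
example_tokens = [f"w{i}" for i in range(10)]
example_chunks = split_long_sentence(example_tokens, sent_idx=3, max_segment_len=4)
for chunk_tokens, chunk_sentence_ids in example_chunks:
    print(chunk_tokens, chunk_sentence_ids)
```

With this scheme, S1 and S2 (and any further chunks) all carry index X, so the sentence boundary is still recoverable from the indices alone.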

Thank you.
