mandarjoshi90 / coref

BERT for Coreference Resolution
Apache License 2.0

Sentence index when splitting long sentences into non-overlapping chunks #98

Open nikarjunagi opened 2 years ago

nikarjunagi commented 2 years ago

Hi @mandarjoshi90, thanks very much for this awesome library.

Quick question: I am running coreference resolution on a corpus where many tokenized sentences are longer than max_segment_len (for example, spanbert_base with max_segment_len = 384). I am handling this by splitting each such sentence into multiple non-overlapping segments.

My questions:

  1. Is this a valid approach? (in line with your response to another question here: https://github.com/mandarjoshi90/coref/issues/33)
  2. Let’s say the sentence index of a sample long sentence is X. When the tokens of this sentence are split across two segments (S1 and S2), should the sentence index for the tokens in both S1 and S2 be X? Or does this need to be handled differently?
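
To make question 2 concrete, here is a minimal sketch of the splitting I have in mind. It is my own illustration, not code from this repository: `chunk_sentences` and its return format are hypothetical, it treats max_segment_len as the budget for content tokens only (no special tokens), and it assumes the sentence index is tracked per token, so every piece of a split sentence keeps index X. Whether that assumption is correct is exactly what I am asking.

```python
from typing import List, Tuple


def chunk_sentences(
    sentences: List[List[str]], max_segment_len: int
) -> Tuple[List[List[str]], List[List[int]]]:
    """Pack tokenized sentences into non-overlapping segments.

    Hypothetical helper, not part of the coref repository.
    Returns (segments, sentence_ids), where sentence_ids[i][j] is the
    index of the sentence that token segments[i][j] came from.
    """
    segments: List[List[str]] = []
    sentence_ids: List[List[int]] = []
    cur_tokens: List[str] = []
    cur_ids: List[int] = []

    def flush() -> None:
        # Close the current segment and start an empty one.
        nonlocal cur_tokens, cur_ids
        if cur_tokens:
            segments.append(cur_tokens)
            sentence_ids.append(cur_ids)
        cur_tokens, cur_ids = [], []

    for sent_idx, sent in enumerate(sentences):
        # Prefer to break at a sentence boundary when the sentence
        # does not fit into the space left in the current segment.
        if cur_tokens and len(cur_tokens) + len(sent) > max_segment_len:
            flush()
        # A sentence longer than max_segment_len is cut into pieces;
        # every piece keeps the same sentence index (assumption).
        for start in range(0, len(sent), max_segment_len):
            piece = sent[start:start + max_segment_len]
            if len(cur_tokens) + len(piece) > max_segment_len:
                flush()
            cur_tokens.extend(piece)
            cur_ids.extend([sent_idx] * len(piece))

    flush()
    return segments, sentence_ids


# Toy example with max_segment_len = 3: sentence 1 has 5 tokens,
# so it is split across two segments.
toy = [["a", "b"], ["c", "d", "e", "f", "g"], ["h"]]
segments, sentence_ids = chunk_sentences(toy, max_segment_len=3)
print(segments)      # [['a', 'b'], ['c', 'd', 'e'], ['f', 'g', 'h']]
print(sentence_ids)  # [[0, 0], [1, 1, 1], [1, 1, 2]]
```

In the toy example, the tokens of sentence 1 land in two different segments but all carry sentence index 1, which is the behaviour I am currently assuming.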

Thank you.