It seems like LingMessCoref comes with a max_doc_len of 4096. Is there any way to circumvent this so that it works on documents of any length?
Edit: I am mainly trying to get the character spans of clusters. I tried to work around this limit myself by manually chunking the document and piecing the detected clusters back together. My chunks are about 4,000 tokens each, with an overlap of 2,000 tokens. To build the overall clusters, I take the coreference pairs from each chunk and build a graph in which those pairs are edges; each connected component then becomes a document-level cluster (a rough sketch of this merging step follows the list below). However, this approach does not work well because:
- the model seems to perform worse on document chunks (especially those later in a document), probably because information from the earlier sections is already lost?
- all it takes is one mistake from the model (e.g. in Chunk 1, the model thinks a "he" is referring to "John", but in Chunk 2, the model thinks the same "he" is referring to "Peter") for mentions of separate entities to be lumped together.
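For reference, here is roughly what my merging step looks like. This is a minimal sketch: `merge_chunk_clusters` and the input format are just illustrative, not anything from the library, and it assumes each chunk's character spans have already been shifted by the chunk's starting offset so that all spans are in document coordinates.

```python
from collections import defaultdict


def merge_chunk_clusters(chunk_clusters):
    """Merge per-chunk coreference clusters into document-level clusters.

    chunk_clusters: one list of clusters per chunk, where each cluster is a
    list of (start_char, end_char) spans already shifted into document-level
    coordinates (i.e. the chunk's starting offset has been added).
    Returns the merged clusters as connected components of the mention graph.
    """
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for clusters in chunk_clusters:
        for cluster in clusters:
            spans = [tuple(s) for s in cluster]
            # Treat each chunk-level cluster as a chain of edges; unioning
            # consecutive mentions is enough to connect the whole cluster.
            for a, b in zip(spans, spans[1:]):
                union(a, b)
            if len(spans) == 1:
                parent.setdefault(spans[0], spans[0])

    components = defaultdict(list)
    for span in parent:
        components[find(span)].append(span)
    return [sorted(spans) for spans in components.values()]


# Toy example: "he" at (120, 122) appears in the overlap of both chunks,
# so the two chunk-level clusters collapse into one document-level cluster.
chunk_clusters = [
    [[(0, 4), (120, 122)]],      # chunk 1: "John" ... "he"
    [[(120, 122), (300, 305)]],  # chunk 2: "he" ... "Peter"
]
print(merge_chunk_clusters(chunk_clusters))
```

The toy example at the bottom also shows the second problem: because the shared "he" links the two chunk-level clusters, "John" and "Peter" end up lumped into a single cluster.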