Combining COSMOS blocks makes mention location information less specific

In PR https://github.com/ml4ai/automates/pull/263, we combine COSMOS blocks so that paragraphs are not split up (splits happen at the end of a column in two-column papers and at the end of pages). Combining blocks makes the location of extracted mentions less specific: instead of reporting that Mention 1 comes from p. 1, block 1, we now report that it comes from p. 1, blocks 1-2, so the mention may be located in block 1, in block 2, or be split between the two. Keeping track of the length of each block in characters, together with the character offset of the extraction within the combined block content, can help narrow the location down.
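A minimal sketch of the offset-based narrowing, for discussion only. The function name `locate_mention`, its parameters, and the assumption that blocks are joined with a fixed-length separator (`sep_len`) are all hypothetical; none of them come from the PR or from COSMOS. It takes the character lengths of the original blocks and the mention's character span in the combined text, and returns the 0-based positions of the blocks the mention overlaps.

```python
from bisect import bisect_right


def locate_mention(block_lengths, start, end, sep_len=1):
    """Return 0-based positions of the original blocks overlapped by [start, end).

    block_lengths: character length of each original block, in order.
    start, end:    mention span in the combined text (end is exclusive).
    sep_len:       ASSUMED length of the separator inserted between blocks
                   when they were combined (hypothetical; adjust to match
                   how the blocks are actually joined).
    """
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")

    # Offset at which each original block begins in the combined text.
    starts = []
    pos = 0
    for length in block_lengths:
        starts.append(pos)
        pos += length + sep_len

    # The block containing an offset is the last block starting at or before it.
    # An offset that falls on a separator is attributed to the preceding block.
    first = bisect_right(starts, start) - 1
    last = bisect_right(starts, end - 1) - 1
    return list(range(first, last + 1))


# Two blocks of 40 and 25 characters, joined with a 1-character separator.
lengths = [40, 25]
print(locate_mention(lengths, 10, 20))  # [0]    -> entirely in the first block
print(locate_mention(lengths, 45, 50))  # [1]    -> entirely in the second block
print(locate_mention(lengths, 35, 48))  # [0, 1] -> split between the two blocks
```

The returned positions would still need to be mapped back to the actual page and block identifiers of the combined blocks; that mapping is not shown here.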

Note: Currently, COSMOS combines some paragraphs into longer blocks. This needs to be discussed with UW.
