ml4ai / automates

AutoMATES: Automated Model Assembly from Text, Equations, and Software
https://ml4ai.github.io/automates

Combining COSMOS blocks makes mention location information less specific #267

Open maxaalexeeva opened 2 years ago

maxaalexeeva commented 2 years ago

In PR https://github.com/ml4ai/automates/pull/263, we combine COSMOS blocks so that paragraphs are not split up (splits happen at the end of a column in two-column papers and at the end of a page). When we combine blocks, the location of extracted mentions becomes less specific: instead of saying that Mention 1 comes from p. 1, block 1, we now say that it comes from p. 1, blocks 1-2, so the mention may be in block 1, in block 2, or split across the two. Keeping track of each block's length in characters, together with the mention's character offset within the combined block content, can help narrow the location down.
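A minimal sketch of the offset idea, not the code from the PR. It assumes the combined content is built by joining the original block texts with a fixed-length separator (a single space here) and that each mention carries start/end character offsets into the combined text. The names `block_starts` and `locate_mention` are hypothetical, and block indices are 0-based.

```python
from bisect import bisect_right
from typing import List, Sequence


def block_starts(block_lengths: Sequence[int], sep_len: int = 1) -> List[int]:
    """Start offset of each original block inside the combined text.

    Assumes blocks were joined with a separator of `sep_len` characters.
    """
    starts = []
    pos = 0
    for length in block_lengths:
        starts.append(pos)
        pos += length + sep_len
    return starts


def locate_mention(
    block_lengths: Sequence[int], start: int, end: int, sep_len: int = 1
) -> List[int]:
    """Return 0-based indices of the blocks spanned by [start, end).

    An offset that falls on a separator is attributed to the preceding block.
    """
    if not block_lengths:
        raise ValueError("at least one block is required")
    starts = block_starts(block_lengths, sep_len)
    total = starts[-1] + block_lengths[-1]
    if not (0 <= start < end <= total):
        raise ValueError("mention offsets fall outside the combined text")
    first = bisect_right(starts, start) - 1
    last = bisect_right(starts, end - 1) - 1
    return list(range(first, last + 1))


# Illustrative data only: two blocks that belong to one paragraph.
blocks = ["The model uses", "a rate parameter beta."]
combined = " ".join(blocks)
lengths = [len(b) for b in blocks]

inside = "rate parameter"
s = combined.index(inside)
inside_blocks = locate_mention(lengths, s, s + len(inside))

split = "uses a rate"
s = combined.index(split)
split_blocks = locate_mention(lengths, s, s + len(split))
```

Here `inside_blocks` contains only the second block, while `split_blocks` contains both, which is the case where the mention really is split across blocks and the range "block 1-2" is the most specific answer available.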

Note: Currently, COSMOS combines some paragraphs into longer blocks. This needs to be discussed with UW.