hamishmorgan / ERL

Entity Recognition and Linking

Chunking of whitespace #7

Open language-engineering opened 11 years ago

language-engineering commented 11 years ago

Currently, all whitespace between chunked entities is considered a chunk of its own (when returned at the end of the pipeline in JSON format).

This is probably not desirable. Perhaps a better output would ignore between-chunk whitespace and label each chunk with its start and end indices in the original sentence (ignoring trailing whitespace and counting each Unicode code point as a single character).
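To make the proposal concrete, here is a minimal Python sketch of the conversion. The chunk layout (a list of dicts with `text` and `link` keys) is an assumption for illustration only; it is not ERL's actual JSON schema, and `to_offset_chunks` is a hypothetical helper. Python strings are indexed by code point, so `len` gives the character count the proposal asks for.

```python
def to_offset_chunks(chunks):
    """Drop whitespace-only chunks and label the rest with [start, end) offsets.

    `chunks` is assumed to be the pipeline output in document order, so that
    concatenating every chunk's text reproduces the original sentence.
    Offsets count Unicode code points; `end` is exclusive.
    """
    result = []
    position = 0  # code point offset of the current chunk in the original text
    for chunk in chunks:
        text = chunk["text"]
        start = position
        position += len(text)
        if not text.strip():
            continue  # between-chunk whitespace: skipped, but still counted above
        trimmed = text.rstrip()  # ignore trailing whitespace inside a chunk
        result.append({
            "text": trimmed,
            "start": start,
            "end": start + len(trimmed),
            "link": chunk.get("link"),
        })
    return result


# Hypothetical pipeline output for the sentence "Paris is in France".
example = [
    {"text": "Paris", "link": "http://example.org/Paris"},
    {"text": " "},
    {"text": "is in"},
    {"text": " "},
    {"text": "France", "link": "http://example.org/France"},
]
offsets = to_offset_chunks(example)
```

Because the offsets point back into the original sentence, a client that still wants the whitespace can recover it from the source text rather than from the chunk list.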

hamishmorgan commented 11 years ago

I'm not convinced about this. When linking plain text it's generally desirable to get the full document text back, including whitespace. It seems presumptuous to insist that whitespace is redundant in a client application, whether it occurs between chunks or at the end. So on the one hand I see a decent argument for keeping whitespace intact, and on the other I see no reason to remove it.

Could you provide your reasoning behind why the current behaviour is "not desirable"?

language-engineering commented 11 years ago

I didn't see this email! Haven't you already made it so it no longer includes whitespace in the output?

Why is it desirable? My (perhaps flawed) way of seeing it is that the chunking and linking are providing a structure and attributes. Whitespace will never be linked, and is only representative of gaps between chunks (sometimes). So in a chunking structure it has no place.

Anything that uses the chunks for machine learning will most likely have to remove these whitespace chunks. The fact that the first step in most machine learning pipelines is to tokenise (thereby removing spaces) tells me how unimportant the whitespace is.

I could imagine an application (perhaps like your webapp) that simply wants to display the original text with the links shown as hyperlinks. This is where spaces are important. But given that this is a display issue, perhaps the answer isn't to burden the chunk structure with whitespace chunks, but instead to include information such as the start and end index of each chunk in the original text. This way, the webapp displays the original text and uses the indices to place the hyperlinks.
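As a rough sketch of that display step, assuming each chunk carries hypothetical `start`, `end` (exclusive) and `link` fields measured in code points, and that chunks do not overlap, a webapp could rebuild the hyperlinked text like this. `render_links` is an illustrative name, not part of ERL.

```python
from html import escape


def render_links(text, chunks):
    """Return `text` as HTML, wrapping each linked chunk in an <a> element.

    Whitespace and any other text between chunks is copied straight from the
    original string, so the chunk list never needs whitespace entries.
    """
    parts = []
    cursor = 0  # offset of the first character not yet written out
    for chunk in sorted(chunks, key=lambda c: c["start"]):
        # Copy the gap before this chunk (usually whitespace) unchanged.
        parts.append(escape(text[cursor:chunk["start"]]))
        span = escape(text[chunk["start"]:chunk["end"]])
        link = chunk.get("link")
        if link:
            parts.append('<a href="{}">{}</a>'.format(escape(link, quote=True), span))
        else:
            parts.append(span)  # chunked but unlinked text stays plain
        cursor = chunk["end"]
    parts.append(escape(text[cursor:]))  # trailing text after the last chunk
    return "".join(parts)


sentence = "Paris is in France"
linked = render_links(sentence, [
    {"start": 0, "end": 5, "link": "http://example.org/Paris"},
    {"start": 6, "end": 11, "link": None},
    {"start": 12, "end": 18, "link": "http://example.org/France"},
])
```

The full document text, whitespace included, survives in the rendered output even though no chunk represents it.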

On 10 Apr 2013, at 15:44, Hamish Morgan wrote:

I'm not convinced about this. When linking plain text it's generally desirable to get the full document text back, including whitespace. It seems presumptuous to insist that whitespace is redundant in a client application, whether it occurs between chunks or at the end. So on the one hand I see a decent argument for keeping whitespace intact, and on the other I see no reason to remove it.

Could you provide your reasoning behind why the current behaviour is "not desirable"?
