I've done a few more experiments, which made the situation a bit clearer to me.
I applied the attached patch to my local copy. This patch allows CLD2 to report
any chunk size up to 1 GiB, instead of being artificially limited to 64 KiB.
Then I ran it over ~56,000 web pages (~4 GB of text). The patch eliminated all
the cases where one chunk ended before the next one started. I also tried
running the stock CLD2, but ignoring the 'length' field and instead treating
each chunk as ending at the point where the next chunk begins. It turned out
that both approaches (patched CLD2, and stock CLD2 while ignoring the length
field) produced identical output on my test corpus, which I take as evidence
that the artificially limited length field is the *only* reason we see "gaps"
in the middle of the chunk output.
So as a workaround, my plan (and recommendation to any other users who stumble
across this) is: (1) If the first chunk begins after position 0, then pretend
there's an extra chunk covering positions 0 through <first chunk.offset> with
language tag "un". (2) Ignore the length fields in all cases; they can only
mislead you. Instead look at the "offset" fields, and treat each chunk as
running from its offset until the offset of the next chunk (or the end of the
file for the last chunk).
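For concreteness, here is a minimal C++ sketch of that workaround. The Span
type and RebuildSpans helper are hypothetical names for illustration, and it
assumes the per-chunk offsets and language tags have already been extracted
from CLD2's output:

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical span type, for illustration only.
struct Span {
  std::size_t begin;  // inclusive byte offset
  std::size_t end;    // exclusive byte offset
  std::string lang;   // language tag, e.g. "en" or "un"
};

// Rebuild contiguous spans from the per-chunk offsets reported by CLD2,
// ignoring the (possibly truncated) length fields entirely.
std::vector<Span> RebuildSpans(const std::vector<std::size_t>& offsets,
                               const std::vector<std::string>& langs,
                               std::size_t text_len) {
  std::vector<Span> spans;
  // (1) If the first chunk starts after position 0, prepend an "un" span.
  if (!offsets.empty() && offsets[0] > 0) {
    spans.push_back({0, offsets[0], "un"});
  }
  // (2) Each chunk runs from its own offset to the next chunk's offset,
  //     or to the end of the buffer for the last chunk.
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    std::size_t end = (i + 1 < offsets.size()) ? offsets[i + 1] : text_len;
    spans.push_back({offsets[i], end, langs[i]});
  }
  return spans;
}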
It would be nice if these changes were made directly in CLD2, though, to avoid
the need for such workarounds.
Original comment by njs@vorpus.org
on 4 Jul 2014 at 2:30
This seems like a reasonable request to me in principle. My first guess was
that the limit comes from using 'int' as the buffer length in
DetectLanguageSummaryV2, but CLD2 defines int to be int32, so that can't be it.
The motivation may simply have been saving space, I don't know.
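A 16-bit chunk-length field would explain the cap exactly, since it saturates
at 65,535 bytes, just under 64 KiB (the eventual fix below confirms the field
was widened from 16 to 32 bits). A reconstructed illustration, not CLD2's
verbatim header:

#include <cstdint>
#include <limits>

// Reconstructed for illustration; field names mirror CLD2's ResultChunk,
// but this is not the library's actual definition.
struct ChunkWith16BitLength {
  int offset;           // starting byte offset in the input buffer
  std::uint16_t bytes;  // chunk length: cannot exceed 65,535 (~64 KiB)
  std::uint16_t lang1;  // detected language code for this chunk
};

static_assert(std::numeric_limits<std::uint16_t>::max() == 65535,
              "a 16-bit length field caps every chunk at ~64 KiB");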
Most likely this is just a use case that hasn't been prevalent enough to be a
problem. I think the proposed patch is entirely reasonable. I'll ping Dick and
see if he has any objections to putting this in. I don't.
Original comment by andrewha...@google.com
on 25 Jul 2014 at 10:08
PS - Thank you for taking the time to make and upload a patch.
Original comment by andrewha...@google.com
on 25 Jul 2014 at 10:08
You're welcome, and hope it helps :-)
Note that the patch only implements one half of my suggestion (stopping the
spans from being truncated too early), not the other half (inserting an "un"
span at the beginning of files that begin with punctuation/whitespace).
Original comment by njs@vorpus.org
on 26 Jul 2014 at 9:52
I have pinged Dick about this and I believe (though I can't speak for him
directly) that he's also in support of this. Hopefully we'll get this fixed
shortly; thanks again for the report.
Original comment by andrewha...@google.com
on 27 Oct 2014 at 8:42
Fixed in svn revisions 170-176. The ResultChunk output now covers all the bytes
of the input buffer, with the byte length field increased to 32 bits and the
endpoints explicitly covered. Thank you for finding this.
Original comment by dsi...@google.com
on 28 Oct 2014 at 9:13
Original issue reported on code.google.com by
njs@vorpus.org
on 1 Jul 2014 at 11:04