I've done a few more experiments, which made the situation a bit clearer to me.
I applied the attached patch to my local copy. This patch allows CLD2 to report
any chunk size up to 1 GiB, instead of being artificially limited to 64 KiB.
Then I ran it over ~56,000 web pages (~4 GB of text). The patch eliminated all
the cases where one chunk ended before the next one started. I also tried
running the stock CLD2, but ignoring the 'length' field and instead treating
each chunk as ending at the point where the next chunk begins. It turned out
that both approaches (patched CLD2, and stock CLD2 while ignoring the length
field) produced identical output on my test corpus, which I take as evidence
that the artificially limited length field is the *only* reason we see "gaps"
in the middle of the chunk output.
So as a workaround, my plan (and recommendation to any other users who stumble
across this) is: (1) If the first chunk begins after position 0, then pretend
there's an extra chunk covering positions 0 through <first chunk.offset> with
language tag "un". (2) Ignore the length fields in all cases; they can only
mislead you. Instead look at the "offset" fields, and treat each chunk as
running from its offset until the offset of the next chunk (or the end of the
file for the last chunk).
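For concreteness, here is a minimal C++ sketch of that workaround. The Span
type and RebuildSpans helper are hypothetical names for illustration, and it
assumes the per-chunk offsets and language tags have already been extracted
from CLD2's output:

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical span type, for illustration only.
struct Span {
  std::size_t begin;  // inclusive byte offset
  std::size_t end;    // exclusive byte offset
  std::string lang;   // language tag, e.g. "en" or "un"
};

// Rebuild contiguous spans from the per-chunk offsets reported by CLD2,
// ignoring the (possibly truncated) length fields entirely.
std::vector<Span> RebuildSpans(const std::vector<std::size_t>& offsets,
                               const std::vector<std::string>& langs,
                               std::size_t text_len) {
  std::vector<Span> spans;
  // (1) If the first chunk starts after position 0, prepend an "un" span.
  if (!offsets.empty() && offsets[0] > 0) {
    spans.push_back({0, offsets[0], "un"});
  }
  // (2) Each chunk runs from its own offset to the next chunk's offset,
  //     or to the end of the buffer for the last chunk.
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    std::size_t end = (i + 1 < offsets.size()) ? offsets[i + 1] : text_len;
    spans.push_back({offsets[i], end, langs[i]});
  }
  return spans;
}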
It would be nice if these changes were made directly in CLD2, though, to avoid
the need for such workarounds.
Original comment by njs@vorpus.org
on 4 Jul 2014 at 2:30
This seems like a reasonable request to me in principle. My first guess was
that the limit comes from using 'int' as the buffer length in
DetectLanguageSummaryV2, but CLD2 defines int to be int32, so that can't be it.
The motivation may simply have been saving space, I don't know.
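A 16-bit chunk-length field would explain the cap exactly, since it saturates
at 65,535 bytes, just under 64 KiB (the eventual fix below confirms the field
was widened from 16 to 32 bits). A reconstructed illustration, not CLD2's
verbatim header:

#include <cstdint>
#include <limits>

// Reconstructed for illustration; field names mirror CLD2's ResultChunk,
// but this is not the library's actual definition.
struct ChunkWith16BitLength {
  int offset;           // starting byte offset in the input buffer
  std::uint16_t bytes;  // chunk length: cannot exceed 65,535 (~64 KiB)
  std::uint16_t lang1;  // detected language code for this chunk
};

static_assert(std::numeric_limits<std::uint16_t>::max() == 65535,
              "a 16-bit length field caps every chunk at ~64 KiB");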
Most likely this is just a use case that hasn't been prevalent enough to be a
problem. I think the proposed patch is entirely reasonable. I'll ping Dick and
see if he has any objections to putting this in. I don't.
Original comment by andrewha...@google.com
on 25 Jul 2014 at 10:08
PS - Thank you for taking the time to make and upload a patch.
Original comment by andrewha...@google.com
on 25 Jul 2014 at 10:08
You're welcome, and hope it helps :-)
Note that the patch only implements one half of my suggestion (stopping the
spans from being truncated too early), not the other half (inserting an "un"
span at the beginning of files that begin with punctuation/whitespace).
Original comment by njs@vorpus.org
on 26 Jul 2014 at 9:52
I have pinged Dick about this and I believe (though I can't speak for him
directly) that he's also in support of this. Hopefully we'll get this fixed
shortly; thanks again for the report.
Original comment by andrewha...@google.com
on 27 Oct 2014 at 8:42
Fixed in svn revisions 170-176. The ResultChunk output now covers all the bytes
of the input buffer, with the byte length field increased to 32 bits and the
endpoints explicitly covered. Thank you for finding this.
Original comment by dsi...@google.com
on 28 Oct 2014 at 9:13
Original issue reported on code.google.com by
njs@vorpus.org
on 1 Jul 2014 at 11:04