From Oren's email:
one quick question, in the 10-fold cross validation, did we make sure
that there are no shared people between two sections? i mean, if we
just divide into 10 sections according to tweets, then we may have
tweet1 and tweet2 of the same congressmanA in two different sections.
in this case, we may get a good result in the cross validation simply
because the classifier can find similarity between tweets of
congressmanA (e.g. if tweet1 is in the verification section, and
tweet2 is in one of the 9 training sections, it may simply learn that
tweet1 and tweet2 are similar in language and we'll get a good
misleading score...).
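This needs to be fixed with custom code that distributes tweets across the 10 folds by author rather than by tweet, so that all tweets from one person end up in the same fold.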
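
A minimal sketch of such a split, written in Python with only the standard library. The function name `group_k_fold`, the `authors` list (one author label per tweet), and the greedy balancing strategy are assumptions made for illustration, not the project's actual code. The idea is to assign whole authors to folds, so no author ever appears on both the training and the verification side.

```python
from collections import Counter


def group_k_fold(authors, k=10):
    """Yield (train_idx, test_idx) pairs in which no author spans both sides.

    authors[i] is the author of tweet i (hypothetical input layout).
    """
    counts = Counter(authors)
    if len(counts) < k:
        raise ValueError("need at least k distinct authors")

    # Greedy balancing: place the most prolific authors first, each into
    # the fold that currently holds the fewest tweets.
    fold_sizes = [0] * k
    fold_of = {}
    for author, n in counts.most_common():
        smallest = min(range(k), key=fold_sizes.__getitem__)
        fold_of[author] = smallest
        fold_sizes[smallest] += n

    # Each fold serves once as the verification set; the rest is training.
    for fold in range(k):
        test = [i for i, a in enumerate(authors) if fold_of[a] == fold]
        train = [i for i, a in enumerate(authors) if fold_of[a] != fold]
        yield train, test


if __name__ == "__main__":
    authors = ["A", "A", "B", "C", "C", "C", "D"]
    for train, test in group_k_fold(authors, k=2):
        shared = {authors[i] for i in train} & {authors[i] for i in test}
        print(len(train), len(test), shared)
```

Folds will be only roughly equal in tweet count, since authors are never split. If the project happens to use scikit-learn, `sklearn.model_selection.GroupKFold` offers a comparable grouped split.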
Original issue reported on code.google.com by markus.neubrand on 5 Apr 2011 at 11:55