calculation of boilerplate

Jiawen-Yan1 commented 2 years ago

The original language in Lang and Stice-Lawrence (2015, JAE) is "5 times per document" (p. 133), not 5 times across all documents. Consider changing the code?

Also, this line is extremely slow. Consider switching to a set intersection?

Jiawen-Yan1 commented 2 years ago

if any(x in ngram[i][j] for x in fndf.unique_ngrams):
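A minimal sketch of the suggestion above, assuming that ngram[i][j] is a list of hashable n-grams (strings or tuples) for one sentence and that fndf.unique_ngrams can be turned into a set once, outside the loop. The function name and the sample data are hypothetical and not part of the package. If ngram[i][j] were a plain string, the original check would be a substring test and this rewrite would not be equivalent.

```python
def contains_boilerplate(sentence_ngrams, boilerplate_ngrams):
    """Return True if the sentence shares at least one n-gram with the boilerplate set.

    sentence_ngrams: iterable of hashable n-grams for one sentence.
    boilerplate_ngrams: a set or frozenset, built once before looping.
    """
    # isdisjoint uses hash lookups and stops at the first shared element,
    # so the cost per sentence is roughly proportional to its n-gram count.
    return not boilerplate_ngrams.isdisjoint(sentence_ngrams)


# Hypothetical data standing in for fndf.unique_ngrams and ngram[i][j].
boilerplate = frozenset({"in the ordinary course", "the ordinary course of"})
sentence_a = ["we operate in the", "in the ordinary course"]
sentence_b = ["revenue grew this year", "grew this year by"]

flag_a = contains_boilerplate(sentence_a, boilerplate)
flag_b = contains_boilerplate(sentence_b, boilerplate)
```

The original any(...) scans the whole list of unique n-grams for every sentence, while the set version does one hash lookup per n-gram in the sentence.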

jinhangjiang commented 2 years ago

Yes, you are right. The authors state that they only consider boilerplate phrases that occur (1) in at least 30% of the documents or (2) on average 5 times per document. Option (2) will be included in a future update.
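For illustration, the two criteria could be sketched as follows. This is not the package's code: the function name, parameter names, and input layout (one list of n-grams per document) are assumptions, and "on average 5 times per document" is read here as total occurrences divided by the number of documents.

```python
from collections import Counter


def select_boilerplate(docs_ngrams, min_doc_share=0.30, min_avg_count=5):
    """Return n-grams meeting criterion (1) or (2) as described in the thread.

    docs_ngrams: list of documents, each a list of hashable n-grams.
    """
    n_docs = len(docs_ngrams)
    if n_docs == 0:
        return set()

    doc_freq = Counter()     # number of documents containing each n-gram
    total_count = Counter()  # total occurrences across all documents
    for doc in docs_ngrams:
        total_count.update(doc)
        doc_freq.update(set(doc))  # count each n-gram once per document

    return {
        gram
        for gram in total_count
        if doc_freq[gram] / n_docs >= min_doc_share      # criterion (1)
        or total_count[gram] / n_docs >= min_avg_count   # criterion (2)
    }


# Hypothetical corpus of four documents.
docs = [["a"] * 20 + ["b"], ["b"], ["c"], ["b"]]
selected = select_boilerplate(docs)
```

In this toy corpus "a" qualifies only through criterion (2) (20 occurrences over 4 documents), "b" only through criterion (1) (3 of 4 documents), and "c" meets neither.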

Currently, min_doc is used to filter out uncommon boilerplate phrases. For example, if you have 15 documents, you can leave it at the default of 5 (roughly 30%). If you have 30 documents, you can increase it to 10. In a future update, min_doc will also accept a percentage, but as of now the package only supports integers.
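Until percentages are supported, the integer could be derived from the corpus size by hand. A small sketch, where the helper name is hypothetical and rounding up is an assumption about how "at least 30%" should be read:

```python
def min_doc_from_percent(n_docs, percent=30):
    """Smallest document count that is at least `percent` % of n_docs."""
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = (n_docs * percent + 99) // 100
    return max(1, needed)


threshold_15 = min_doc_from_percent(15)
threshold_30 = min_doc_from_percent(30)
```

With a strict 30% cutoff this gives 5 for 15 documents and 9 for 30 documents; the value of 10 mentioned above is a slightly stricter, rounder choice (about one third).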

Also, thank you for the code improvement suggestion. I will make some changes there as well.

jinhangjiang / morethansentiments

calculation of boilerplate #1