parsing sentences from dataset

sbmaruf commented 6 years ago

https://github.com/glample/tagger/blob/master/loader.py#L8 In this function, while reading sentences from the dataset, you ignore any sentence that starts with DOCSTART. I guess DOCSTART marks the start of a new document. Why are you ignoring the first sentence of a new document? Or did I misunderstand your code?

glample commented 6 years ago

I ignored DOCSTART because this token is not really part of the document; it is not a sentence in which you need to tag named entities. But you can remove this condition; it would not make any difference.

sbmaruf commented 6 years ago

I understand that you are ignoring 'DOCSTART'. But why do you also ignore the sentence after DOCSTART? Assume a dataset like the following (from the Dutch dataset):

-DOCSTART- -DOCSTART- O
De Art O
tekst N O
van Prep O
het Art O
arrest N O
is V O
nog Adv O
niet Adv O
schriftelijk Adj O
beschikbaar Adj O
maar Conj O
het Art O
bericht N O
werd V O
alvast Adv O
bekendgemaakt V O
door Prep O
een Art O
communicatiebureau N O
dat Conj O
Floralux N B-ORG
inhuurde V O
. Punc O

In Prep O
'81 Num O
regulariseert V O
de Art O
toenmalige Adj O
Vlaamse Adj B-MISC
regering N O
de Art O
toestand N O
met Prep O
een Art O
BPA N B-MISC
dat Pron O
het Art O
bedrijf N O
op Prep O
eigen Pron O
kosten N O
heeft V O
laten V O
opstellen V O
. Punc O

In this case, your function 'load_sentences' would not read the sentence

"De tekst van het arrest is nog niet schriftelijk beschikbaar maar het bericht werd alvast bekendgemaakt door een communicatiebureau dat Floralux inhuurde."

Instead, it will start from the second sentence:

"In '81 regulariseert ..."

Is there any reason why you did this?
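To make the behaviour in question concrete, here is a minimal sketch of the logic as described in this thread. It is not the code in loader.py: the function name `load_sentences_sketch` and the two short input lists are hypothetical, and the real function reads from a file rather than from a list of lines. It assumes sentences are separated by empty lines and that a whole block is dropped when its first token contains DOCSTART.

```python
def load_sentences_sketch(lines):
    """Group token lines into sentences separated by empty lines.

    Hypothetical simplification of the behaviour discussed here: a block
    is dropped entirely if its first token contains 'DOCSTART'.
    """
    sentences, sentence = [], []
    for line in lines:
        line = line.rstrip()
        if not line:
            # An empty line closes the current block.
            if sentence:
                if 'DOCSTART' not in sentence[0][0]:
                    sentences.append(sentence)
                sentence = []
        else:
            sentence.append(line.split())
    # Handle a final block that is not followed by an empty line.
    if sentence and 'DOCSTART' not in sentence[0][0]:
        sentences.append(sentence)
    return sentences


# Layout without an empty line after the marker (as in the excerpt above).
no_blank_after_marker = [
    "-DOCSTART- -DOCSTART- O", "De Art O", "tekst N O", "",
    "In Prep O", "'81 Num O", "",
]
# Same tokens, but with an empty line right after the marker.
blank_after_marker = [
    "-DOCSTART- -DOCSTART- O", "",
    "De Art O", "tekst N O", "",
    "In Prep O", "'81 Num O", "",
]

without_blank = load_sentences_sketch(no_blank_after_marker)
with_blank = load_sentences_sketch(blank_after_marker)
print(len(without_blank), without_blank[0][0][0])
print(len(with_blank), with_blank[0][0][0])
```

In the first layout the marker and the first sentence form a single block, so both are discarded together; in the second layout only the marker block is discarded.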

glample commented 6 years ago

Sorry for the delay. In practice I think there is an empty line after each DOCSTART symbol, so if you add an empty line before the first sentence, it will not be skipped. No?

sbmaruf commented 6 years ago

Thanks, glample, for your reply. I also assumed that from the English dataset, but there was no empty line after -DOCSTART- in the Dutch dataset. But I guess this should not change the results too much.
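One possible way to handle both layouts, sketched under assumptions rather than taken from the repository: treat a line that starts with `-DOCSTART-` as a sentence boundary, the same way an empty line is treated, instead of dropping the whole block. The function name `load_sentences_keep_first` and the sample inputs are hypothetical.

```python
def load_sentences_keep_first(lines):
    """Group token lines into sentences, skipping only the marker line.

    Hypothetical variant: a '-DOCSTART-' line acts as a sentence boundary,
    so the sentence that follows it is kept even without an empty line.
    """
    sentences, sentence = [], []
    for line in lines:
        line = line.rstrip()
        if not line or line.startswith('-DOCSTART-'):
            # Empty lines and marker lines both close the current sentence.
            if sentence:
                sentences.append(sentence)
                sentence = []
        else:
            sentence.append(line.split())
    # Keep a final sentence that is not followed by an empty line.
    if sentence:
        sentences.append(sentence)
    return sentences


# Marker directly followed by the first sentence.
dutch_style = [
    "-DOCSTART- -DOCSTART- O", "De Art O", "tekst N O", "",
    "In Prep O", "'81 Num O", "",
]
# Marker followed by an empty line.
blank_line_style = [
    "-DOCSTART- -DOCSTART- O", "",
    "De Art O", "tekst N O", "",
    "In Prep O", "'81 Num O", "",
]

dutch_result = load_sentences_keep_first(dutch_style)
blank_result = load_sentences_keep_first(blank_line_style)
print(dutch_result == blank_result)
```

With this variant both layouts produce the same two sentences, so the result no longer depends on whether the dataset puts an empty line after the marker.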

glample / tagger

parsing sentences from dataset #67