gregdurrett / berkeley-doc-summarizer

The Berkeley Document Summarizer is a learning-based, single-document summarization system that extracts source document content, exploits syntactic information to compress it, and uses coreference constraints to ensure clarity.
GNU General Public License v3.0

Issue with formats used #4

Open h22roscoe opened 7 years ago

h22roscoe commented 7 years ago

Hi @gregdurrett

I am currently using the main method of the Entity Preprocessing Driver to turn my regular .txt files into the (CoNLL?) format understood by this summarizer. However, the ConllReader class used in the Summarizer class is unable to parse some of the generated lines. The failure is in the assembleConstTree method, because some lines appear to be missing a "*".

Would you be able to shed more light on the CoNLL format that the summarizer is expecting?

Thanks, Harry

h22roscoe commented 7 years ago

OK, I have resolved the issue by making sure the docName has no whitespace characters, but now I get a warning that there are no gold mentions in the document.
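
For anyone hitting the same error, here is a minimal Python sketch of why the whitespace fix plausibly works. It assumes the reader splits each line on whitespace and looks for the parse bit in the sixth column, as in the CoNLL-2011/2012 shared task layout; I have not confirmed that ConllReader does exactly this. The helper names (`sanitize_doc_name`, `make_row`, `parse_bit`) and the sample row are hypothetical and not part of the summarizer.

```python
import re

# Assumed 0-based index of the parse bit (CoNLL-2011/2012 layout).
PARSE_COLUMN = 5


def sanitize_doc_name(name: str) -> str:
    """Collapse each run of whitespace in a document name into one underscore."""
    return re.sub(r"\s+", "_", name.strip())


def make_row(doc_name: str) -> str:
    """Build one hypothetical CoNLL-style row for the token 'Hello'."""
    return f"{doc_name}\t0\t0\tHello\tUH\t(TOP(INTJ*))\t-\t-\t-\t-\t*\t-"


def parse_bit(row: str) -> str:
    """Return the field a whitespace-splitting reader would treat as the parse bit."""
    return row.split()[PARSE_COLUMN]


# A space in the name adds an extra field, so every later column shifts right
# and the POS tag ends up where the parse bit is expected.
bad = parse_bit(make_row("my report"))
good = parse_bit(make_row(sanitize_doc_name("my report")))

print(bad, "*" in bad)
print(good, "*" in good)
```

With a name like "my report" the field in the parse position is the POS tag, which contains no "*". That would match the symptom in assembleConstTree if the reader behaves as assumed.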

gregdurrett commented 7 years ago

Hi Harry,

Glad you figured out the first issue -- guess that should be documented...

The "no gold mentions" warning is normal -- basically this isn't gold coreference data, so we don't expect to have gold labels. (I cribbed a bunch of code from the berkeley-entity system, which was originally a Berkeley coref system that did expect gold coref information everywhere.)

Greg

h22roscoe commented 7 years ago

Thanks Greg