sildar / potara

Multi-document summarization tool relying on ILP and sentence fusion
MIT License

Need Help on Importing Docs #10

Closed. Tylast closed this issue 4 years ago.

Tylast commented 4 years ago

Hi. I'm interested in trying potara. I've cloned it & imported stop words. Can you elaborate on exactly where the documents go? Do I have to create a particular directory? Looks like they have to be in a .txt format. What about inside that txt file...any formatting requirements? What about the title & article executive summary that usually comes with an article? Can you provide an example with real documents?

Thanks! Ty

sildar commented 4 years ago

Hi! Thanks for your interest in potara.

There is no requirement for the text file. It can contain both the title and the executive summary, but these will be treated as if they were part of the main text.
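Since the file is read as plain text with no special markup, preparing an input file comes down to deciding what goes into it. The sketch below is only an illustration: `write_document` is a hypothetical helper, not part of potara. It writes the optional title and summary ahead of the body, so leaving them out keeps them from being treated as body text.

```python
from pathlib import Path
from typing import Optional


def write_document(path: Path, body: str, title: Optional[str] = None,
                   summary: Optional[str] = None) -> Path:
    """Write one article to a plain UTF-8 .txt file (hypothetical helper).

    Any title or summary passed in is written ahead of the body, so it
    will be read as ordinary text. Leave them as None to keep them out.
    """
    # Keep only the non-empty parts, in title / summary / body order.
    parts = [part.strip() for part in (title, summary, body)
             if part and part.strip()]
    path.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return path
```

For example, `write_document(Path("article1.txt"), body_text)` would store only the body, while passing `title=...` prepends the title as an ordinary first paragraph.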

The Document class takes care of all the preprocessing, but you may create your own class and feed instances of it to the Summarizer. Check the clusterSentences() method in the summarizer to see which fields your custom class needs to provide.

It's been a while since I've worked on this project, and I know that the documentation of the input requirements is a bit lacking. If you want to, don't hesitate to make a pull request.

> Can you elaborate on exactly where the documents go?

Anywhere you want. You just have to give the file path to the Document class.

> Do I have to create a particular directory?

Nope, documents are created one by one, using each filepath.
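Because each document is built from its own file path, the files can live anywhere and only the list of paths matters. As a rough sketch (the helper name `collect_txt_paths` is made up for illustration and is not part of potara), the paths could be gathered from a mix of individual files and directories like this:

```python
from pathlib import Path
from typing import Iterable, List


def collect_txt_paths(locations: Iterable[str]) -> List[str]:
    """Return sorted .txt file paths from a mix of files and directories."""
    found = set()
    for location in locations:
        p = Path(location)
        if p.is_dir():
            # Take the .txt files directly inside the directory (no recursion).
            found.update(str(f) for f in p.glob("*.txt"))
        elif p.is_file() and p.suffix == ".txt":
            found.add(str(p))
    return sorted(found)
```

Each returned path would then be handed to the Document class, one document per file, as described above.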

> Looks like they have to be in a .txt format.

Yes, txt format. I haven't done a lot of tests regarding encoding, but it should work with utf-8.
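Given that UTF-8 is the encoding most likely to work, one cautious option is to re-save any file that is not valid UTF-8 before using it. The sketch below is an assumption-laden illustration: `ensure_utf8` is a hypothetical helper, and the `latin-1` fallback is only a guess that should be replaced with the real source encoding when it is known.

```python
from pathlib import Path


def ensure_utf8(path: Path, fallback_encoding: str = "latin-1") -> bool:
    """Re-save the file as UTF-8 if it is not already valid UTF-8.

    Returns True if the file was rewritten, False if it was left alone.
    The fallback encoding is an assumption about the original file.
    """
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # Already valid UTF-8, nothing to do.
    except UnicodeDecodeError:
        text = raw.decode(fallback_encoding)
        path.write_bytes(text.encode("utf-8"))
        return True
```

Running this over every input file once leaves UTF-8 files untouched and converts the rest in place, so it is worth keeping a backup of the originals.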

> Can you provide an example with real documents?

I didn't for two reasons. First, real documents should be real press articles, and there would be a copyright issue. Second, almost real documents would be press articles that I wrote myself for this purpose, and I have never taken the time to do it. If you know of any alternative, let me know and I'll happily make this available.

Cheers