jjangsangy / ExplainToMe

Automatic Web Article Summarizer
Apache License 2.0
414 stars 60 forks

Languages other than English #3

Open ghost opened 8 years ago

ghost commented 8 years ago

Hi there, thanks for the cool project. The bottom of the README says the support for other languages is a thing to look forward to -- could you elaborate on it a bit? Any particular plans? Let me know if you're looking for contributors that could handle different languages.

jjangsangy commented 8 years ago

Sure. So ExplainToMe currently does 3 things.

  1. Grabs the HTML from the webpage.
  2. Extracts the main article components.
  3. Generates a semantic graph and computes its centroid.

Currently #1 and #2 do not care about language; they mostly deal with HTML and webpage metadata. #3 does care about language, but mostly in terms of stopwords and language cleaning. If the user specifies the language of the article in advance (sometimes we can discover it in the HTML), we can provide stopwords, and most Romance languages should generate a decent summary.

We'll most likely start by supporting those languages.

I am interested in doing non-Romance languages, but we'll see how far we get.
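To make the role of stopwords concrete, here is a minimal, self-contained sketch of a TextRank-style sentence ranking in which the language only enters through the stopword list. It is not ExplainToMe's or sumy's actual code: the `STOPWORDS` table, the function names, and the overlap-based similarity measure are all assumptions chosen for illustration.

```python
import math
import re

# Tiny illustrative stopword lists (assumption: a real summarizer would load
# much larger per-language lists from a package or file).
STOPWORDS = {
    "english": {"the", "a", "of", "and", "is", "in", "to", "it"},
    "spanish": {"el", "la", "de", "y", "es", "en", "un", "una", "los", "las"},
}


def content_words(sentence, language):
    """Lowercase, tokenize, and drop the stopwords of the given language."""
    words = re.findall(r"\w+", sentence.lower())
    return {w for w in words if w not in STOPWORDS[language]}


def similarity(a, b):
    """Word-overlap similarity between two sets of content words."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids dividing by log(1) + log(1) == 0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def rank_sentences(sentences, language, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    bags = [content_words(s, language) for s in sentences]
    # weights[i][j] is the edge weight between sentence i and sentence j.
    weights = [
        [similarity(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

With the wrong stopword list, words such as "el" or "de" would count as content and link nearly every pair of sentences, which flattens the ranking. That is why step 3 needs to know the language while steps 1 and 2 do not.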

ghost commented 8 years ago

Cool. I take it you only use sumy as the summarisation platform? It seems to support Czech, French, German, Portuguese, Slovak, and Spanish out-of-the-box (the stop words for these languages are included in the package).

On 16 Aug 2016, at 21:18, Sang Han wrote: [quoted copy of the comment above]

jjangsangy commented 8 years ago

Correct. Sumy provides the right framework for building a document summarizer, and it already implements the most popular techniques.

My main concern about adding more languages is that I can't really attest to their accuracy in an intuitive way. My experience with cross-language NLP is that techniques vary in effectiveness based on latent cultural features.

gioferreira commented 7 years ago

I'd love to help with Portuguese (Brazilian Portuguese). I've been looking for something like this in Portuguese for ages.

jjangsangy commented 7 years ago

Awesome. Where I would start looking is under textrank.py. There is a function called run_summarizer that takes a keyword argument language. Currently there is no function for detecting the language, so you'll have to write one based on metadata, an HTML meta tag, or some library introduced to detect the language.
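As a rough starting point for the HTML route, the sketch below reads the language from the `lang` attribute of the `<html>` tag or from a language-related `<meta>` tag, using only the standard library. The names `LangSniffer`, `detect_language`, and `CODE_TO_NAME` are hypothetical and not part of ExplainToMe, and the mapping from language codes to language names is an assumption about what the summarizer expects.

```python
from html.parser import HTMLParser

# Hypothetical mapping from ISO 639-1 codes to language names.
CODE_TO_NAME = {
    "en": "english",
    "pt": "portuguese",
    "es": "spanish",
    "fr": "french",
    "de": "german",
    "cs": "czech",
    "sk": "slovak",
}

# Meta tag names that commonly carry a language or locale value.
META_KEYS = {"content-language", "language", "og:locale"}


class LangSniffer(HTMLParser):
    """Records the first language hint found in <html lang=...> or <meta>."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if self.lang:
            return  # keep the first hint only
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.lang = attrs["lang"]
        elif tag == "meta":
            key = (
                attrs.get("http-equiv")
                or attrs.get("name")
                or attrs.get("property")
                or ""
            ).lower()
            if key in META_KEYS and attrs.get("content"):
                self.lang = attrs["content"]


def detect_language(html, default="english"):
    """Return a language name for the page, or `default` if no hint is usable."""
    sniffer = LangSniffer()
    sniffer.feed(html)
    if not sniffer.lang:
        return default
    # "pt-BR" and "fr_FR" both reduce to their two-letter primary code.
    code = sniffer.lang.replace("_", "-").split("-")[0].strip().lower()
    return CODE_TO_NAME.get(code, default)
```

The result could then presumably be passed as the language keyword argument of run_summarizer. Pages without any hint fall back to the default, so a statistical detection library would still be needed for those.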

jjangsangy commented 7 years ago

Heads up: I'm making some changes that will be pushed upstream maybe this week or next. It shouldn't affect any code in textrank.py or the original API.

The change does, however, move a lot of files around. Mostly I've split the application into a Flask server that only displays the webpage and a summarization backend that runs asynchronously on AWS Lambda. I've mostly been running the public Heroku server as a demo, but it's getting costly to maintain, even if it's not that much every month.