BeautifulSoup is a library that, in my testing, seems to work well at providing all the text that is visible to the user, without any element tags (`<p>`, `</script>`, etc.). It can also remove markup comments. Here is something I threw together as a proof-of-concept.
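(The original snippet is a minimal sketch along these lines, assuming BeautifulSoup with the built-in `html.parser` backend; the URL and helper name here are illustrative, not the committed code:)

```python
import requests
from bs4 import BeautifulSoup
from bs4.element import Comment

def visible_text(html):
    """Return the user-visible text of an HTML page as one string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style blocks, whose contents are never rendered.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Drop markup comments (<!-- ... -->).
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Join the remaining text nodes, normalizing whitespace.
    return " ".join(soup.get_text(separator=" ").split())

if __name__ == "__main__":
    html = requests.get("https://www.example.com/").text
    print(visible_text(html))
```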
Thank you @ZacMonroe. Other students have used BeautifulSoup in other projects, so I know it's a powerful parser/scraper. But does it allow us to extract the content of an article, leaving out not only scripts and HTML markup but also other stuff that we do not want (e.g., navigation, headers, footers, ads, links to other pages)? Can you share with us what happens when you apply your script to a bunch of article pages, say, from The Onion, InfoWars, PolitiFact, Snopes, etc.?
If BeautifulSoup does not do what we need, consider Python Goose:
https://github.com/grangier/python-goose
https://pypi.python.org/pypi/goose-extractor/
It looks like it might be designed to do just what we want...
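For reference, the Goose extraction API is compact. A minimal sketch, shown here with the maintained goose3 fork discussed below (the URL is illustrative):

```python
from goose3 import Goose

g = Goose()
article = g.extract(url="https://www.example.com/some-article")
print(article.title)         # extracted headline
print(article.cleaned_text)  # main article body, markup stripped
```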
I remember we did evaluate Python Goose, and for some reason at the time (two years ago) it did not meet certain requirements, so we went with what we have currently. Perhaps @shaochengcheng can remind us what the problem with it was? It is possible that the project has matured in the meantime and could be a viable alternative now.
UPDATE: FYI, it seems that development of Python Goose has moved to a different repository: https://github.com/goose3/goose3
UPDATE 2: I have done some more research and it seems that the Dragnet algorithm is the state of the art on the task, outperforming Python Goose and a few other alternatives (see here). @ZacMonroe, would you like to give Dragnet a try? It could be a good chance to learn how to use a machine learning algorithm!
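Dragnet's Python interface is a single call; a minimal sketch, assuming the `extract_content` entry point from recent dragnet releases (the URL is illustrative):

```python
import requests
from dragnet import extract_content

html = requests.get("https://www.example.com/some-article").text
# extract_content() runs Dragnet's trained content-extraction model
# and returns the main article text as a plain string.
print(extract_content(html))
```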
I will certainly give it a try! Over the last few days I've been trying to use Aylien (for example, to extract the content of articles shared in tweets), but I just realized that the free version only allows 1,000 calls per day (i.e., 1,000 article extractions), which is likely not enough.
We're already using an API; it would be better to extract the content on our own server. Goose seems to be what we need, if it works. @shaochengcheng, please let us know if you know of any issues with Goose. In the meanwhile, @ZacMonroe, please test it.
May I suggest that @ZacMonroe take a look at Dragnet first? As I mentioned in my comment, it looks like Dragnet outperforms Python Goose in tests, so it would make sense to give it a try first.
@ZacMonroe @glciampaglia @filmenczer: We tried Python Goose but decided not to use it for the following reasons.
Thanks to @ZacMonroe and @glciampaglia, we now know that a new version, goose3, is under maintenance. Since there are several article parsers and none of them can guarantee 100 percent correctness, I am thinking of combining multiple parsers to get better performance. To do that, we need to test all of these parsers. Any ideas?
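One way to combine parsers is a fallback chain. A hypothetical sketch follows; the chain and its ordering are assumptions, not committed code, though the library calls themselves are real:

```python
from dragnet import extract_content
from goose3 import Goose

_goose = Goose()

# Ordered (name, extractor) pairs; each takes raw HTML and returns
# the article text. The ordering here is an assumption.
EXTRACTORS = [
    ("dragnet", extract_content),
    ("goose3", lambda html: _goose.extract(raw_html=html).cleaned_text),
]

def extract_with_fallback(html):
    """Try each extractor in turn; return (name, text) for the first
    one that yields non-empty text, or (None, '') if all fail."""
    for name, extract in EXTRACTORS:
        try:
            text = extract(html)
        except Exception:
            continue
        if text and text.strip():
            return name, text.strip()
    return None, ""
```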
Thanks, Chengcheng.
Any updates? @ZacMonroe, unless you have already done this, we suggest you try each parser (Dragnet, Goose3, Mercury) on one article (or a small sample of articles) from each source, and then report back on which sites fail with each tool. Then we can decide whether we can just pick one or need a more complex solution. A sketch of such a comparison follows below. Thanks!
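A hypothetical harness for that comparison; the sample URLs are placeholders. Mercury is JavaScript-based, so it would be called over HTTP instead (see later in this thread) and is omitted here:

```python
import requests
from dragnet import extract_content
from goose3 import Goose

_goose = Goose()

PARSERS = {
    "dragnet": extract_content,
    "goose3": lambda html: _goose.extract(raw_html=html).cleaned_text,
}

def report(sample_urls):
    """Print OK/MISS for each (parser, URL) pair."""
    for url in sample_urls:
        html = requests.get(url, timeout=30).text
        for name, parse in PARSERS.items():
            try:
                text = parse(html)
                ok = bool(text and text.strip())
            except Exception:
                ok = False
            print("OK  " if ok else "MISS", name.ljust(8), url)
```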
There is an additional issue related to the article extraction API. To reduce duplicate results in the frontend, we group together articles that have the same title and domain name. However, there are examples where Mercury fails to parse the title correctly, and so it groups together different articles from the same website. For example, there are several thousand articles from globalresearch.ca that are grouped under the same ID, 275864:
hoaxy=> select id, left(canonical_url, 70) from article where group_id = 275864 limit 20;
id | left
---------+------------------------------------------------------------------------
2079985 | https://www.globalresearch.ca/saudi-effort-to-isolate-iran-internation
2079984 | https://www.globalresearch.ca/video-italy-behind-the-parade-italys-act
2079981 | https://www.globalresearch.ca/running-amok-donald-trump-facilitates-ci
2079974 | https://www.globalresearch.ca/facebook-security-officer-not-all-speech
2079948 | https://www.globalresearch.ca/how-the-war-industry-corrupts-the-u-s-co
2079885 | https://www.globalresearch.ca/appeasement-as-global-policy-trumps-will
2079884 | https://www.globalresearch.ca/dems-put-finishing-touches-on-americas-o
2079858 | https://www.globalresearch.ca/imf-chief-christine-lagarde-found-guilty
2079854 | https://www.globalresearch.ca/venezuela-vanguard-of-a-new-world-iran-t
2079018 | https://www.globalresearch.ca/ex-cia-officer-lists-three-reasons-why-n
2078924 | https://www.globalresearch.ca/fracking-destroys-the-environment-and-po
2078854 | https://www.globalresearch.ca/how-washington-has-lost-its-way/5643474
2078826 | https://www.globalresearch.ca/the-untold-history-of-us-war-crimes/5523
2078723 | https://www.globalresearch.ca/us-war-crimes-in-syria-exposed/5643482
2078722 | https://www.globalresearch.ca/video-kurdish-leadership-declares-readin
2078721 | https://www.globalresearch.ca/apartheid-in-action-israeli-parliament-b
2078718 | https://www.globalresearch.ca/trumps-withdrawal-from-the-iran-deal-is-
2078591 | https://www.globalresearch.ca/video-a-arte-da-guerra-por-tras-da-parad
2078460 | https://www.globalresearch.ca/syrias-press-conference-the-united-natio
2078392 | http://www.globalresearch.ca/israeli-prime-minister-netanyahu-linked-t
(20 rows)
The reason is that Mercury does not return the full title:
giovanni@frosty [10:51:19 AM] [~]
-> % curl -H "x-api-key: [...] " "https://mercury.postlight.com/parser?url=https://www.globalresearch.ca/saudi-effort-to-isolate-iran-internationally/5643526" -q | jq .
{
"title": "Global Research", <------ INCOMPLETE
"author": "James M. Dorsey",
"date_published": null,
"dek": null,
The actual <title> tag is:
<title>Saudi Effort to Isolate Iran Internationally | Global Research - Centre for Research on Globalization</title>
Group ID 275864 includes more than 14,000 articles and is the largest one. However, it is not the only one that groups too many articles:
hoaxy=> select group_id, count(*) from article group by group_id order by count(*) desc limit 20;
group_id | count
----------+--------
| 785675
275864 | 14971
312930 | 4747
652220 | 4075
452722 | 3834
21669 | 1407
371 | 1001
701757 | 809
185458 | 553
313148 | 509
9047 | 500
172951 | 492
16769 | 485
91181 | 414
367866 | 396
187350 | 367
142239 | 296
53850 | 246
317990 | 235
80658 | 233
(20 rows)
This is now high priority because the Mercury Web Parser API will be shut down on April 15, 2019. The Mercury Parser code is now free and open-source (https://github.com/postlight/mercury-parser), so we should be able to incorporate it into our pipeline!
This is the code for the Mercury API server: https://github.com/postlight/mercury-parser-api
As mentioned today, it would be much easier to integrate Mercury via the API rather than trying to run it directly from within Hoaxy, which would be highly complex because Mercury is written in a language other than Python (JavaScript).
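Calling a self-hosted instance from Python would then be straightforward. A hypothetical sketch, assuming the self-hosted service exposes the same /parser?url= route as the hosted API in the curl example above; the host and port are placeholders:

```python
import requests

# Assumed local deployment of mercury-parser-api; adjust host/port.
MERCURY_ENDPOINT = "http://localhost:3000/parser"

def mercury_extract(article_url):
    """Fetch Mercury's parse of an article; return (title, content)."""
    resp = requests.get(MERCURY_ENDPOINT, params={"url": article_url},
                        timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return data.get("title"), data.get("content")
```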
@ZacMonroe could you please check newspaper3k and see how it compares to Mercury, Goose3 and Dragnet?
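For reference, newspaper3k's basic usage looks like this (the URL is illustrative):

```python
from newspaper import Article

article = Article("https://www.example.com/some-article")
article.download()   # fetch the HTML
article.parse()      # run the extraction
print(article.title)
print(article.text)  # main article body
```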
We should move to a better package for text extraction from HTML. The issues are the following:
1) The current API is failing on some sites (e.g., The Onion).
2) We also need to store only the text, because it will take less space.
To solve the problem of multiple versions, we should keep only the first version of an article.
@ZacMonroe could help with researching a package for HTML text extraction and with testing it.