BeautifulSoup is a library that, in my testing, seems to work well at providing all the text that is visible to the user, without any element tags (`<p>`, `</script>`, etc.). It can also remove markup comments. Here is something I threw together as a proof-of-concept.
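(The original snippet is a minimal sketch along these lines, assuming BeautifulSoup with the built-in `html.parser` backend; the URL and helper name here are illustrative, not the committed code:)

```python
import requests
from bs4 import BeautifulSoup
from bs4.element import Comment

def visible_text(html):
    """Return the user-visible text of an HTML page as one string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style blocks, whose contents are never rendered.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Drop markup comments (<!-- ... -->).
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Join the remaining text nodes, normalizing whitespace.
    return " ".join(soup.get_text(separator=" ").split())

if __name__ == "__main__":
    html = requests.get("https://www.example.com/").text
    print(visible_text(html))
```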
Thank you @ZacMonroe. Other students have used BeautifulSoup in other projects, so I know it's a powerful parser/scraper. But does it allow us to extract the content of an article, leaving out not only scripts and HTML markup but also other stuff that we do not want (e.g., navigation, headers, footers, ads, links to other pages)? Can you share with us what happens when you apply your script to a bunch of article pages, say, from The Onion, InfoWars, PolitiFact, Snopes, etc.?
If BeautifulSoup does not do what we need, consider Python Goose:
https://github.com/grangier/python-goose
https://pypi.python.org/pypi/goose-extractor/
It looks like it might be designed to do just what we want...
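For reference, the Goose extraction API is compact. A minimal sketch, shown here with the maintained goose3 fork discussed below (the URL is illustrative):

```python
from goose3 import Goose

g = Goose()
article = g.extract(url="https://www.example.com/some-article")
print(article.title)         # extracted headline
print(article.cleaned_text)  # main article body, markup stripped
```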
I remember we did evaluate Python Goose, and for some reason at the time (two years ago) it did not meet certain requirements, so we went with what we have currently. Perhaps @shaochengcheng can remind us what the problem with it was? It is possible that the project has matured in the meantime and could be a viable alternative now.
UPDATE: FYI, it seems that development of Python Goose has moved to a different repository: https://github.com/goose3/goose3
UPDATE 2: I have done some more research and it seems that the Dragnet algorithm is the state of the art on the task, outperforming Python Goose and a few other alternatives (see here). @ZacMonroe, would you like to give Dragnet a try? It could be a good chance to learn how to use a machine learning algorithm!
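Dragnet's Python interface is a single call; a minimal sketch, assuming the `extract_content` entry point from recent dragnet releases (the URL is illustrative):

```python
import requests
from dragnet import extract_content

html = requests.get("https://www.example.com/some-article").text
# extract_content() runs Dragnet's trained content-extraction model
# and returns the main article text as a plain string.
print(extract_content(html))
```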
I will certainly give it a try! Over the last few days I've been trying to use Aylien (for example, to extract the content of articles shared in tweets), but I just realized that the free version only allows 1,000 calls per day (i.e., 1,000 article extractions), which is likely not enough.
We're already using an API; it would be better to extract the content on our own server. Goose seems to be what we need, if it works. @shaochengcheng, please let us know if you know of any issues with Goose. In the meanwhile, @ZacMonroe, please test it.
May I suggest that @ZacMonroe take a look at Dragnet first? As I mentioned in my comment, it looks like Dragnet outperforms Python Goose in tests, so it would make sense to give it a try first.
@ZacMonroe @glciampaglia @filmenczer: We tried Python Goose but decided not to use it for the following reasons.
Thanks to @ZacMonroe and @glciampaglia, we now know that a new version, goose3, is under maintenance. Since there are several article parsers and none of them can guarantee 100 percent correctness, I am thinking of combining multiple parsers to get better performance. To do that, we need to test all of these parsers. Any ideas?
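One way to combine parsers is a fallback chain. A hypothetical sketch follows; the chain and its ordering are assumptions, not committed code, though the library calls themselves are real:

```python
from dragnet import extract_content
from goose3 import Goose

_goose = Goose()

# Ordered (name, extractor) pairs; each takes raw HTML and returns
# the article text. The ordering here is an assumption.
EXTRACTORS = [
    ("dragnet", extract_content),
    ("goose3", lambda html: _goose.extract(raw_html=html).cleaned_text),
]

def extract_with_fallback(html):
    """Try each extractor in turn; return (name, text) for the first
    one that yields non-empty text, or (None, '') if all fail."""
    for name, extract in EXTRACTORS:
        try:
            text = extract(html)
        except Exception:
            continue
        if text and text.strip():
            return name, text.strip()
    return None, ""
```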
Thanks, Chengcheng.
Any updates? @ZacMonroe, unless you have already done this, we suggest you try each parser (Dragnet, Goose3, Mercury) on one article (or a small sample of articles) from each source, and then report back on which sites fail with each tool. Then we can decide whether we can just pick one or need a more complex solution. A sketch of such a comparison follows below. Thanks!
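A hypothetical harness for that comparison; the sample URLs are placeholders. Mercury is JavaScript-based, so it would be called over HTTP instead (see later in this thread) and is omitted here:

```python
import requests
from dragnet import extract_content
from goose3 import Goose

_goose = Goose()

PARSERS = {
    "dragnet": extract_content,
    "goose3": lambda html: _goose.extract(raw_html=html).cleaned_text,
}

def report(sample_urls):
    """Print OK/MISS for each (parser, URL) pair."""
    for url in sample_urls:
        html = requests.get(url, timeout=30).text
        for name, parse in PARSERS.items():
            try:
                text = parse(html)
                ok = bool(text and text.strip())
            except Exception:
                ok = False
            print("OK  " if ok else "MISS", name.ljust(8), url)
```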
There is an additional issue related to the article extraction API. To reduce duplicate results in the frontend, we group together articles that have the same title and domain name. However, there are examples where Mercury fails to parse the title correctly, and so it groups together different articles from the same website. For example, there are several thousand articles from globalresearch.ca that are grouped under the same ID, 275864:
hoaxy=> select id, left(canonical_url, 70) from article where group_id = 275864 limit 20;
id | left
---------+------------------------------------------------------------------------
2079985 | https://www.globalresearch.ca/saudi-effort-to-isolate-iran-internation
2079984 | https://www.globalresearch.ca/video-italy-behind-the-parade-italys-act
2079981 | https://www.globalresearch.ca/running-amok-donald-trump-facilitates-ci
2079974 | https://www.globalresearch.ca/facebook-security-officer-not-all-speech
2079948 | https://www.globalresearch.ca/how-the-war-industry-corrupts-the-u-s-co
2079885 | https://www.globalresearch.ca/appeasement-as-global-policy-trumps-will
2079884 | https://www.globalresearch.ca/dems-put-finishing-touches-on-americas-o
2079858 | https://www.globalresearch.ca/imf-chief-christine-lagarde-found-guilty
2079854 | https://www.globalresearch.ca/venezuela-vanguard-of-a-new-world-iran-t
2079018 | https://www.globalresearch.ca/ex-cia-officer-lists-three-reasons-why-n
2078924 | https://www.globalresearch.ca/fracking-destroys-the-environment-and-po
2078854 | https://www.globalresearch.ca/how-washington-has-lost-its-way/5643474
2078826 | https://www.globalresearch.ca/the-untold-history-of-us-war-crimes/5523
2078723 | https://www.globalresearch.ca/us-war-crimes-in-syria-exposed/5643482
2078722 | https://www.globalresearch.ca/video-kurdish-leadership-declares-readin
2078721 | https://www.globalresearch.ca/apartheid-in-action-israeli-parliament-b
2078718 | https://www.globalresearch.ca/trumps-withdrawal-from-the-iran-deal-is-
2078591 | https://www.globalresearch.ca/video-a-arte-da-guerra-por-tras-da-parad
2078460 | https://www.globalresearch.ca/syrias-press-conference-the-united-natio
2078392 | http://www.globalresearch.ca/israeli-prime-minister-netanyahu-linked-t
(20 rows)
The reason is that Mercury does not return the full title:
giovanni@frosty [10:51:19 AM] [~]
-> % curl -H "x-api-key: [...] " "https://mercury.postlight.com/parser?url=https://www.globalresearch.ca/saudi-effort-to-isolate-iran-internationally/5643526" -q | jq .
{
"title": "Global Research", <------ INCOMPLETE
"author": "James M. Dorsey",
"date_published": null,
"dek": null,
The actual <title> tag is:
<title>Saudi Effort to Isolate Iran Internationally | Global Research - Centre for Research on Globalization</title>
Group ID 275864 includes more than 14,000 articles and is the largest one. However, it is not the only one that groups too many articles:
hoaxy=> select group_id, count(*) from article group by group_id order by count(*) desc limit 20;
group_id | count
----------+--------
| 785675
275864 | 14971
312930 | 4747
652220 | 4075
452722 | 3834
21669 | 1407
371 | 1001
701757 | 809
185458 | 553
313148 | 509
9047 | 500
172951 | 492
16769 | 485
91181 | 414
367866 | 396
187350 | 367
142239 | 296
53850 | 246
317990 | 235
80658 | 233
(20 rows)
This is now high priority because the Mercury Web Parser API will be shut down on April 15, 2019. The Mercury Parser code is now free and open-source (https://github.com/postlight/mercury-parser), so we should be able to incorporate it into our pipeline!
This is the code for the Mercury API server: https://github.com/postlight/mercury-parser-api
As mentioned today, it would be much easier to integrate Mercury via the API rather than trying to run it directly from within Hoaxy, which would be highly complex because Mercury is written in a language other than Python (JavaScript).
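Calling a self-hosted instance from Python would then be straightforward. A hypothetical sketch, assuming the self-hosted service exposes the same /parser?url= route as the hosted API in the curl example above; the host and port are placeholders:

```python
import requests

# Assumed local deployment of mercury-parser-api; adjust host/port.
MERCURY_ENDPOINT = "http://localhost:3000/parser"

def mercury_extract(article_url):
    """Fetch Mercury's parse of an article; return (title, content)."""
    resp = requests.get(MERCURY_ENDPOINT, params={"url": article_url},
                        timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return data.get("title"), data.get("content")
```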
@ZacMonroe could you please check newspaper3k and see how it compares to Mercury, Goose3 and Dragnet?
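For reference, newspaper3k's basic usage looks like this (the URL is illustrative):

```python
from newspaper import Article

article = Article("https://www.example.com/some-article")
article.download()   # fetch the HTML
article.parse()      # run the extraction
print(article.title)
print(article.text)  # main article body
```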
We should move to a better package for text extraction from HTML. The issues are the following:
1) The current API is failing on some sites (e.g., The Onion).
2) We also need to store only the text, because it will take less space.
To solve the problem of multiple versions, we should keep only the first version of an article.
@ZacMonroe could help with researching a package for HTML text extraction and with testing it.