palewire / news-homepages

An open-source archive that gathers, saves, shares and analyzes news homepages
https://homepages.news
GNU General Public License v3.0

add x/y/w/h of link to hyperlinks files by getting links in JS/playwright #488

Closed jeremybmerrill closed 1 month ago

jeremybmerrill commented 1 month ago

Right now, the hyperlinks module fetches the homepage via Playwright, then parses the page HTML with BeautifulSoup in Python to identify all the <a> tags and extract their href and text. With this PR, instead, we extract the link href, text and bounding box info (x/y/w/h) with JavaScript injected into the page with Playwright, returning an Array of Objects into Python (as a list of dicts).

Output looks like this (just by chance that I'm opening this PR when I am in the #2 spot on the homepage):

  {
    "text": "Jeremy B. Merrill",
    "url": "https://www.washingtonpost.com/people/jeremy-b-merrill/",
    "x": 143.171875,
    "y": 1213.484375,
    "w": 83.515625,
    "h": 14.5
  },
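
A minimal sketch of the approach described above, not the code merged in this PR. The names `LINKS_JS`, `extract_links` and `fetch_links` are hypothetical; the only Playwright calls assumed are `sync_playwright`, `chromium.launch`, `new_page`, `goto` and `page.evaluate`.

```python
# JavaScript evaluated inside the page. It returns one plain object per <a> tag,
# which Playwright hands back to Python as a list of dicts.
LINKS_JS = """
() => Array.from(document.querySelectorAll("a")).map((a) => {
  const box = a.getBoundingClientRect();  // position relative to the viewport
  return {
    text: a.innerText,
    url: a.href,
    x: box.x,
    y: box.y,
    w: box.width,
    h: box.height,
  };
})
"""


def extract_links(page):
    """Return link dicts from any object with a Playwright-style evaluate()."""
    links = page.evaluate(LINKS_JS)
    # Anchors without an href come back with an empty url; skip them.
    return [link for link in links if link.get("url")]


def fetch_links(url):
    """Open the URL in headless Chromium and collect its links (hypothetical helper)."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        links = extract_links(page)
        browser.close()
    return links
```

Because `extract_links` only needs an object with an `evaluate` method, it can be exercised with a stub page and no browser.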

This PR has the advantage of removing the BeautifulSoup dependency from the hyperlinks module. It has the disadvantage of introducing possible additional complexity of mixing languages, which just feels gross.

I have tested this against the WaPo homepage with python -m newshomepages.hyperlinks washingtonpost. With this PR, it takes 13s of "user" time per the output of time. The existing main-branch version takes 16s. All of this testing took place under a weird situation where my MBP was swapping a ton, so YMMV, but it seems like this PR at least doesn't make things slower. I haven't tested this at scale, but it seems more likely than not that this PR actually speeds things up. Happy to test more systematically if you'd like.

I have diff'ed the outputs of the main branch code and this PR's code and the difference is minimal, presumably explicable by changes to the underlying site due to a new live update pubbing.

    > "Trump backs out of ABC debate, says he will only debate Harris on Fox"
    > "“I’ll see her September 4th, or I won’t see her at all,” Trump posted on his social network."
    194d195
    < "GOP senators relished watching Democratic infighting during the summer, neglecting that their nominee would never merely focus on policy issues."
    196d196
    < "Trump backs out of ABC debate, says he will only debate Harris on Fox"
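
A comparison like the one above could also be run on the parsed JSON rather than on the raw text, so that the new bounding box fields don't show up as differences. This is a sketch under assumptions: `link_keys` and `compare_links` are hypothetical helpers, and both outputs are assumed to be lists of dicts with `text` and `url` keys.

```python
def link_keys(links):
    """Reduce each link dict to a hashable (text, url) pair, ignoring geometry."""
    return {((link.get("text") or "").strip(), link.get("url") or "") for link in links}


def compare_links(old_links, new_links):
    """Report which (text, url) pairs appear in only one of the two outputs."""
    old_keys = link_keys(old_links)
    new_keys = link_keys(new_links)
    return {
        "only_in_old": sorted(old_keys - new_keys),
        "only_in_new": sorted(new_keys - old_keys),
    }
```

Each list would come from loading a `.hyperlinks.json` file with `json.load`.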

This PR's output is meaningfully larger (+37KB) because of the bounding box information: 111,326 bytes versus 74,217 bytes on main. I previously had the w and h attributes named width and height but cut them back to save 5KB (which adds up over hundreds of sites, several times a day); that file was 116,221 bytes. I could be convinced to add the longer names back if we think they aid readability.
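
The cost of the longer key names can be checked with a small sketch like this one. `serialized_size` is a hypothetical helper, and the sample URL is a placeholder; the exact byte counts for a real file depend on how the project serializes its JSON.

```python
import json


def serialized_size(links, rename=None):
    """Return the UTF-8 byte length of the links serialized as indented JSON.

    `rename` optionally maps existing key names to replacement names.
    """
    rename = rename or {}
    renamed = [{rename.get(key, key): value for key, value in link.items()} for link in links]
    return len(json.dumps(renamed, indent=2).encode("utf-8"))


sample = [
    {"text": "Example headline", "url": "https://example.com/story", "x": 1.5, "y": 2.5, "w": 3.5, "h": 4.5},
    {"text": "Another headline", "url": "https://example.com/other", "x": 5.0, "y": 6.0, "w": 7.0, "h": 8.0},
]

short_size = serialized_size(sample)
long_size = serialized_size(sample, rename={"w": "width", "h": "height"})

# "width" is 4 characters longer than "w" and "height" is 5 longer than "h",
# so the longer names cost 9 bytes per link.
extra_per_link = (long_size - short_size) / len(sample)
```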

This PR will close #487. Looking forward to your thoughts!

palewire commented 1 month ago

Generally looking good. I think we can find a way to get this in eventually.

Here are some initial questions:

  1. Why do the x, y, w, h convention for a bounding box? I'm not opposed to it, I'm just curious about the rationale for it. I can imagine other conventions, like the GeoJSON bounding box array or an x1, x2, y1, y2 format or the different types described here.
  2. Do we need any kind of error handling in the JavaScript snippet?
  3. Is there any other use of bs4 in the repository? If not, can we get that cut from the Pipfile and setup.py as well? Could html5lib go too?

jeremybmerrill commented 1 month ago

  1. No strong preference either, nor is this something I spent much time thinking about. I think I was just copying what's available in the JavaScript output. I know we've ended up with a kind of bizarre format (top left bottom right, I think?) in Tabula which has caused some confusion -- but I'm not sure there's a One Right Way.
  2. Good question. I can run this code against a bunch of outlets and see what happens? Do you know of any outlets whose homepages tend to be more pathological than average?
  3. Appears not. I'll cut it. That's a win.

jeremybmerrill commented 1 month ago

I've cut the bs4 dependency, as to question 3.

As to question 2, the only obvious difference I see is that KCAU-TV returns an empty list with my PR, but a two-element Captcha-related set of links with the main branch. This is not a substantive difference, however, since there's no meaningful content at issue.

jeremybmerrill commented 1 month ago

html5lib removed and subbed in top/left/bottom/right!
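
For reference, the two conventions carry the same information. The sketch below converts the x/y/w/h box from the PR description into edge coordinates; `to_edges` is a hypothetical helper, and it assumes the origin is the top-left corner with y increasing downward, as in the browser's `getBoundingClientRect()`.

```python
def to_edges(box):
    """Convert an x/y/w/h box into top/left/bottom/right edges."""
    return {
        "top": box["y"],
        "left": box["x"],
        "bottom": box["y"] + box["h"],
        "right": box["x"] + box["w"],
    }


# The box from the sample output earlier in this thread.
edges = to_edges({"x": 143.171875, "y": 1213.484375, "w": 83.515625, "h": 14.5})
# edges == {"top": 1213.484375, "left": 143.171875, "bottom": 1227.984375, "right": 226.6875}
```

The `DOMRect` returned in the browser already exposes `top`, `left`, `bottom` and `right`, so the switch can be made in the JavaScript snippet without extra arithmetic.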

jeremybmerrill commented 1 month ago

re spam: sure, just added. Not sure how to test systematically, but it works for a few sites.

palewire commented 1 month ago

Cool. If you like the idea and it works, let's slap that in here and then ship it out.

jeremybmerrill commented 1 month ago

Yeah, seems like it works! Just pushed. I think I got all your suggestions added in.

palewire commented 1 month ago

Thank you. I will try to merge and release it tomorrow. I appreciate your interest and initiative.

jeremybmerrill commented 1 month ago

Thank you for doing all the work over TWELVE YEARS to maintain this awesome dataset. I think there could be some really fun automated media criticism with this dataset. (How much is "Trump" mentioned over time on headlines across NYT/WP/AP/CNN/Fox/MSNBC/etc.? How much is Trump mentioned in the top 2 screens, etc.?)
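
A count like that could start from something like the sketch below. It assumes each link dict has `text` and `top` keys, and it approximates "top 2 screens" with a hypothetical screen height of 800 pixels; `count_mentions` is an illustrative name, not part of the package.

```python
def count_mentions(links, term, max_top=None):
    """Count links whose text contains `term`, case-insensitively.

    If `max_top` is given, only links whose top edge is at or above that
    pixel offset are counted; links with no `top` value are skipped.
    """
    term = term.lower()
    count = 0
    for link in links:
        if term not in (link.get("text") or "").lower():
            continue
        if max_top is not None:
            top = link.get("top")
            if top is None or top > max_top:
                continue
        count += 1
    return count


SCREEN_HEIGHT = 800  # hypothetical viewport height in pixels


def count_in_top_screens(links, term, screens=2):
    """Count mentions that start within the first `screens` screens of the page."""
    return count_mentions(links, term, max_top=screens * SCREEN_HEIGHT)
```

Running this over each archived snapshot for an outlet would give a time series per term.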

palewire commented 1 month ago

I made a few tweaks locally and merged your PR into main. The biggest thing: Our timeout variable was going into the "wait" argument of the browser utility. That resulted in a long pause before the links were gathered. I just cut the option. If you disagree with this, please just let me know.

Assuming the unit tests pass, I'm going to cut a release and ship to https://github.com/palewire/news-homepages-runner

jeremybmerrill commented 1 month ago

awesome, sounds great. thanks! I had not investigated the timeout variable, I just copy pasted it :)

palewire commented 1 month ago

In case you're interested, here's the most sophisticated link analysis I've dinked around with thus far: https://palewi.re/docs/news-homepages/drudge.html

It depends on my "storysniffer" machine learning model to filter out furniture links and trim the scraped list down to only stories. https://palewi.re/docs/storysniffer/

palewire commented 1 month ago

We live.

https://archive.org/download/drudge-2024/drudge-2024-08-15T11%3A19%3A15.649322-04%3A00.hyperlinks.json https://x.com/palewire/status/1824105859816808620

jeremybmerrill commented 1 month ago

woohoo! thank you for the tweet and for merging this. I'll check out the Drudge thing and maybe have some fun charts to share in coming days.

jeremybmerrill commented 2 weeks ago

Re: link analysis, I did some NLP. Here are the top nouns across news orgs:

[Image: Screenshot 2024-08-27 at 9 10 54 PM]

Here are the top verbs the Post uses for Trump, Biden, Harris, Democrats and Republicans. I'm not sure there's any interesting takeaway from this. Although let me know if you see one!

[Image: Screenshot 2024-08-27 at 9 00 18 PM]

Storysniffer distinguishes boilerplate links from story links? I tried to do that just by discarding any link that occurred over 10 days. Then I cleaned the headline texts with some regexes and things. (The NYT headlines are a MESS of headline, subhed and photo caption.)
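
One reading of that heuristic, sketched under assumptions: a link counts as boilerplate if its URL shows up in snapshots from more than 10 distinct days. The function name, the `snapshots` structure (a dict mapping a date string to that day's list of link dicts) and the threshold are all illustrative, not the code actually used.

```python
from collections import defaultdict


def drop_recurring_links(snapshots, max_days=10):
    """Keep one copy of each link whose URL appears on at most `max_days` days."""
    days_seen = defaultdict(set)
    for day, links in snapshots.items():
        for link in links:
            days_seen[link["url"]].add(day)

    kept = {}
    for day in sorted(snapshots):
        for link in snapshots[day]:
            url = link["url"]
            if len(days_seen[url]) <= max_days:
                # Keep the earliest occurrence of each surviving URL.
                kept.setdefault(url, link)
    return list(kept.values())
```

Navigation and footer links recur every day and are dropped, while a story that stays on the homepage for less than the threshold survives.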

palewire commented 2 weeks ago

Yes, that's what storysniffer does.

Since Harris, as the candidate, is the newest of news, I think any analysis of how she's being portrayed would be interesting to zoom in on.