Closed: jeremybmerrill closed this 1 month ago
Generally looking good. I think we can find a way to get this in eventually.
Here are some initial questions:
Why the `x, y, w, h` convention for a bounding box? I'm not opposed to it, I'm just curious about the rationale for it. I can imagine other conventions, like the GeoJSON bounding box array, an `x1, x2, y1, y2` format, or the different types described here.

Is `bs4` still used anywhere in the repository? If not, can we get that cut from the Pipfile and setup.py as well? Could `html5lib` go too?
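For what it's worth, the candidate conventions are trivially interconvertible. A minimal sketch, assuming the PR's x/y/w/h layout (the function names here are mine, not from the codebase):

```python
# Illustrative conversions between the bounding-box conventions discussed
# above. Function names are hypothetical, not from news-homepages.

def xywh_to_corners(x, y, w, h):
    """x/y/w/h (as in this PR) -> (x1, y1, x2, y2) corner form."""
    return (x, y, x + w, y + h)

def xywh_to_bbox_array(x, y, w, h):
    """x/y/w/h -> a GeoJSON-style [min x, min y, max x, max y] array."""
    return [x, y, x + w, y + h]
```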
go too?I've cut the bs4 dependency, as to question 3.
As to question 2, the only obvious difference I see is that KCAU-TV returns an empty list with my PR, but a two-element Captcha-related set of links with the main branch. This is not a substantive difference, however, since there's no meaningful content at issue.
`html5lib` removed, and top/left/bottom/right subbed in!
Re: spam: sure, just added. Not sure how to test systematically, but it works for a few sites.
Cool. If you like the idea and it works, let's slap that in here and then ship it out.
Yeah, seems like it works! Just pushed. I think I got all your suggestions added in.
Thank you. I will try to merge and release it tomorrow. I appreciate your interest and initiative.
Thank you for doing all the work over TWELVE YEARS to maintain this awesome dataset. I think there could be some really fun automated media criticism with this dataset. (How much is "Trump" mentioned over time on headlines across NYT/WP/AP/CNN/Fox/MSNBC/etc.? How much is Trump mentioned in the top 2 screens, etc.?)
I made a few tweaks locally and merged your PR into main. The biggest thing: Our timeout variable was going into the "wait" argument of the browser utility. That resulted in a long pause before the links were gathered. I just cut the option. If you disagree with this, please just let me know.
Assuming the unit tests pass, I'm going to cut a release and ship to https://github.com/palewire/news-homepages-runner
awesome, sounds great. thanks! I had not investigated the timeout variable, I just copy pasted it :)
In case you're interested, here's the most sophisticated link analysis I've dinked around with thus far: https://palewi.re/docs/news-homepages/drudge.html
It depends on my "storysniffer" machine learning model to filter out furniture links and trim the scraped list down to only stories. https://palewi.re/docs/storysniffer/
woohoo! thank you for the tweet and for merging this. I'll check out the Drudge thing and maybe have some fun charts to share in coming days.
Re: link analysis, I did some NLP. Here are the top nouns across news orgs:
Here are the top verbs the Post uses for Trump, Biden, Harris, Democrats and Republicans. I'm not sure there's any interesting takeaway from this. Although let me know if you see one!
Storysniffer distinguishes boilerplate links from story links? I tried to do that just by discarding any link that appeared on more than 10 days. Then I cleaned the headline texts with some regexes and things. (The NYT headlines are a MESS of headline, subhed and photo caption.)
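A minimal sketch of that 10-day heuristic, assuming snapshots keyed by capture date (the data shapes and names here are mine, not from either codebase):

```python
# Sketch of the furniture-filtering heuristic described above: a link that
# shows up in captures from more than 10 distinct days is probably
# navigation/boilerplate, not a story. Data shapes are hypothetical.
from collections import Counter

def likely_story_links(snapshots_by_day, max_days=10):
    """snapshots_by_day: {date: set of URLs seen on that day's homepage}."""
    days_seen = Counter()
    for urls in snapshots_by_day.values():
        for url in set(urls):
            days_seen[url] += 1
    return {url for url, n in days_seen.items() if n <= max_days}
```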
Yes, that's what storysniffer does.
Since Harris, as the candidate, is the newest of news, I think any analysis of how she's being portrayed would be interesting to zoom in on.
Right now, the `hyperlinks` module fetches the homepage via Playwright, then parses the page HTML with BeautifulSoup in Python to identify all the `<a>` tags and extract their `href` and text. With this PR, we instead extract each link's href, text and bounding box info (x/y/w/h) with JavaScript injected into the page with Playwright, returning an Array of Objects to Python as a list of dicts.

Output looks like this (just by chance, I'm opening this PR while I'm in the #2 spot on the homepage):
This PR has the advantage of removing the BeautifulSoup dependency from the `hyperlinks` module. It has the disadvantage of introducing the additional complexity of mixing languages, which just feels gross.

I have tested this against the WaPo homepage with `python -m newshomepages.hyperlinks washingtonpost`. With this PR, it takes 13s of "user" time per the output of `time`; the existing main-branch version takes 16s. All of this testing took place while my MBP was swapping a ton, so YMMV, but this PR at least doesn't seem to make things slower, and more likely than not it actually speeds things up. Happy to test more systematically if you'd like.

I have diffed the outputs of the main-branch code and this PR's code, and the difference is minimal, presumably explained by changes to the underlying site as a new live update published.
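For illustration, the injected-JavaScript approach described above might be sketched like this (this is my paraphrase of the idea, not the PR's actual code; names are hypothetical):

```python
# Hypothetical sketch: extract each link's href, text, and bounding box via
# JavaScript injected with Playwright, returning a JS Array of Objects that
# arrives in Python as a list of dicts. Not the PR's actual implementation.

GET_LINKS_JS = """
() => Array.from(document.querySelectorAll("a")).map(a => {
    const r = a.getBoundingClientRect();
    return {
        url: a.href,
        text: a.innerText.trim(),
        x: r.x, y: r.y, w: r.width, h: r.height,
    };
})
"""

def get_link_list(url: str) -> list[dict]:
    """Fetch a homepage and return its links as a list of dicts."""
    # Imported lazily so the sketch can be read without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        links = page.evaluate(GET_LINKS_JS)  # Array of Objects -> list of dicts
        browser.close()
        return links
```

The single `page.evaluate` round trip is what avoids re-parsing the HTML with BeautifulSoup on the Python side.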
This PR's output is meaningfully larger (+37KB) because of the bounding box information: 74,217 bytes versus 111,326 bytes. I previously had the `w` and `h` attributes named `width` and `height` but shortened them to save 5KB (which adds up over hundreds of sites, several times a day); that file was 116,221 bytes. I could be convinced to add the long names back if we think they aid readability.

This PR will close #487. Looking forward to your thoughts!