Closed blester125 closed 11 months ago
Looks reasonable to me - any example output we can look at for each site? Some of the utils are not necessarily news-specific, but we can worry about pulling out shared functionality later.
I added some examples. I think there is still room for someone to iterate on the HTML parsing. For example, the freedom.press examples seem to be missing some content, and the author section includes way too much stuff.
The body text looks okay-ish on the examples I see. I don't see the freedom.press examples, but if we're removing content that we shouldn't, that should be fixed. But yes, we should fix author parsing, as the author field seems to contain stuff that's not the author's name in all examples. @lintangsutawika is this something you can work on, or should we find someone else?
I dug into it a little. Basically, the sitemap parsing we use grabs all the links for a website. For example, for freedom.press it includes links like https://freedom.press/foia/obama-admin-secret-opposition-foia-reform/ or people profiles like https://freedom.press/people/kelly-caine/ which aren't formatted the same way as a news story like https://freedom.press/news/why-arent-more-journalism-schools-teaching-security-hygiene/
So it looks like we either want to look into expanding parsing to be able to handle these other pages (the hardest part is probably configuration, like which parser does this particular page need) or we can filter to just the news story pages (this seems like it could lose some data as these pages aren't totally empty).
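To make the "which parser does this page need" option concrete, here is a minimal sketch of picking a parser from the first path segment of the URL. The parser functions and the `PARSERS` table are hypothetical placeholders, not code from this PR; only the URLs come from the examples above.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse


def parse_news(html: str) -> dict:
    # Hypothetical placeholder for a news-article parser.
    return {"kind": "news", "length": len(html)}


def parse_profile(html: str) -> dict:
    # Hypothetical placeholder for a people-profile parser.
    return {"kind": "profile", "length": len(html)}


# Hypothetical mapping from the first URL path segment to a parser.
PARSERS: Dict[str, Callable[[str], dict]] = {
    "news": parse_news,
    "people": parse_profile,
}


def pick_parser(url: str) -> Optional[Callable[[str], dict]]:
    """Return the parser registered for the URL's first path segment, if any."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return None
    return PARSERS.get(segments[0])
```

With a table like this, unrecognized sections such as /foia/ return `None`, so the caller can decide whether to skip the page or fall back to a generic parser.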
My vote would be to merge the processing pipeline code (this PR) and then people can hack on the parser (the utils.parse_page function) later.
I can make some fixes for freedom.press. But if we want to merge this first, then it's also possible for me to make a new PR to adjust freedom.press parsing.
I guess it doesn't hurt to try to get the non-news-article pages, but I don't think we should sink a lot of time into it given that there aren't many of them and they don't have much content. @lintangsutawika I don't think @blester125 was saying it was just on freedom.press that there were issues, right?
I think other sites have a similar case, but the news articles outnumber those pages (and the news articles are really the main point anyway).
We should filter out the non-article pages then, right?
I think some are worth keeping; we just need to figure out what works and what doesn't. For example, https://freedom.press/training/blog/story-inside-your-software-updates/ is a "training" page that seems to be parsed basically correctly and has 1200 words in it. In contrast, https://freedom.press/training/secondary-signal-account/ only has its title after parsing, but the page itself has ~1500 words. I think some pages, like any /donate/ page, can definitely be filtered out. This filtering seems to be needed on multiple sites (for example, 360info has /visual_tags and tag pages that don't seem to have any info on them or get parsed to nothing, and there are similar /tag URLs in libertytvradio. I assume there are things like this for all the sites.)
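A minimal sketch of that kind of path-based filtering, assuming a simple deny-list of path segments. The `SKIP_SEGMENTS` set and the `is_probably_content` helper are hypothetical names for illustration, and the example.org URL is made up; the segments themselves are the ones mentioned in this thread.

```python
from urllib.parse import urlparse

# Hypothetical deny-list, built from the path segments mentioned in this thread.
SKIP_SEGMENTS = {"donate", "tag", "visual_tags"}


def is_probably_content(url: str) -> bool:
    """Return False if any segment of the URL path is on the deny-list."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return not any(s in SKIP_SEGMENTS for s in segments)


urls = [
    "https://freedom.press/news/why-arent-more-journalism-schools-teaching-security-hygiene/",
    "https://freedom.press/training/blog/story-inside-your-software-updates/",
    "https://freedom.press/donate/",
    "https://example.org/tag/politics/",  # made-up URL, for illustration only
]
kept = [u for u in urls if is_probably_content(u)]
```

Matching whole path segments (rather than substrings) avoids dropping an article whose slug merely contains the word "tag". Pages like the /training/ ones stay in, so whether they parse well is still a separate question.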
It looks like the author parsing is currently overzealous, and you get things like "author": "Trevor Timm\n\n\n\n\nExecutive Director\n\nMay 28, 2019"
which leads to double dates at the start of the article: "Unanswered questions on the San Francisco police raid of a journalist’s home\nTrevor Timm\n\n\n\n\nExecutive Director\n\nMay 28, 2019\nMay 28, 2019\n...
This happens in multiple sources (like 360info: "author": "Authors\nGilda TachedjianBurnet Institute and Monash University").
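One way the author field could be tightened up, as a rough sketch and not the fix that was actually made in this PR: split the raw field into lines, drop blanks, label lines such as "Authors", and lines that parse as a date, then keep the first line that remains. The `LABELS` set, the helper names, and the "Month day, year" date format are assumptions based only on the two examples above, and the month-name match assumes an English locale.

```python
from datetime import datetime

# Hypothetical set of label lines to drop, based on the 360info example.
LABELS = {"author", "authors", "by"}


def looks_like_date(line: str) -> bool:
    """True if the line matches the 'May 28, 2019' style seen in the examples."""
    try:
        datetime.strptime(line, "%B %d, %Y")
        return True
    except ValueError:
        return False


def clean_author(raw: str) -> str:
    """Keep the first line that is not blank, a label, or a date."""
    lines = [ln.strip() for ln in raw.splitlines()]
    candidates = [
        ln for ln in lines
        if ln and ln.lower() not in LABELS and not looks_like_date(ln)
    ]
    return candidates[0] if candidates else ""
```

On the freedom.press example this keeps only the name and drops the job title and date. It does not help when the name and affiliation are fused on one line, as in the 360info example; that would need source-specific handling.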
Updated the code to clean up the author and date extraction a lot. I also filter out some pointless pages, like the /tag/... ones. The fully scraped and processed dataset is at https://huggingface.co/datasets/blester125/news-dolma
There are still a few small issues, but they would need a lot of work to fix, i.e. code that runs for specific sources and whatnot. Those can be addressed in v2.
I'm going to merge this in a bit unless someone has objections.
@lintangsutawika if you can look over this PR that would be highly valuable. It should incorporate the changes requested to your original PR.