openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
719 stars 38 forks source link

[REVIEW]: Finnish Media Scrapers #3504

Closed whedon closed 2 years ago

whedon commented 3 years ago

Submitting author: @jiemakel (Eetu Mäkelä) Repository: https://github.com/hsci-r/finnish-media-scrapers Version: v1.1.1 Editor: @ajstewartlang Reviewers: @sara-shiho, @GaurangTandon Archive: 10.5281/zenodo.5796453

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/0bdd0f947b2d986b4acc4b9b6737dde8"><img src="https://joss.theoj.org/papers/0bdd0f947b2d986b4acc4b9b6737dde8/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/0bdd0f947b2d986b4acc4b9b6737dde8/status.svg)](https://joss.theoj.org/papers/0bdd0f947b2d986b4acc4b9b6737dde8)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@sara-shiho & @pmyteh, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @ajstewartlang know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Review checklist for @sara-shiho

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @pmyteh

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @GaurangTandon

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @sara-shiho, @pmyteh it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

  1. You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 3 years ago
Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.04 s (910.5 files/s, 53030.6 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          18            242            194            911
YAML                             4              7             12             92
Markdown                         3             55              0             85
reStructuredText                 4             45             77             53
TOML                             1              5              0             52
TeX                              1              2              0             35
DOS Batch                        1              8              1             26
make                             1              4              7              9
-------------------------------------------------------------------------------
SUM:                            33            368            291           1263
-------------------------------------------------------------------------------

Statistical information for the repository '20911ab5d5684f4c1f1be61c' was
gathered on 2021/07/16.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Eetu Mäkelä                     44          2724           2444           77.33
Pihla Toivanen                  38          1291            224           22.67

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Eetu Mäkelä                1289           47.3          0.7                7.60
Pihla Toivanen               58            4.5          2.0                6.90
whedon commented 3 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1177/1464884920985724 is OK
- 10.1111/j.1460-2466.2011.01596.x is OK

MISSING DOIs

- None

INVALID DOIs

- None
whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 3 years ago

:wave: @sara-shiho, please update us on how your review is going (this is an automated reminder).

whedon commented 3 years ago

:wave: @pmyteh, please update us on how your review is going (this is an automated reminder).

Sara-ShiHo commented 3 years ago

So far, I have a few questions/suggestions for the authors:

jiemakel commented 3 years ago

In commit https://github.com/hsci-r/finnish-media-scrapers/commit/901bbf522c7451da4928ba4724a69d25ff7605dd, I've tried to answer the questions/suggestions given by @Sara-ShiHo .

More specifically beyond what is now added to the README:

ajstewartlang commented 3 years ago

Hi @Sara-ShiHo and @pmyteh just wondering how your reviews are going?

Sara-ShiHo commented 3 years ago

I just looked through jiemakel's most recent commit & updated my reviewer checklist. I'm going through the functionality right now; I might have some more suggestions/questions.

Sara-ShiHo commented 3 years ago

First off, thank you @jiemakel for your detailed revision. I have a few more suggestions.

Finally, I think the paper is well-written and doesn't need change, but I thought the README documentation for HS could be reorganized like this (although feel free to do differently if you disagree).

This scraper requires a log-in email and password for Helsingin Sanomat. Sign up at https://www.hs.fi/ and add them to the command fms-fetch-hs. For example fms-fetch-hs -i hs-sdp.csv -o hs-sdp -u <email> -p <password>.

Technically, the fms-fetch-hs command uses pyppeteer to control a headless Chromium browser to log in and ensure the dynamically rendered content in HS articles is captured. To ensure a compatible Chromium, when first running the tool, pyppeteer will download an isolated version of Chromium for itself, causing some ~150MB of network traffic and disk space usage.

After fetching the articles, extract texts with fms-html-to-text e.g. fms-html-to-text-hs -o hs-sdp-output hs-sdp.

UMTti commented 3 years ago

I've done the following additions to the repository as @Sara-ShiHo suggested:

ajstewartlang commented 3 years ago

Hi @Sara-ShiHo and @pmyteh - I'm just checking in to see how your reviews are progressing - could you provide an update please?

Sara-ShiHo commented 3 years ago

The last round of changes look good to me. I'll have time tomorrow to do a final review.

Sara-ShiHo commented 3 years ago

@ajstewartlang I have checked off the last few things on my list. Everything looks good to me. Thank you.

ajstewartlang commented 3 years ago

Thanks @Sara-ShiHo. @pmyteh can I check you're getting on ok with your review?

ajstewartlang commented 2 years ago

@jiemakel many apologies for the delay in progressing your submission - I've been trying to line up another reviewer so that we have two completed reviews in place. I've just sent another invite out this morning...

ajstewartlang commented 2 years ago

@whedon add @GaurangTandon as reviewer

whedon commented 2 years ago

OK, @GaurangTandon is now a reviewer

GaurangTandon commented 2 years ago

Hi, I am happy to review the paper.

For now, my primary concern is of automated tests. I read your response above (https://github.com/openjournals/joss-reviews/issues/3504#issuecomment-909157981). You mentioned that:

However, that would have required including HTML examples of such articles as part of the repo, and that could have brought IPR concerns

I understand that there are two components to this project: the scraper (which downloads content using API) and the parser (which extracts structured data from the HTML). I believe the authors can still write valid tests for the parser without violating IPR concerns, by simply replacing the original text with some other text (for example, replacing the actual news headline with "This is a news headline" or the body with "This is a news article body"). The parser's primary concern is of the HTML structure whereas the IPR's primary concern is of the news content, and both of these seem to me as mutually satisfiable.

In fact, one can even write a cronjob to "refresh" these HTML files regularly by downloading latest HTML files from the news outlets, and making appropriate replacements to please IPR concerns (would require writing a separate script, though)

Overall, yes, this testing framework would be a bit unusual or perhaps clunky. I don't think not having automated tests are a show-stopper for this project, as there is defensive logging in place as well, but I would appreciate if there were some tests regardless.

I would love to hear what the authors think about this. I still haven't read the project fully so my analysis maybe mistaken, humble apologies if that's the case!

I will follow up with a detailed review soon. Cheers!

GaurangTandon commented 2 years ago

Hi, in addition to my comment about automated tests above, I have also added some other comments in the main GitHub issue (https://github.com/openjournals/joss-reviews/issues/3504#issue-946176417) about a broken functionality and two textual incoherencies that I think should be addressed.

Now, I'm only left with analysing the functionality of the other three components in this project, I'll analzye them after receiving author's response on the above comments that I have already posted.

Please feel free to reach out to me on email if you want to chat in detail about any of my comments. Thanks!

jiemakel commented 2 years ago

fms-query-il has now been updated to not require a --to-date parameter which was what caused the TypeError.

On the issue of bypassing IPR concerns by modifying the files for automated tests, I agree. We'll work on adding tests to cover examples of the different HTML formats.

jiemakel commented 2 years ago
ajstewartlang commented 2 years ago

Just checking in with you @GaurangTandon to see how the review of this submission is progressing?

GaurangTandon commented 2 years ago

Hi extremely sorry for the delay, I was busy with academics. I will get back to you on this by today itself.

GaurangTandon commented 2 years ago

Hi:

  1. The automated testing is clean. Thanks for implementing it!
  2. fms-query-il works now as per your bugfix.

Overall, I believe all functional and detail checkpoints are complete :heavy_check_mark:

There's one minor issue though: I think the software paper is perhaps too short. Maybe it has not yet been updated and I'm looking at the older version. https://github.com/openjournals/joss-papers/blob/joss.03504/joss.03504/10.21105.joss.03504.pdf I have highlighted my concerns with it in my review checklist in the parent issue post.

Regardless, the overall GitHub repository itself is complete and comprehensive. Can @ajstewartlang you shed some more light on if these missing points in the software paper are relevant enough or can they be skipped?

Barring this point, my review is complete. Thank you for offering me the opportunity and I wish good luck to the project authors! :-)

UMTti commented 2 years ago

Hi @GaurangTandon, I added the latest version of the paper here, you have been reading an older version

GaurangTandon commented 2 years ago

@UMTti Thank you for the updated link! I read it briefly and it does answer all my questions ✅ Most of its content seems very similar to the repo readme which I'd already read.

@ajstewartlang My review is now complete. Thank you all for the review opportunity, and I wish the team good luck :-)

ajstewartlang commented 2 years ago

@UMTti Can you confirm that the latest version of the paper is this one please? https://github.com/hsci-r/finnish-media-scrapers/blob/master/paper.md

jiemakel commented 2 years ago

@ajstewartlang it is.

ajstewartlang commented 2 years ago

@whedon generate pdf

ajstewartlang commented 2 years ago

@whedon check references

whedon commented 2 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1177/1464884920985724 is OK
- 10.1111/j.1460-2466.2011.01596.x is OK

MISSING DOIs

- None

INVALID DOIs

- None
whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

ajstewartlang commented 2 years ago

@whedon remove @pmyteh as reviewer

whedon commented 2 years ago

OK, @pmyteh is no longer a reviewer

ajstewartlang commented 2 years ago

@whedon generate pdf

whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

ajstewartlang commented 2 years ago

Many thanks for your helpful and detailed reviews, @Sara-ShiHo and @GaurangTandon - and for a great submission @UMTti

If you could now do the following please, that would be great:

jiemakel commented 2 years ago

DOI: 10.5281/zenodo.5796453 DOI

ajstewartlang commented 2 years ago

@whedon set 10.5281/zenodo.5796453 as archive

whedon commented 2 years ago

OK. 10.5281/zenodo.5796453 is the archive.

ajstewartlang commented 2 years ago

@whedon set v1.1.1 as version

whedon commented 2 years ago

OK. v1.1.1 is the version.

ajstewartlang commented 2 years ago

@whedon recommend-accept

whedon commented 2 years ago
Attempting dry run of processing paper acceptance...
whedon commented 2 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1177/1464884920985724 is OK
- 10.1111/j.1460-2466.2011.01596.x is OK

MISSING DOIs

- None

INVALID DOIs

- None
whedon commented 2 years ago

:wave: @openjournals/joss-eics, this paper is ready to be accepted and published.

Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/2843

If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/2843, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true
arfon commented 2 years ago

@whedon accept deposit=true

whedon commented 2 years ago
Doing it live! Attempting automated processing of paper acceptance...