Closed whedon closed 2 years ago
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @sara-shiho, @pmyteh it looks like you're currently assigned to review this paper :tada:.
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
:star: Important :star:
If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿
To fix this do the following two things:
For a list of things I can do to help you, just type:
@whedon commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@whedon generate pdf
Software report (experimental):
github.com/AlDanial/cloc v 1.88 T=0.04 s (910.5 files/s, 53030.6 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Python 18 242 194 911
YAML 4 7 12 92
Markdown 3 55 0 85
reStructuredText 4 45 77 53
TOML 1 5 0 52
TeX 1 2 0 35
DOS Batch 1 8 1 26
make 1 4 7 9
-------------------------------------------------------------------------------
SUM: 33 368 291 1263
-------------------------------------------------------------------------------
Statistical information for the repository '20911ab5d5684f4c1f1be61c' was
gathered on 2021/07/16.
The following historical commit information, by author, was found:
Author Commits Insertions Deletions % of changes
Eetu Mäkelä 44 2724 2444 77.33
Pihla Toivanen 38 1291 224 22.67
Below are the number of rows from each author that have survived and are still
intact in the current revision:
Author Rows Stability Age % in comments
Eetu Mäkelä 1289 47.3 0.7 7.60
Pihla Toivanen 58 4.5 2.0 6.90
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1177/1464884920985724 is OK
- 10.1111/j.1460-2466.2011.01596.x is OK
MISSING DOIs
- None
INVALID DOIs
- None
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
:wave: @sara-shiho, please update us on how your review is going (this is an automated reminder).
:wave: @pmyteh, please update us on how your review is going (this is an automated reminder).
So far, I have a few questions/suggestions for the authors:
.DS_Store
should be removed from the repo and be added to .gitignore
In commit https://github.com/hsci-r/finnish-media-scrapers/commit/901bbf522c7451da4928ba4724a69d25ff7605dd, I've tried to answer the questions/suggestions given by @Sara-ShiHo .
More specifically beyond what is now added to the README:
Hi @Sara-ShiHo and @pmyteh just wondering how your reviews are going?
I just looked through jiemakel's most recent commit & updated my reviewer checklist. I'm going through the functionality right now; I might have some more suggestions/questions.
First off, thank you @jiemakel for your detailed revision. I have a few more suggestions.
Finally, I think the paper is well-written and doesn't need change, but I thought the README documentation for HS could be reorganized like this (although feel free to do differently if you disagree).
This scraper requires a log-in email and password for Helsingin Sanomat. Sign up at https://www.hs.fi/ and add them to the command
fms-fetch-hs
. For examplefms-fetch-hs -i hs-sdp.csv -o hs-sdp -u <email> -p <password>
.Technically, the
fms-fetch-hs
command usespyppeteer
to control a headless Chromium browser to log in and ensure the dynamically rendered content in HS articles is captured. To ensure a compatible Chromium, when first running the tool,pyppeteer
will download an isolated version of Chromium for itself, causing some ~150MB of network traffic and disk space usage.After fetching the articles, extract texts with
fms-html-to-text
e.g.fms-html-to-text-hs -o hs-sdp-output hs-sdp
.
I've done the following additions to the repository as @Sara-ShiHo suggested:
Hi @Sara-ShiHo and @pmyteh - I'm just checking in to see how your reviews are progressing - could you provide an update please?
The last round of changes look good to me. I'll have time tomorrow to do a final review.
@ajstewartlang I have checked off the last few things on my list. Everything looks good to me. Thank you.
Thanks @Sara-ShiHo. @pmyteh can I check you're getting on ok with your review?
@jiemakel many apologies for the delay in progressing your submission - I've been trying to line up another reviewer so that we have two completed reviews in place. I've just sent another invite out this morning...
@whedon add @GaurangTandon as reviewer
OK, @GaurangTandon is now a reviewer
Hi, I am happy to review the paper.
For now, my primary concern is of automated tests. I read your response above (https://github.com/openjournals/joss-reviews/issues/3504#issuecomment-909157981). You mentioned that:
However, that would have required including HTML examples of such articles as part of the repo, and that could have brought IPR concerns
I understand that there are two components to this project: the scraper (which downloads content using API) and the parser (which extracts structured data from the HTML). I believe the authors can still write valid tests for the parser without violating IPR concerns, by simply replacing the original text with some other text (for example, replacing the actual news headline with "This is a news headline" or the body with "This is a news article body"). The parser's primary concern is of the HTML structure whereas the IPR's primary concern is of the news content, and both of these seem to me as mutually satisfiable.
In fact, one can even write a cronjob to "refresh" these HTML files regularly by downloading latest HTML files from the news outlets, and making appropriate replacements to please IPR concerns (would require writing a separate script, though)
Overall, yes, this testing framework would be a bit unusual or perhaps clunky. I don't think not having automated tests are a show-stopper for this project, as there is defensive logging in place as well, but I would appreciate if there were some tests regardless.
I would love to hear what the authors think about this. I still haven't read the project fully so my analysis maybe mistaken, humble apologies if that's the case!
I will follow up with a detailed review soon. Cheers!
Hi, in addition to my comment about automated tests above, I have also added some other comments in the main GitHub issue (https://github.com/openjournals/joss-reviews/issues/3504#issue-946176417) about a broken functionality and two textual incoherencies that I think should be addressed.
Now, I'm only left with analysing the functionality of the other three components in this project, I'll analzye them after receiving author's response on the above comments that I have already posted.
Please feel free to reach out to me on email if you want to chat in detail about any of my comments. Thanks!
fms-query-il
has now been updated to not require a --to-date
parameter which was what caused the TypeError
.
On the issue of bypassing IPR concerns by modifying the files for automated tests, I agree. We'll work on adding tests to cover examples of the different HTML formats.
Just checking in with you @GaurangTandon to see how the review of this submission is progressing?
Hi extremely sorry for the delay, I was busy with academics. I will get back to you on this by today itself.
Hi:
fms-query-il
works now as per your bugfix.Overall, I believe all functional and detail checkpoints are complete :heavy_check_mark:
There's one minor issue though: I think the software paper is perhaps too short. Maybe it has not yet been updated and I'm looking at the older version. https://github.com/openjournals/joss-papers/blob/joss.03504/joss.03504/10.21105.joss.03504.pdf I have highlighted my concerns with it in my review checklist in the parent issue post.
Regardless, the overall GitHub repository itself is complete and comprehensive. Can @ajstewartlang you shed some more light on if these missing points in the software paper are relevant enough or can they be skipped?
Barring this point, my review is complete. Thank you for offering me the opportunity and I wish good luck to the project authors! :-)
Hi @GaurangTandon, I added the latest version of the paper here, you have been reading an older version
@UMTti Thank you for the updated link! I read it briefly and it does answer all my questions ✅ Most of its content seems very similar to the repo readme which I'd already read.
@ajstewartlang My review is now complete. Thank you all for the review opportunity, and I wish the team good luck :-)
@UMTti Can you confirm that the latest version of the paper is this one please? https://github.com/hsci-r/finnish-media-scrapers/blob/master/paper.md
@ajstewartlang it is.
@whedon generate pdf
@whedon check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1177/1464884920985724 is OK
- 10.1111/j.1460-2466.2011.01596.x is OK
MISSING DOIs
- None
INVALID DOIs
- None
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@whedon remove @pmyteh as reviewer
OK, @pmyteh is no longer a reviewer
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Many thanks for your helpful and detailed reviews, @Sara-ShiHo and @GaurangTandon - and for a great submission @UMTti
If you could now do the following please, that would be great:
@whedon set 10.5281/zenodo.5796453 as archive
OK. 10.5281/zenodo.5796453 is the archive.
@whedon set v1.1.1 as version
OK. v1.1.1 is the version.
@whedon recommend-accept
Attempting dry run of processing paper acceptance...
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1177/1464884920985724 is OK
- 10.1111/j.1460-2466.2011.01596.x is OK
MISSING DOIs
- None
INVALID DOIs
- None
:wave: @openjournals/joss-eics, this paper is ready to be accepted and published.
Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/2843
If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/2843, then you can now move forward with accepting the submission by compiling again with the flag deposit=true
e.g.
@whedon accept deposit=true
@whedon accept deposit=true
Doing it live! Attempting automated processing of paper acceptance...
Submitting author: @jiemakel (Eetu Mäkelä) Repository: https://github.com/hsci-r/finnish-media-scrapers Version: v1.1.1 Editor: @ajstewartlang Reviewers: @sara-shiho, @GaurangTandon Archive: 10.5281/zenodo.5796453
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
Status
Status badge code:
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@sara-shiho & @pmyteh, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @ajstewartlang know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Review checklist for @sara-shiho
Conflict of interest
Code of Conduct
General checks
Functionality
[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)
Documentation
pip install
Software paper
Review checklist for @pmyteh
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper
Review checklist for @GaurangTandon
Conflict of interest
Code of Conduct
General checks
Functionality
--help
works as expectedfms-query-il
is broken it seems, rest of the tools work (I ran this command:fms-query-{yle|hs|il|is} --from-date 2021-01-01 -q 'parliament' -l 10 -o results2.csv
). I got aTypeError
for-il
.Documentation
pip
is used to install as an end user andpoetry
as a developer, both are pretty standard.members_of_parliament
folder is available.Software paper