ropensci / software-review

rOpenSci Software Peer Review.

healthforum: A R package for scraping health forum discussion threads #349

Closed LingshuHu closed 2 years ago

LingshuHu commented 5 years ago

Submitting Author Name: Lingshu Hu
Submitting Author Github Handle: @LingshuHu
Other Package Authors Github handles: @mkearney
Repository: https://github.com/LingshuHu/healthforum
Version submitted: 0.1.0
Editor: TBD
Reviewers: TBD

Archive: TBD
Version accepted: TBD


Package: healthforum
Type: Package
Title: Scrape Patient Forum Data
Version: 0.1.0
Authors@R: c(
    person("Lingshu", "Hu", ,
      email = "lingshu.hu@hotmail.com", role = c("aut", "cre"),
      comment = c(ORCID = "0000-0003-0304-882X")), 
    person("Michael W.", "Kearney", ,
      email = "kearneymw@missouri.edu", role = c("ctb"),
      comment = c(ORCID = "0000-0002-0730-4694")))
Description: Scrape data from Patient Forum <https://patient.info/forums> by entering urls. It will return a data frame containing text, user names, like counts, reply counts, etc.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Imports: 
    rvest,
    magrittr,
    xml2,
    purrr,
    tokenizers,
    stringr,
    tibble
Depends: R (>= 3.5.0)
RoxygenNote: 6.1.1
Suggests: 
    testthat (>= 2.1.0),
    knitr,
    rmarkdown
VignetteBuilder: knitr

Scope

Technical checks

Confirm each of the following by checking the box. This package:

Publication options

JOSS Options

- [x] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements).
- [x] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [x] The package is deposited in a long-term repository with the DOI:
- (*Do not submit your package separately to JOSS*)

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

maelle commented 5 years ago

Thanks for your submission @LingshuHu! We editors are discussing. We are specifically considering the privacy implications of this package. If you have any input on this topic, please let us know.

LingshuHu commented 5 years ago

Dear Maëlle,

Thank you for your feedback! In terms of privacy, 1) this package only collects publicly available data from the website. On this website, users need to fill out their names, nicknames, and other information. Users' real names are not visible, and users have control of showing information such as "about me," "my groups," "my replies," and "my discussions." This package only scrapes users' nicknames and other publicly visible information. By using users' nicknames, people cannot infer the URLs of users' profile pages and get their personal information. 2) We have added a responsibility disclaimer in README and vignette. 3) We are open to changes if reviewers suggest additional things to be done.

Please let us know if you have any other questions or comments. Thank you for your time and consideration!

Warm regards, Lingshu

maelle commented 5 years ago

:wave: @LingshuHu! Sorry for the delay, we're not forgetting this submission: we're in the process of formulating a general policy on this so we can be consistent and provide guidance to reviewers on the subject. Thanks for your patience!

LingshuHu commented 4 years ago

Dear Maëlle,

I hope all is well. I would just like to check the status of our paper. Is there any further information that we could provide or any improvement that we can make? Thank you!

Regards, Lingshu

noamross commented 4 years ago

Dear @LingshuHu, sorry this took a while to come back to. In the end we consulted with some bioethics and social media researchers and came up with a new Ethics and Privacy policy. You can preview it here and comment here.

noamross commented 4 years ago

Based on this approach, here are my thoughts on the healthforum package: It clearly accesses and generates personally identifiable and sensitive data. Looking at all the relevant ToC/privacy/etc on the site, the normative expectations of privacy are quite ambiguous. I think any research using the package would require a form of informed consent or an evaluation of such expectations, be it discussion with forum managers or a survey of users. Since this applies to pretty much any use of the package, I think it makes sense to ask you, the package authors, to do this so as to provide appropriate information to users. This would mean contacting the site administrators to get their guidance on the use of the package and including that prominently in the documentation.

It would also make sense to feature workflows in the vignette that remove personally identifying information. For instance, you could show an example where you pull text and then generate an analysis of frequently used terms, prominently noting that no user-level data is retained and that this would be appropriate for publishable analysis.
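A minimal sketch of the kind of vignette workflow suggested above, in base R. It does not call healthforum itself: the `posts` data frame and its column names `user_name` and `text` are hypothetical stand-ins for scraped output, and the stop-word list is a tiny illustrative one.

```r
# Hypothetical stand-in for scraped forum data; the column names
# `user_name` and `text` are assumptions, not the package's documented output.
posts <- data.frame(
  user_name = c("user_a", "user_b", "user_a"),
  text = c(
    "Headache and nausea after the new dose",
    "Nausea went away after a week",
    "Headache is worse at night"
  ),
  stringsAsFactors = FALSE
)

# Keep only the post text, then drop the original object so that
# user-level fields are not retained for the rest of the analysis.
texts <- posts$text
rm(posts)

# Lowercase, split on anything that is not a letter, discard empty tokens.
tokens <- unlist(strsplit(tolower(texts), "[^a-z]+"))
tokens <- tokens[nchar(tokens) > 0]

# Remove a small illustrative stop-word list (an assumption; a real
# analysis would use a fuller list).
stop_words <- c("a", "and", "the", "is", "at", "after", "went", "away")
tokens <- tokens[!tokens %in% stop_words]

# Aggregate term frequencies across all posts; nothing is grouped by user.
term_counts <- sort(table(tokens), decreasing = TRUE)
head(term_counts, 5)
```

Only the aggregated `term_counts` table survives to the end of the script, which is the property a vignette would want to state prominently.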

mkearney commented 4 years ago

@noamross:

I believe the healthforum package only has access to screen names. So to clarify: is the policy of ropensci that all 'user names' are considered to be personally identifiable data?

The GDPR defines personal data as

‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

I think the above paragraph could be interpreted that way, but my understanding is that an online identifier counts as personal data if it can be used to identify a natural person. This makes sense if the user names are email addresses. Or if user names are used in combination with IP addresses or linked to actual names. But I worry what kind of effect the broader interpretation would have on research–or what it means to software like {rtweet} that was accepted in the past.

noamross commented 4 years ago

I responded over at https://github.com/ropensci/dev_guide/pull/251/ so as to consolidate conversation.

noamross commented 4 years ago

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci is temporarily pausing new submissions for software peer review for 30 days (and possibly longer). Please check back here again after 17 April for updates.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent.

Other rOpenSci community activities continue. We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️

annakrystalli commented 4 years ago

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, new rOpenSci submissions for software peer review are paused.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent. Other rOpenSci community activities continue.

Please check back here again after 25 May when we will be announcing plans to slowly start back up.

We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️

noamross commented 4 years ago

Hello @LingshuHu and @mkearney, my apologies that this had fallen somewhat through the cracks without resolution. We started review activities up a few weeks ago, but I failed to pick up this conversation as it had spread across multiple repositories.

We adopted the policy that we discussed above, and it is at https://devguide.ropensci.org/policies.html#ethics-data-privacy-and-human-subjects-research . Our take on healthforum is that, since the vast majority of research uses would require users to obtain a form of informed consent, the package authors should facilitate this. That means obtaining either blanket approval from patient.info or, perhaps more realistically, appropriate contact info and a procedure for engaging with them and the user community, and then documenting this (e.g., "For informed consent procedures for using patient.info forum data, contact XXXX, community manager.").

Please let us know what your status is. Again, sorry this took so long to pick up again.

LingshuHu commented 4 years ago

Dear @noamross ,

Thank you for the update! We really appreciate your work during this hard time! We think providing users with contact information would be a great idea. We have included it in README.Rmd and also reminded users to contact their local IRBs for more detailed information about privacy policies. We also created a package startup message containing this information. Whenever users attach our package with `library()`, they will see it.

Could you please let us know what we should do next? Would you suggest that we release a new version of the CRAN package first?
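For readers unfamiliar with the mechanism, a startup message of this kind is typically defined through an `.onAttach()` hook. The following is only a sketch under assumptions: the message wording is invented for illustration, "XXXX" is the placeholder contact from the thread, and none of it is taken from the healthforum source.

```r
# Sketch of a startup notice, usually placed in a file such as R/zzz.R.
# The wording and the "XXXX" contact are placeholders, not the package's
# actual message.
.onAttach <- function(libname, pkgname) {
  packageStartupMessage(
    "Forum data may contain sensitive personal information.\n",
    "For informed consent procedures, contact XXXX and consult your local IRB."
  )
}
```

Messages raised this way are shown when the package is attached, and users can silence them with `suppressPackageStartupMessages()`.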

noamross commented 4 years ago

Dear @LingshuHu, my deep apologies that I did not respond to this previously. Packages that had been underway during our "pause" fell through my alerts.

As I wrote in my comment above, though, our opinion is that a disclaimer in the README isn't sufficient here. Since the predominant uses of this package would require an informed consent procedure, it is appropriate for the package authors to do the legwork of contacting patient.info and obtaining either blanket approval or at least the relevant contact info and procedure, as I described above.

We realize this is a high standard and would be open to another option, such as including an IRB approval showing how certain uses can be approved without direct engagement with patients.info.

noamross commented 3 years ago

Dear @LingshuHu I wanted to ping to check if you are planning on continuing this review given our response above. I've placed the "holding" tag on it for now.

LingshuHu commented 3 years ago

Dear @noamross Thank you for reaching out to me! Yes, we want to continue to work on it. We plan to apply for an IRB review. But currently, I'm occupied by my dissertation and graduation. I will let you know when we get a chance to go through the IRB review.

emilyriederer commented 2 years ago

Hi @LingshuHu ! We are doing a sweep of stale review issues. Since this review has been open and inactive for so long, much may have changed including author, editor, and reviewer bandwidth and ever-evolving rOpenSci best practices. As such, I'm closing this issue. If you still have interest and capacity, we would welcome you to open a new submission issue!