lwjohnst86 opened 2 years ago
Hi! In 'Brainstorm some functions/tasks that the package will do', would it be useful to have a British-to-American English converter (a suggestion from a participant at the R course), or is that out of scope?
Yes totally sounds like a cool idea!!
Hey everyone involved in this R package! I just wanted to let you know that I'm looking into making a function that can translate text from British to American English and vice versa. If you have any input or suggestions, just comment here :)
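To make the idea a bit more concrete, here's a very rough sketch of what such a function could look like (the `british_to_american` word list and the `translate_to_american` function name are made up for illustration; a real version would need a proper dictionary and smarter matching):

```r
# Hypothetical lookup table: British spelling -> American spelling
british_to_american <- c(
  "colour"  = "color",
  "analyse" = "analyze",
  "centre"  = "center"
)

translate_to_american <- function(text) {
  # Replace each British spelling with its American counterpart
  for (british in names(british_to_american)) {
    text <- gsub(british, british_to_american[[british]], text, fixed = TRUE)
  }
  text
}

translate_to_american("We analyse the colour at the centre.")
#> [1] "We analyze the color at the center."
```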
I have been a bit on and off during the day, but in the end I tried to do some brainstorming. Here are some of my thoughts:

That is what my brain managed to spew out today. I don't know where to go from here, but if people think anything could make a good package, I can try to flesh it out a bit more before our next working session.
@AndersAskeland Amazing ideas! I am also totally keen on idea 2!! Perhaps ideas 3 and 4 could be put together? I also find idea 1 quite handy :)
Excellent ideas! @AndersAskeland
I set up a package repository so we have a place to store our thoughts besides this thread, for when our big-brain ideas grow too big for it :) It's also a place to keep our raw reflection logs, which can be found here: https://github.com/science-collective/package/tree/master/doc/reflections (if you all add yours too). Mine is kind of a brainstorm at the moment. After some cleanup they'll probably get moved to the vignettes section at some point (I couldn't find that section in any of the other repos, feel free to move it as needed @lwjohnst86).
I'm commuting to Copenhagen (SDCC) and was hoping to join you guys on Discord today, but my internet connectivity here is horrible.
I've been wanting to look at stuff to enable a full end-to-end reproducible scientific workflow on "secure" offline servers, such as Statistics Denmark. For the most part, it's not too bad. They've downloaded and installed all of CRAN and Git, so all packages are available, and you can use version control. And they're quick to update packages or R/RStudio versions if you need them to.
The thing I found most frustrating when writing my paper on Statistics Denmark's servers was having to type in my bibliography BY HAND! (you can't even copy-paste). Nobody should be forced to do that - ever.
So, I've been looking at ways of creating an offline database of PubMed citations in a .bib-friendly format. For starters, I envision a minimalistic package that simply contains an attached object with citation strings for every paper ever published in a PubMed-indexed journal, or maybe one object per decade or year if size becomes a problem. I imagine being able to search the citations with regex patterns.
There's already the easyPubMed package, which queries PubMed citations (https://cran.r-project.org/web/packages/easyPubMed/vignettes/getting_started_with_easyPubMed.html), but that won't work in an offline environment. I hope to create something similar but with less functionality, as I don't intend the tool to be used to search the literature, just as a way to facilitate citing the papers you've already found.
The first step would be to download the MEDLINE/PubMed citation library via their FTP, then process it to scrape away unnecessary stuff (e.g. abstracts), and then save the remaining content in a file/format/object that can be embedded in an R package and read by R. The main issue is that it's a huge library, so CRAN is not going to like the size of the attached file, even when it has been scraped down to just the barebones citations. Using a short-text compression algorithm like Brotli (https://cran.r-project.org/web/packages/brotli/vignettes/benchmarks.html) to compress/decompress the file should reduce it to roughly 20% of the initial size, with high decompression speed at the user level. Another way might be to process the citation strings into a data frame format (e.g. variables for title, author1, author2, authorN, journal, year, volume, doi, etc.), which R should compress even more efficiently (all variables will have a limited number of levels except for the title and doi link). Should be a fun exercise in regexes.
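For illustration, here's a minimal sketch of the compression step, assuming the brotli package from CRAN; the `citations` character vector of barebones citation strings is made up for the example:

```r
library(brotli)

# Hypothetical barebones citation strings
citations <- c(
  "Smith J, et al. Some title. J Clin Invest. 2020;130(1):1-10.",
  "Jones A, et al. Another title. Diabetologia. 2019;62(5):100-110."
)

# Serialize to a raw vector, then compress with Brotli (quality 11 = max)
compressed <- brotli_compress(serialize(citations, NULL), quality = 11)

# Decompression at the user level is fast
restored <- unserialize(brotli_decompress(compressed))

# Search the restored citations with a regex pattern, as described above
grep("Diabetologia", restored, value = TRUE)
```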
Again, splitting the package into separate packages per decade/year/field is an option to reduce the size of each package and increase user-level performance (less data to open/decompress = faster). Failing that, the package could just reside outside of CRAN, e.g. on GitHub, and Statistics Denmark or other server managers could be asked to download it from there.
What do you think? Sounds like a painful fun project, right?
A great idea!
One could perhaps use the PubMed API (https://pubmed.ncbi.nlm.nih.gov/download/) to download the data on install. That way you would not have to include data in any package files. An added benefit would be that the package could update the article database without having to update the package itself.
However, there might be a bottleneck related to database lookup, in that I think R might struggle to search large files directly (i.e. a large .bib file). I am unsure if R provides good tooling for databases, but I imagine one would need to store the data in some sort of relational database (SQL-ish) and generate smaller .bib files based on lookups. I think dbplyr could work, but I am unsure.
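For what it's worth, here's a rough sketch of that lookup idea using DBI + RSQLite with dplyr (dbplyr must also be installed to translate the pipeline to SQL); the `citations` table and its columns are made up for illustration:

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), "pubmed-citations.sqlite")

# Hypothetical table with one row per citation
dbWriteTable(con, "citations", data.frame(
  pmid    = c("123456", "789012"),
  title   = c("Some title", "Another title"),
  journal = c("J Clin Invest", "Diabetologia"),
  year    = c(2020L, 2019L)
), overwrite = TRUE)

# dbplyr translates this to SQL, so the lookup happens inside the
# database rather than in R's memory
tbl(con, "citations") |>
  filter(journal == "Diabetologia") |>
  collect()

dbDisconnect(con)
```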
Yeah, really cool idea. One difficulty is that CRAN doesn't allow data files larger than 5 MB, so that puts a major limitation on uploading to CRAN and getting access via the server. I like the idea of having it as a GitHub-only package. What are DST's policies regarding that?
Alternatively (and this isn't about making a package), you could write out the bib citation keys in the Markdown file and, when you download it from DST, knit it outside DST. :shrug:
Yeah, I just realized CRAN is even more restrictive on package size than I thought (5 MB limit), so embedding the citations is off the table there. This leaves two options: 1) have the package look up the citations online, similar to the easyPubMed package, which queries the PubMed API; it might still provide some added benefit/performance compared to easyPubMed if we can make a more barebones solution. 2) Download and clean the data, put the data/package on GitHub or some other hosting service, and ask DST to download it.
I don't know if DST can be asked to download the whole PubMed/MEDLINE library and put it on their network drive, but in that case the package could just be directed to the local folder. Alternatively, with the right hosting, maybe the two options can be combined: the package looks up the barebones/cleaned citations online (making the package fit on CRAN), while the citations can also be downloaded and hosted locally, with the package directed to look them up there (making it accessible to offline environments). The downside is that the local contents would need to be re-downloaded routinely to stay up to date.
Don't know if the purpose is too niche to justify a solution. I'll try to look into it, at least.
I spent half a day reading up on XML files and started a repo: https://github.com/Aastedet/pubmedciteR
I think it's doable, but the PubMed XML files are tricky. I'll see if I can put a few more hours into it; then it should be possible to at least create an R object with citations.
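As a starting point, here's a first-stab sketch of extracting citation fields with the xml2 package. The element paths follow the PubMed/MEDLINE DTD as I understand it, so they should be double-checked against a real file (some records use MedlineDate instead of Year, for example); `pubmed-sample.xml` is a hypothetical downloaded file:

```r
library(xml2)

doc <- read_xml("pubmed-sample.xml")
articles <- xml_find_all(doc, ".//PubmedArticle")

# One row per article; xml_find_first() returns the first match per node,
# so the vectors line up across fields
citations <- data.frame(
  pmid    = xml_text(xml_find_first(articles, ".//MedlineCitation/PMID")),
  title   = xml_text(xml_find_first(articles, ".//Article/ArticleTitle")),
  journal = xml_text(xml_find_first(articles, ".//Journal/Title")),
  year    = xml_text(xml_find_first(articles, ".//PubDate/Year"))
)
```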
General aim: Build an R package(s) that automates or streamlines some basic setup, open science, reproducibility, general workflow, and organizational tasks.
Tasks to do (2022-01-17 session)

Before session

During session

- Assign yourself to one of these tasks that you want to do/work on.
- Create an R file (e.g. with `use_r('FILENAME')`), create a function inside, and add Roxygen documentation. Refer to the R Packages chapter on R code for more help.
- Add your reflections to the `vignettes/articles/reflections/YOURNAME.md` file, so we can use these thoughts to add to and refine how we work together, to see what works and what could be improved.

Tasks to do (2022-02-21 session)

- Assign yourself to one of these tasks that you want to do/work on.
- Add your reflections to the `vignettes/reflections/YOURNAME.md` file, so we can use these thoughts to add to and refine how we work together, to see what works and what could be improved.