Improve the documentation for `tidybulk`

stemangiola commented 1 year ago

improving examples for methods
solving documentation warnings (when packages are loading or installing)
Improving methods' argument description
improving the transparency of the method used in the backend in the description of the methods

aliosmancetin commented 1 year ago

Hi, I'd like to contribute this one, but I am not sure how it should be done or when is the deadline. I am a medical doctor but currently a PhD student as a computational biologist of my group. I know transcriptomics analysis with bioconductor and I like tidy paradigm. Since this is labeled as a "good first issue", if there will be enough guidance, I can give it a try, what do you think?

stemangiola commented 1 year ago

Thanks @aliosmancetin! Nice to meet you.

I am not sure when is the deadline

There is a soft deadline to commit and start working, so we can form the team for the upcoming paper. But there is no hard deadline for completing the task. We trust the community on that side. Of course, each challenge should not drag too long, so to keep the ball rolling.

I am not sure how it should be done

Assign yourself to this issue,
Read about Roxigen2 https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html
Fork tidybulk
Look at all available methods in the R/methods.R script
For each of those methods, look at the documentation (e.g. ?test_differential_abundance)
For each argument in each method, judge if it is very, very easily understandable; if the answer is not (as I expect for the majority of them), try to improve it.
Is the output structure and class clearly described in the documentation? If not, you can improve that.
Are examples covering most argument combinations and use cases, if not you can add more examples

Then, other more challenging tasks are

solving documentation warnings (when packages are loading or installing)
Make sure that the backend code for each method is pasted and described in the @description part of the documentation to give transparency of what is happening underneath.
For all analysis methods, create messaging framework, to communicate to the user the operations that are running below (e.g, "tidybulk says: normalising with TMM", "tidybulk says: estimating .. with xxx")

Let me know if you want more guidance. @HelenaLC might be a good code reviewer, as she is an expert on R packages and standards.

aliosmancetin commented 1 year ago

Nice to meet you too! Thanks for your comments and outline. It seems doable for me, I'll start asap.

aliosmancetin commented 1 year ago

I couldn't figure out how to assign myself. Should ".take" work or not?

stemangiola commented 1 year ago

I assigned you. Not sure if users can choose their name from the drop-down menu

Looking forward to you PR!

stemangiola commented 1 year ago

Hello @aliosmancetin , just to do some planning on our side.

Do you have a timeline already to work on this challenge?

A note: for the tidy adapters, please follow the tidySingleCellExperiment package, which was very recently revamped.

aliosmancetin commented 1 year ago

Hi @stemangiola,

If you refer very structured work-plan with "timeline", no, I don't have unfortunately 😅 But, I've already started to read must-read documents. I am willing to commit some work on this issue, but I am not sure how long it takes.

stemangiola commented 1 year ago

Hi @stemangiola,

If you refer very structured work-plan with "timeline", no, I don't have unfortunately 😅 But, I've already started to read must-read documents. I am willing to commit some work on this issue, but I am not sure how long it takes.

I assigned you prematurely. Now I have emptied the assignee field, so you can assign yourself when you will be ready, with no rush :)

aliosmancetin commented 1 year ago

I assigned you prematurely. Now I have emptied the assignee field, so you can assign yourself when you will be ready, with no rush :)

Okay, let me focus on this one more week. Then we can discuss my progress and whether it's going to work or not.

stemangiola commented 1 year ago

@GrootJ welcome to this issue! As this issue is multifaceted, we can spread the work with other devs (@aliosmancetin potentially).

Please have a look at the description for the various goals. Sure, we can have a short Zoom now or any other time. I will be in NY until tomorrow

aliosmancetin commented 1 year ago

Hi @stemangiola,

I saw you assigned me again. I think I figured out how generally tidybulk and tidySummarizedExperiment work but I have couple of questions and I need some confirmations if I get this right. Then I could start improving the documentation. Do you prefer doing this under this issue or with email (questions may sound easy and basic, I don't know).

stemangiola commented 1 year ago

I was reorganising things and I figured out it was more informative to put potential contributors anyway :)

feel free to ask here. So other contributors will be able to learn as well.

GrootJ commented 1 year ago

@GrootJ welcome to this issue! As this issue is multifaceted, we can spread the work with other devs (@aliosmancetin potentially).

Please have a look at the description for the various goals. Sure, we can have a short Zoom now or any other time. I will be in NY until tomorrow

@stemangiola - got more set up, about to start on this - i will follow suggestions you provided to @aliosmancetin. I forked tidybulk (main) - and opened R/methods.R and went through the readme I also want to run through the README.Rmd workflow myself to make sure functions do/what steps they are used. running README.Rmd seems to require a fork of dev as well? (please note I am newer to the development aspects here) - open to suggestions either way - happy to connect, also on how to split up documentation editing amongst us..

cheers, Joost

stemangiola commented 1 year ago

Thanks @GrootJ and @aliosmancetin, for the enthusiasm! tidybulk really needed this. Welcome to the team and to the publication ;)

We can divide the tasks so @GrootJ and @aliosmancetin can pick and work independently.

If possible, do this with the perspective of a new user who should find everything intuitive and clear and should feel tidybulk as a black box (a thing that more experienced people often criticise). So tidybulk can be not only convenient but also an educational tool.

1) relevant for dplyr_methods.R, tidyr_methods.R Clean dplyr, tidyr methods replicating the new tidySingleCellExperiment roxygen style that @HelenaLC did. 2) relevant for methods.R Improve roxygen documentation (arguments names, examples across all possible backend methods) of the analysis methods. 3) relevant for methods.R, as SE_methods.R use the same method list Improve the analysis methods roxygen description, exposing some of the backend code (see what I did for test_differential_abundance, that can also be improved), so users are aware of what doce functions (e.g. deseq2, edger, GSEA, cibersort, etc..) are used. This would include references to papers on the methods in the documentation section. 4) relevant for methods.R and SE_methods.R, as both need messaging system Improve/create a messaging system inside analysis methods that update the user on what is going on underneath. We need to choose a messaging function. (message() I think prints just at the end, so it would not be suitable. cat() cannot be easily suppressed, so it is not suitable. But there is something I might be missing). The messages would be of this nature:

tidybulk says: scaling using TMM, or
tidybulk says: scaling using upper quartile, and
tidybulk says: calculating mean-variance trend using ...
tidybulk says: fitting using ...
tidybulk says: testing gene set enrichment using ...

You can put your name next to 1, 2, 3, 4 so we know what we are working on.

aliosmancetin commented 1 year ago

Now this task is more doable for us, welcome @GrootJ :)

@stemangiola At this point, my question is, should we focus on "methods.R" or as well as "methods_SE.R". As far as I understand, methods directly work on SummarizedExperiment should be more important now.

Other question, you refer to "analysis methods". Are those in "methods.R", "methods_SE.R" files or "functions.R", "functions_SE.R" files?

I may have a suggestion about documentation, we may include a kind of "developer guide" to documentation. Maybe experienced users/developers can easily understand how tidybulk works (classes, generics, methods and backend implementation etc.) however, it took me some time to understand how it works exactly (after I saw there are feature__ and sample__ variables directly in utilities, it became easier :) ). At this point, I know tidyadapters generally work similarly, maybe other developers are working on this right now, I am not sure.

stemangiola commented 1 year ago

should we focus on "methods.R" or as well as "methods_SE.R".

Good point, I edited the to-do list specifying which files they apply to

As far as I understand, methods directly work on SummarizedExperiment should be more important now.

I would not say so. Still, we have to give equal importance to the tibble and SummarizedExperiment interfaces. They should behave consistently. (bare in mind that methods_SE.R does not have documentation and relies on methods.R)

Other question, you refer to "analysis methods". Are those in "methods.R", "methods_SE.R" files or "functions.R", "functions_SE.R" files?

"methods.R", "methodsSE.R" (I specified now in the to-do list). `functions*.R` are only internal, definitely a lower priority. Although documented code (even if internal) will be appreciated by future developers.

I may have a suggestion about documentation, we may include a kind of "developer guide" to documentation. Maybe experienced users/developers can easily understand how tidybulk works (classes, generics, methods and backend implementation etc.) however, it took me some time to understand how it works exactly (after I saw there are feature__ and sample__ variables directly in utilities, it became easier :) ). At this point, I know tidyadapters generally work similarly, maybe other developers are working on this right now, I am not sure.

Amazing point. This is definitely a new/different "#tidyomics open challenge". I see this to be developed as a blog post and vignette as a guide. You @aliosmancetin and @GrootJ are great judges of the most confusing, less documented and intuitive aspects of the code base, on the front and back end. You could open a challenge/issue, and start listing all confusing stuff that you are figuring out with sweet and tiers, and that can be the material for a vignette/blog post.

maybe other developers are working on this right now, I am not sure.

None has tackled the developer side, but it is a good idea.

stemangiola commented 1 year ago

@aliosmancetin and @GrootJ when you manage to do your first commit, please add your authorship details here https://docs.google.com/spreadsheets/d/19XqhN3xAMekCJ-esAolzoWT6fttruSEermjIsrOFcoo/edit?usp=sharing

stemangiola commented 1 year ago

Hello Fellas,

@chilampoon would be on their way to resolve the conflicts of tidybulk for tidyomics package. (fixing roxigen for dplyr_methods.R and tidyr_methods.R)?

Is any of you (@aliosmancetin and @GrootJ ) actively working on this? I don't want to create duplications.

aliosmancetin commented 1 year ago

Hi @stemangiola,

I am not working on solving the conflicts.

GrootJ commented 1 year ago

Hi @stemangiola, others - I am not working on resolving those conflicts either (item 1 of your list above).

I am making some headway to get a bit better grip on some of the ongoing conversations you guys have (around tidybulk package development, tidiness in omics goals in general). I test-ran 2 tidybulk workflows (README.Rmd with se_mini dataset and diff feature abundance from the manuscript with the pasilla dataset) which gave me a better idea of the functions and documentation in there.

I think I see the opportunities for documentation improvement you mentioned (item 3 of your list above). E.g. adjust_abundance function documentation lists inputs and outputs but does not mention batch correction (using sva which it seems to auto-install+load) it seems to perform (between single and paired reads - type in the pasilla dataset). Is this is an example of documentation you like to get clarified? How to start writing such improvement also depends of on from which new user's perspective it would need to be made "easily" understandable; new users w basic familiarity with RNA-Seq and tidyverse?

getting the backend code exposed as you suggested sounds like I plan (so we can figure out what is under the hood)
getting references to source methods used in the function sounds like a plan
perhaps reference or even descriptions from those source methods (e.g. as with combat?)

Probably good to touch base on this (virtual chat or zoom?) and how we wil go about that (in version control, who which functions etc). i still need to get more familiar with roxigen too (not sure if that would fit the timeline for your publication?)

stemangiola commented 1 year ago

E.g. adjust_abundance function documentation lists inputs and outputs but does not mention batch correction (using sva which it seems to auto-install+load) it seems to perform (between single and paired reads - type in the pasilla dataset). Is this is an example of documentation you like to get clarified?

Yes, correct

How to start writing such improvement also depends of on from which new user's perspective it would need to be made "easily" understandable; new users w basic familiarity with RNA-Seq and tidyverse?

Very basic knowledge. But we should not rely on any jargon or abbreviation. Basically the English dictionary should be enough to understand tidybulk and the underlying methods.

getting the backend code exposed as you suggested sounds like I plan (so we can figure out what is under the hood)

getting references to source methods used in the function sounds like a plan

perhaps reference or even descriptions from those source methods (e.g. as with combat?)

Agree

Probably good to touch base on this (virtual chat or zoom?) and how we wil go about that (in version control, who which functions etc). i still need to get more familiar with roxigen too (not sure if that would fit the timeline for your publication?)

Sure. Where are you based? I could do it in 2 hours from now.

GrootJ commented 1 year ago

E.g. adjust_abundance function documentation lists inputs and outputs but does not mention batch correction (using sva which it seems to auto-install+load) it seems to perform (between single and paired reads - type in the pasilla dataset). Is this is an example of documentation you like to get clarified?

Yes, correct

How to start writing such improvement also depends of on from which new user's perspective it would need to be made "easily" understandable; new users w basic familiarity with RNA-Seq and tidyverse?

Very basic knowledge. But we should not rely on any jargon or abbreviation. Basically the English dictionary should be enough to understand tidybulk and the underlying methods.

Check - clearly an educational goal as well which can synergize with the laudable further standardization/tidy-ing you guys are working on. I guess there is a trade-off between level of summarization of documentation (and functions) where advanced users may want more technical detail (in functions I guess configuration options could deal with that). But from the sound of it, the aim would be plenty of "basic knowledge" users still learning about RNA-Seq in general.

Some initial ideas:

add a bit more extensive workflow example that also serves as an RNA-Seq "best practices" example that touches upon a variety of common data processing steps (implementation eased by tidybulk); taking away some "education" from function documentation, with cross-references of the workflow example in the function documentation.
start from tibble format inputs, add/explain summarized experiment (benefits) further down or separately
use more recent datasets representative of what many users will run into (e.g. paired-end read data processed with common NGS pipelines, with some known/established differential processes, ideally some multi-group)

Perhaps some of that sounds obvious - just trying to organize thoughts, get some initial feedback

Here 2 initial ideas for datasets (again, open to feedback/other suggestions)

GSE62944 TCGA data (eg BRCA tumor vs adj.normal) http://bioconductor.org/packages/release/bioc/vignettes/GSEABenchmarkeR/inst/doc/GSEABenchmarkeR.html#rna-seq-compendium
macrophage activation https://bioconductor.org/packages/release/data/experiment/html/macrophage.html

getting the backend code exposed as you suggested sounds like I plan (so we can figure out what is under the hood)

getting references to source methods used in the function sounds like a plan

perhaps reference or even descriptions from those source methods (e.g. as with combat?)

Agree

Probably good to touch base on this (virtual chat or zoom?) and how we wil go about that (in version control, who which functions etc). i still need to get more familiar with roxigen too (not sure if that would fit the timeline for your publication?)

Sure. Where are you based? I could do it in 2 hours from now.

I am on the US east coast - you're back in Melbourne again? @aliosmancetin - where are you?

stemangiola commented 1 year ago

Great initiative!

add a bit more extensive workflow example that also serves as an RNA-Seq "best practices" example that touches upon a variety of common data processing steps (implementation eased by tidybulk); taking away some "education" from function documentation, with cross-references of the workflow example in the function documentation.

We don't necessarily want to suggest the best way to do analysis. In this first instance just give transparency of what is happening underneath.

start from tibble format inputs,

Although we are still supporting tibble input, we are not recommending it anymore, as the tidyomics uses SummarizedExperiment as the backend.

add/explain summarized experiment (benefits) further down or separately

Probably we don't want to go that basic.

Let's take this task as an onion-like. We can first tackle the high-return part, which exposes the backend code (just the really relevant part of the method called) in the @ description part.

Maybe let's start from the adjust_abundance, and improve the transparency of that, so I can give some feedback, and we will be aligned for the rest of the methods.

Here 2 initial ideas for datasets (again, open to feedback/other suggestions)

Thanks. Let's tackle this as a later task.

Having said all that. I think in the future, after tidybulk will be extremely well documented, we could

rewrite some of the books "Orchestrating ..." using tidy tools.
Add deeper educational material in the tidybulk repo itself.

One step at a time, we will climb the mountain ;)

stemangiola commented 1 year ago

@GrootJ I have added combat_seq to the adjust_abundance in the new master.

aliosmancetin commented 1 year ago

Hi guys @stemangiola @GrootJ,

Sorry for my late response. I was very busy recently, I am trying to move to new apartment. By the way, I am in Germany.

As far as I understand, first step we should do is to improve @description. It looks like most straightforward I think and requires less creativity more work. @GrootJ lets split up the methods that we will work on.

@stemangiola previously you mentioned documentation is based on methods.R. Isn't methods_SE.R included at all? This still confuses me because as you said before, although tidybulk is still supported, it is not recommended. Maybe default documentation should be changed with methods_SE.R. Tidybulk related differences or notes could be mentioned under @details section.

stemangiola commented 1 year ago

@stemangiola previously you mentioned documentation is based on methods.R. Isn't methods_SE.R included at all? This still confuses me because as you said before, although tidybulk is still supported, it is not recommended. Maybe default documentation should be changed with methods_SE.R. Tidybulk related differences or notes could be mentioned under @details section.

Yes I should have been clearer. We have three elements in tidybulk

1) Generic methods (http://adv-r.had.co.nz/S4.html) 2) tibble methods (also including the tidybulk enhanced tibbles) 3) SummarizedExperiment methods

The documentation is relative to (1). (2) and (3) do not need documentation, and they use the same underlying analysis methods, so the documentation for (1) will apply to (2) and (3)

stemangiola commented 10 months ago

Any news team?

chilampoon commented 9 months ago

Any news team?

will update soon, any new requests?

aliosmancetin commented 9 months ago

Any news team?

From my side, I don't have unfortunately.

stemangiola / tidybulk

Improve the documentation for `tidybulk` #282