Thanks for putting this together, @jgallowa07.
I suspect that we will find it easier to use the pre-built images for phipdata (https://quay.io/repository/biocontainers/bioconductor-phipdata?tab=tags) and that there's an appreciable chance that it would be harder to add phipdata to the phip-flow docker image. That's just based on my experience working with installing R dependencies in Docker in the past.
I'm not absolutely horrible with R, although it's not my strong suit. I'm happy to take the first crack at this if you like. Otherwise, if you want to make the first attempt and have me fill in the details that would be fine too. Just let me know your preference.
Oh, didn't catch the pre-built image. Great!
I'll let you take the lead, and come in whenever I can be most helpful. Thank you!
I'd love to add the BEER results upstream of the AGG subworkflow, so that those results can be wrapped into the reports that Terry is looking at. How hard do you think it would be to take the results from BEER and add them to the xarray?
Do you think that's something I could pass back to you once I've got BEER working, @jgallowa07?
@sminot Yes, reading data back into the xarray dataset and merging should be easy - and I'm happy to do that once you have some BEER results to confirm.
In this case the process rule would sit below (taking input from) `replicate_counts`, in parallel with `CPM` and `size_factors` in the STATS workflow. We could then pretty easily merge the results back in to sit nicely with everything else.
I just pushed a commit to the `add_beer` branch. I'm able to get the data set up as a PhIPData object. However, when I run the first step in the BEER tutorial (running edgeR) I get an error message:
```
Stop worker failed with the error: wrong args for environment subassignment
Error: BiocParallel errors
1 remote errors, element index: 1
11 unevaluated and other errors
first remote error:
Error in edgeR::exactTest(data): dispersion is NA
In addition: Warning message:
In estimateDisp.default(y = y$counts, design = design, group = group, :
There is no replication, setting dispersion to NA.
```
I also tried running BEER directly on the PhIPData object without first running edgeR, but I got a different error message. When I looked in the BEER source code to debug that error, I found that it was looking for the edgeR outputs.
@jgallowa07, do you think that it's possible that edgeR won't run on datasets which have only a single beads-only control, as this test dataset does? It might be worth digging in a bit to see if that is true, in which case we might need to make this conditional on >1 beads-only control. I just wanted to check and see what you thought before I started to go down that road.
I suppose I'll just punt on that question for now and use a test dataset for development which has >1 beads-only control.
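For reference, a minimal sketch of the guard discussed above, assuming a `PhIPData` object named `phip_obj` with the standard `group` column in its sample annotations (`runEdgeR()` comes from the `beer` package, `getBeadsName()` from `PhIPData`):

```r
library(PhIPData)
library(beer)

# edgeR estimates dispersion from replicated beads-only samples; with a single
# beads-only control it sets dispersion to NA and exactTest() errors out.
n_beads <- sum(phip_obj$group == getBeadsName())

if (n_beads < 2) {
  warning("Only ", n_beads, " beads-only sample(s) present; skipping edgeR/BEER.")
} else {
  phip_obj <- runEdgeR(phip_obj)
}
```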
Sorry - I'm out this week. I'll update the test dataset this upcoming week to include more than one beads-only sample. I'm not sure if BEER requires more than one mock IP, but I imagine that's the case given the approach. And if so, a simple requirement check would indeed be the route to take IMO. I'll look more closely early next week as well. Thanks so much, Sam!
In my local test run of the `add_beer` branch, I'm getting this error from BEER itself:
```
── Running JAGS ────────────────────────────────────────────────────────────────
Sample runs
Error in reducer$value.cache[[as.character(idx)]] <- values :
wrong args for environment subassignment
Calls: brew ... .bploop_impl -> .collect_result -> .reducer_add -> .reducer_add
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
1 parallel job did not deliver a result
```
This is the point in the project when we have to think long and hard about whether it's worth it to dig into this error and figure out whether it can be fixed and how. When you get a chance, if you could run your test data @jgallowa07 and see if you get a similar error, that would be greatly appreciated. It could be that this tool just isn't robust enough to adopt into our larger workflow.
@sminot - I've run your branch on a new subset of the data that included two beads_only samples, and I believe it ran to completion?
```
(base) quokka 40/aaf64e46b1cd03b9ff84f44476efdb » tail -n 15 .command.log
[1] "Building PhIPData object"
! Missing peptide start and end position information. Replacing missing values with 0.
[1] "Running edgeR"
[1] "Adding edgeR hits"
[1] "Setting up BEER"
[1] "Running BEER::brew"
── Running JAGS ────────────────────────────────────────────────────────────────
Sample runs
── Summarizing results ─────────────────────────────────────────────────────────
[1] "Getting matrix of peptides that were run"
[1] "Identifying super-enriched peptides"
[1] "Rerunning BEER"
```
and
```
(base) quokka 40/aaf64e46b1cd03b9ff84f44476efdb » tail -n 15 .command.err
The following objects are masked from ‘package:stats’:
    filter, lag
The following objects are masked from ‘package:base’:
    intersect, setdiff, setequal, union
! Missing peptide start and end position information. Replacing missing values with 0.
── Running JAGS ────────────────────────────────────────────────────────────────
Sample runs
── Summarizing results ─────────────────────────────────────────────────────────
```
The exit code was 140, which seems as if it may have something to do with exceeding resource allocations... but maybe it's simply that the script doesn't write the output file yet?
Oh, that's great!
Can you get me oriented to your test data setup? I'll switch to using that. And maybe we'll just make the workflow robust to the error case and have it omit the BEER results if they can't be made. But at least if I can use this test data then I can get the workflow set up correctly for the best case scenario.
I randomly sampled 2 library controls, 2 beads_only controls, 2 healthy, and 2 COVID-positive samples from the same Pan-CoV dataset described here. The full reads files were run to give me the results posted above. I took the first 12000 lines (3000 reads) of each of these and pushed them here. I'm running now on the cluster, as before -- and no failure yet. But it's been running for a couple of hours. I'm assuming my previous run from above timed out, given the 140.
Quick update: the test set pushed to the `add_beer` branch reproduced the results above.
The pipeline was run from our group node, quokka, submitting to the gizmos like so:
```bash
#!/bin/bash
set -e
source /app/lmod/lmod/init/profile
module load Nextflow/22.04.0
module load Singularity
export PATH=$SINGULARITYROOT/bin/:$PATH
/usr/bin/time nextflow main.nf \
    --sample_table data/pan-cov-example-with-beads/sample_table.csv \
    --peptide_table data/pan-cov-example-with-beads/peptide_table.csv \
    -with-report output_local/nextflow_report.html \
    -profile cluster
```
When you say "reproduced the results", which results are you referring to? I'm getting the run to go through on my laptop so I can work out an approach for catching failure exceptions.
Sorry, I just mean the same log output up to "── Summarizing results ──" before the 140 exit code.
Well this is odd, I got the following error from your test dataset:
```
Error in reducer$value.cache[[as.character(idx)]] <- values :
wrong args for environment subassignment
Calls: brew ... .bploop_impl -> .collect_result -> .reducer_add -> .reducer_add
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
1 parallel job did not deliver a result
Execution halted
Error while shutting down parallel: unable to terminate some child processes
```
I think I'll try running it on the cluster next
Huh. Likewise, I’ll try running locally.
@jgallowa07, what parameters are you using for this test dataset?
All defaults except pointing to the new sample/peptide table.
Hey @sminot, late follow-up as we're finalizing a re-submission of the manuscript.
I took another swing at running BEER locally on a test set (without library samples) for a little longer, and it looks like it was successful after about 2 1/2 hours (!). I pushed the test dataset to the `add_beer` branch.
For reference, I ran the current bleeding edge of `add_beer` like so:
```bash
nextflow run main.nf \
    -profile docker \
    --sample_table data/pan-cov-example-with-beads-no-lib/sample_table.csv \
    --peptide_table data/pan-cov-example-with-beads-no-lib/peptide_table.csv \
    --output_tall_csv true \
    --output_wide_csv true \
    --results $(date -I)-test
```
Note that the profile for batch submission (`-profile cluster`) defaults the time limit to 1 hr, so that would need to be changed if running there.
The stdout for `run_beer` looked like:

```
[1] "Building PhIPData object"
[1] "Running edgeR"
[1] "Adding edgeR hits"
[1] "Setting up BEER"
[1] "Running BEER::brew"
[1] "Getting matrix of peptides that were run"
[1] "Identifying super-enriched peptides"
[1] "Rerunning BEER"
```
And the Nextflow error was:

```
Error executing process > 'STATS:beer_enrichment:run_beer (1)'

Caused by:
  Missing output file(s) `beer_results.tsv` expected by process `STATS:beer_enrichment:run_beer (1)`

Command executed:
  run_beer.Rscript
```
Obviously that's because there's no code for writing that TSV currently. So this looks good (although I haven't looked at any of the actual results)!
While this analysis takes a long time to run (which they note in the BEER paper), I suppose it's probably worth finishing the integration here? I'm happy to spend a little time to do this (shouldn't take long) but wanted to see if y'all had any thoughts before moving ahead (cc @matsen @ksung25).
Sam, AFAICT this would include:

1. Taking the final `phip_obj` and writing it to binary using `saveRDS()` (optionally?).
2. Writing the expected `beer_results.tsv` output from the `beer_enrichment` workflow (see the sketch below).

After that I think the pipeline should handle merging it with the rest of the stats and outputting tall / wide / pickle xarray with all the edgeR/BEER results tied in.
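A rough sketch of what the tail end of the `run_beer` script could look like under that plan (the `beer_results.tsv` name matches what the process already expects; the `"prob"` assay name is an assumption, so check `assayNames()` on the object BEER returns):

```r
library(SummarizedExperiment)

# Persist the full PhIPData object so users can inspect or rerun BEER themselves.
saveRDS(phip_obj, "phip_obj.rds")

# Flatten one of the result assays into the flat file the Nextflow process
# declares as output; "prob" is a placeholder assay name.
beer_hits <- as.data.frame(assay(phip_obj, "prob"))
write.table(beer_hits, "beer_results.tsv",
            sep = "\t", quote = FALSE, col.names = NA)
```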
Hey @jgallowa07, I'm really glad to hear that you were able to make it further through this process. I've made a clean testing environment with your updated `add_beer` branch and am testing it out locally as well.
I like the idea of continuing the integration, and your steps seem reasonable.
The one piece of architecture that I'd like to make sure we have a plan for is robust error handling. Based on our experience with BEER up until this point, I don't think I have a good idea of when it will or will not complete successfully. To make the user experience more streamlined, I would advocate for a setup where a failure at the BEER step is handled gracefully and the rest of the workflow can just proceed without it.
That sort of conditional functionality is not incredibly hard to implement in Nextflow, but I thought we should talk over that high-level goal before digging into the implementation.
What do you think about making BEER a nice-to-have but not required output from the pipeline?
Awesome - thanks for the sanity check :crossed_fingers:! And I absolutely agree, no requirement of success needed. Off the top of my head, I think we could do one or more of the following to achieve this behavior:

- Use `errorStrategy 'ignore'` - and maybe point the user to the working directory for that process in case they want to investigate further? Or maybe we just let NF do that for them? You probably know what's best here.
- I've found that running `edgeR` is fast, and apparently produces good results on its own, so maybe we should separate that from the BEER run process? In the paper they note:

> Since edgeR does not rely on MCMC procedures and thus generates inference faster than BEER, we further investigated the performance of edgeR analyzing PhIP-Seq data in a two-group comparison with the mock IP samples in one group and a single serum sample in the other group (assuming equality of group variances) and found that BEER and edgeR are equally effective in detecting moderately and strongly reactive peptides, but BEER is needed to detect weakly reactive peptides
> PhIP-Seq experiments require the use of negative controls (so-called ‘beads-only’ or ‘mock immunoprecipitations’ lacking antibody input), which are included as 4–8 wells of a 96-well plate, and are compared to the read counts from each individual serum sample on the plate.
I like the idea of running edgeR independently and outputting those results regardless of whether BEER finishes.
Using `errorStrategy 'ignore'` is definitely the right approach, with two considerations that come up for me:

1. Since the 140 exit code suggests the failures may come from exceeding resource limits, it might be worth `errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'ignore' }`. Obviously this works best with autoscaling resources (e.g. `memory { 8.GB * task.attempt }` and `cpus { 4 * task.attempt }`).
2. Downstream processes need to tolerate the missing output, e.g. by using `.collect()` on the upstream channel and making the consuming process deal explicitly with the empty-inputs case. Looking at the current code, I think this should already be handled pretty well. We just need to make sure that the `merge_binary_datasets` task works equally well with and without the BEER results being created upstream.

Another question -- how much of this were you planning on tackling vs. needing my help on? I'm invested in helping you get this wrapped up, but I also don't need to take anything away from you that you were already planning to do.
Amazing, I hadn't thought about either of those!
> Another question -- how much of this were you planning on tackling vs. needing my help on?
Thanks Sam :) I am quite busy atm so any amount that you can tackle would be so so helpful. I'll probably have some time by Tuesday - but the branch is all yours until then if you have time.
Another thought: while I simply removed the library samples from the test dataset, we should probably avoid running either `edgeR` or `BEER` on those "ctrl" samples when they are indeed present in the dataset?
Here's where I get worried -- my 'sanity check' run of the clean `add_beer` branch using the command you ran resulted in an error which did not produce any output inside the BEER task.
Since everything is working inside Docker and the inputs are the same, I'm at a loss to explain why this would work on your computer but not on mine.
An idle thought is that perhaps you have more resources allocated to Docker on your computer, and my failure may actually be an autokill from the Docker daemon which is not being reported out due to the mclapply wrapper. Here's my Docker resource allocation, in case yours is quite different:
Logs:

```
Digest: sha256:689897f1d177b6bb214cf71870ca064b85b6e53680b0b1525eab1e9cbc498129
Status: Downloaded newer image for quay.io/biocontainers/bioconductor-beer:1.2.0--r42hdfd78af_0
Loading required package: PhIPData
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: ‘MatrixGenerics’
The following objects are masked from ‘package:matrixStats’:
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘Biobase’
The following object is masked from ‘package:MatrixGenerics’:
rowMedians
The following objects are masked from ‘package:matrixStats’:
anyMissing, rowMedians
Loading required package: rjags
Loading required package: coda
Linked to JAGS 4.3.0
Loaded modules: basemod,bugs
Attaching package: ‘dplyr’
The following object is masked from ‘package:Biobase’:
combine
The following objects are masked from ‘package:GenomicRanges’:
intersect, setdiff, union
The following object is masked from ‘package:GenomeInfoDb’:
intersect
The following objects are masked from ‘package:IRanges’:
collapse, desc, intersect, setdiff, slice, union
The following objects are masked from ‘package:S4Vectors’:
first, intersect, rename, setdiff, setequal, union
The following objects are masked from ‘package:BiocGenerics’:
combine, intersect, setdiff, union
The following object is masked from ‘package:matrixStats’:
count
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
[1] "Building PhIPData object"
! Missing peptide start and end position information. Replacing missing values with 0.
[1] "Running edgeR"
[1] "Adding edgeR hits"
[1] "Setting up BEER"
[1] "Running BEER::brew"
── Running JAGS ────────────────────────────────────────────────────────────────
Sample runs
Error in reducer$value.cache[[as.character(idx)]] <- values :
wrong args for environment subassignment
Calls: brew ... .bploop_impl -> .collect_result -> .reducer_add -> .reducer_add
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
1 parallel job did not deliver a result
Execution halted
```

Any ideas, @jgallowa07?
No, unfortunately. It seems like you have plenty of resources. I only saw 2 CPUs being used and not more than a few gigs of RAM. Happy to run it with the Nextflow report and send it along if you're curious.
I just use Docker Desktop on Ubuntu; here's what `docker info` gives me:

```
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.5.0-docker)
Server:
Containers: 7
Running: 0
Paused: 0
Stopped: 7
Images: 35
Server Version: 20.10.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.4.0-148-generic
Operating System: Ubuntu 20.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 24.44GiB
Name: ubuntu
ID: HKZQ:HQYQ:4NG6:4U66:4VOB:3Q43:QWYT:4TO7:IWQT:YNUL:YTEI:ORKD
Docker Root Dir: /var/lib/docker
Debug Mode: false
Username: jgallowa
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
WARNING: No blkio weight support
WARNING: No blkio weight_device support
```

Unless this tells you something obvious, we should just run edgeR and output the PhIPData object for the users to run BEER themselves. That would be sufficient for the MS. Sorry for the headache!
Maybe this is relevant.

Doing `brew(sim_data, BPPARAM = BiocParallel::SerialParam())` is how you force the process not to multithread, which may be what's causing issues for you? Although my BEER seems to be running fine on 2 cores.
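If we did keep BEER in the pipeline, one way to use that would be a serial fallback when the forked workers die. A sketch only, not tested against this particular failure mode:

```r
library(BiocParallel)

# Try brew() with its default parallel backend first; if the fork-based
# workers fail (as in the mccollect() error above), retry single-threaded.
beer_out <- tryCatch(
  beer::brew(phip_obj),
  error = function(e) {
    message("brew() failed in parallel, retrying serially: ", conditionMessage(e))
    beer::brew(phip_obj, BPPARAM = SerialParam())
  }
)
```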
That's all I can think of - but it seems like this is enough of a pain that we should just export the R object and call it a wrap.
I'm inclined to agree. And even if we figure out how to make my system compatible, I suspect that we'd be exposing ourselves to all sorts of error reports from users coming from the BEER step. Let's export the data so that they can run BEER themselves, and then they won't submit bug reports here when they have trouble. That way we also don't have to work on keeping this workflow in sync with updates to BEER, which will likely be updated considerably if it gets traction.
Totally. Thank you for sanity checking!
Bringing this thread back to life again to address @matsen's comments from Slack:
> I do feel like we're leaving things in a weird state, that there is something that would seem great to incorporate but we aren't including it because we have some hiccup that we haven't even raised on their side. If it was producing bad results or had problematic behavior that would be something else. You don't think?
My response is based on the discussion above with @sminot.
Even while it does run for me, it is very slow (~3 hrs for MCMC on 4 empirical samples). IMO this is an additional hiccup that changes things a bit. The authors mention this a few times in the paper, stating that edgeR is much faster and sufficient for detecting moderately-to-strongly reactive peptides. In my mind, that is enough justification to include the option to run edgeR and not BEER. If it's easy for folks to load in the RDS and run BEER in their own environment, then we may even save the user some headache, because their environment can use resources more flexibly than the pipeline can (in the case of using containers, batch submission, etc.).
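The user-side hand-off would then be small. A sketch, assuming the pipeline exports the object as `phip_obj.rds` (the file name is an assumption):

```r
# Load the PhIPData object exported by phip-flow and run BEER locally, where
# resources can be tuned freely (e.g. more workers, longer wall time).
phip_obj <- readRDS("phip_obj.rds")
beer_out <- beer::brew(phip_obj,
                       BPPARAM = BiocParallel::MulticoreParam(workers = 4))
```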
Conversely, if we include the option to actually run BEER, we would need to:
@sminot Sorry to re-raise this issue (for a 3rd time? lol), but could you summarize your thoughts in response to this? Thanks, all.
My larger perspective on this question is that BEER represents a significant amount of technical debt, and I don't see that it clearly adds enough value to warrant taking that on.
The situations I foresee arising from the inclusion of BEER which could take up our time in the future are when a user selects the option to run BEER and:
Put another way, if we add BEER it will instantly become the most breakable and troublesome part of the whole pipeline. It's a recently developed piece of software, which is never a good sign. We don't have control or understanding of the source code. We aren't able to fix bugs that we find. We don't even have a good understanding of how much value it provides. Lastly, it's not like BEER is a well established standard in the field that people will expect us to support. This is not intended as a criticism of BEER specifically, just a practical skepticism of any newly built tooling.
If the goal of phip-flow is to provide a robust utility for processing PhIPseq data, then I think we'd be better served by leaving out BEER for now and reevaluating in a year or two to see if the method has been maintained and improved.
Looking forward to chatting over video, but the thing that stands out for me is that we are already running the BEER package in order to run edgeR.
BEER -> phip-flow (@sminot)
BEER has presented a very nice methodology for analyzing significant binding events. This method and software are described here. The documentation for the R interface can be found here. We would like to integrate this analysis into the phip-flow pipeline. Here's an outline of how I imagine this happening:
The process location in the DAG: A quick and easy way to go about this is to simply extend the DAG like the AGG subworkflow does. From here, we can pretty easily get all the inputs we need for the BEER analysis, i.e. counts for beads-only and empirical IPs. If we extend from the binary, we can use the Python API utilities to easily isolate the inputs for BEER as well. I'm agnostic as to whether or not we use the Python API to do this.
Dependencies: Looks like we will need the [PhIPData](https://www.bioconductor.org/packages/release/bioc/vignettes/PhIPData/inst/doc/PhIPData.html) and BEER R Bioconductor packages. I'm very unfamiliar with R packaging, but I imagine extending this image to include the R dependencies should not be too bad.
Transformation: The PhIPData object is a near-identical data structure to the xarray dataset we use in phippery (hilarious how similar their organization of the data is to ours) [screenshots: their PhIPData structure vs. our xarray structure]. So this shouldn't be hard. Essentially we need to read the sample, peptide, and peptide counts tables into an R script to be able to run BEER. I think this simply requires readr matrix objects from the wide output of phippery. I'm very bad with R, but it also seems that we may need to specify data types in each of the sample and peptide annotation tables when they're read in.
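A sketch of that transformation, assuming a wide counts CSV with peptide ids as row names and sample ids as column names, plus the phip-flow sample/peptide annotation tables (all file names and the `control_status` column here are assumptions about the table conventions):

```r
library(PhIPData)
library(S4Vectors)

# Read the wide counts matrix and the two annotation tables.
counts <- as.matrix(read.csv("counts.csv", row.names = 1, check.names = FALSE))
sample_meta  <- read.csv("sample_table.csv",  row.names = 1)
peptide_meta <- read.csv("peptide_table.csv", row.names = 1)

# PhIPData requires a `group` column in the sample annotations, with
# beads-only samples labeled as getBeadsName() (default "beads");
# `control_status` is a hypothetical stand-in for the real column name.
sample_meta$group <- ifelse(sample_meta$control_status == "beads_only",
                            getBeadsName(), sample_meta$control_status)

phip_obj <- PhIPData(counts = counts,
                     sampleInfo = DataFrame(sample_meta),
                     peptideInfo = DataFrame(peptide_meta))
```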
Output: Let's just stick with outputting whatever is easiest from the R pipeline for now? I imagine we could tie things back into the xarray dataset pretty easily, but this is not necessary for the manuscript re-submission.