Creating a Dockerfile for using Librarian `--local` mode

kartva / Librarian

A tool to predict the sequencing library type from the base composition of a supplied FastQ file.

https://kartva.github.io/Librarian/

GNU General Public License v3.0

Creating a Dockerfile for using Librarian `--local` mode #17

Open a-frantz opened 1 year ago

a-frantz commented 1 year ago

I am trying to incorporate librarian into our quality-check-standard pipeline, but there doesn't seem to be a feasible way to run this at scale.

I assume if I launched a few thousand concurrent jobs pointed at the Babraham server, I'd either get blocked or crash the server.
I'd have similar concurrency issues setting up my own server.
Requiring a separate server defeats the design ethos of our pipeline, which runs with minimal set-up and dependencies.

I was excited to see the documentation mention the --local option, because it would permit the large-scale runs I'm planning. All I'd need is a Docker image with the CLI and its dependencies, and I'd be ready to run this on thousands of samples. However, when I started playing around with the CLI, I realized the --local option isn't present.

Just a suggestion, but I'd recommend publishing the CLI to the Bioconda repository. It would make installation for your users much simpler. (And it would make my life easier, as I wouldn't have to maintain my own Docker image of your tool like I was planning).

kartva commented 1 year ago

Hello @a-frantz! I'm glad to see you found Librarian useful enough to consider integrating into your quality-check pipeline. I had not yet created a release of the CLI that contained the --local option. I have done that now, so I hope that works for you.

I still need to confirm whether the Docker image runs since I haven't touched it in a long time.

a-frantz commented 1 year ago

Thank you for the quick response!

$ ~/tmp/librarian --local WGS_batch1_workdir/CCSS/GRCh38/FINAL/WHOLE_GENOME/BAM/SJGENLK053238_G1.WholeGenome/20230818_042451_quality_check/call-collate_to_fastq/work/SJGENLK053238_G1.WholeGenome_R1.fastq.gz
INFO [librarian] Processing "WGS_batch1_workdir/CCSS/GRCh38/FINAL/WHOLE_GENOME/BAM/SJGENLK053238_G1.WholeGenome/20230818_042451_quality_check/call-collate_to_fastq/work/SJGENLK053238_G1.WholeGenome_R1.fastq.gz"
WARN [librarian] Fewer valid reads (197) in sample "WGS_batch1_workdir/CCSS/GRCh38/FINAL/WHOLE_GENOME/BAM/SJGENLK053238_G1.WholeGenome/20230818_042451_quality_check/call-collate_to_fastq/work/SJGENLK053238_G1.WholeGenome_R1.fastq.gz" than recommended (100,000) (this may be due to reads being filtered out due to being shorter than 50 bases)
ERROR [server] Rscript failed with status exit status: 126
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RExit', cli/src/bin/librarian.rs:167:30
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Is librarian supposed to be able to handle gzipped FastQs? I can't find a clear answer in the documentation about whether I need to uncompress my files first.

Do you plan to release a Docker image with all the dependencies for running in --local mode? If so, can I recommend hosting it on the GitHub Container Registry instead of docker.io? We've had to stop using images hosted on docker.io in our production pipelines because of rate-limiting issues. ghcr.io users will have unlimited pulls (I believe the same is true for quay.io if you have an issue with ghcr.io).

a-frantz commented 1 year ago

I attempted to create a Docker image capable of running in --local mode, but I'm facing errors. Here's the Dockerfile I wrote:

FROM ubuntu:22.04

ENV DEBIAN_FRONTEND noninteractive

WORKDIR /usr/local/bin

RUN apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y wget r-base-core r-base-dev libssl-dev libcurl4-openssl-dev libxml2-dev \
    && Rscript -e 'install.packages(c("tidyverse", "umap", "ggrastr", "remotes", "rmarkdown"))' \
    && Rscript -e 'remotes::install_github("rstudio/pins")' \
    && wget https://github.com/DesmondWillowbrook/Librarian/releases/download/v1.1.0/librarian.tar.gz \
    && tar -xzf ./librarian.tar.gz

And here's the error I'm getting while running it:

$ docker run -v `pwd`:`pwd` -w `pwd` -t -e RUST_BACKTRACE=full librarian/1.1.0 librarian --local tmp.fastq
INFO [librarian] Processing "tmp.fastq"
ERROR [server] Rscript failed with status exit status: 126
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RExit', cli/src/bin/librarian.rs:167:30
stack backtrace:
   0:     0x7efecee453aa - std::backtrace_rs::backtrace::libunwind::trace::h9aa05f1e3ca324a0
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x7efecee453aa - std::backtrace_rs::backtrace::trace_unsynchronized::ha0db50eab803ada8
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7efecee453aa - std::sys_common::backtrace::_print_fmt::h3afc4c540c5a7d95
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x7efecee453aa - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h28fbb246bcdfddef
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x7efecee8566e - core::fmt::write::h40b82b205703350d
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/fmt/mod.rs:1213:17
   5:     0x7efecee3fdd5 - std::io::Write::write_fmt::h080da795e1c1d497
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/io/mod.rs:1682:15
   6:     0x7efecee45175 - std::sys_common::backtrace::_print::h7eaf5317453f8f87
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/sys_common/backtrace.rs:47:5
   7:     0x7efecee45175 - std::sys_common::backtrace::print::hff3c1cc68ba32752
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/sys_common/backtrace.rs:34:9
   8:     0x7efecee4695f - std::panicking::default_hook::{{closure}}::hbd2ec87f6f318469
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:267:22
   9:     0x7efecee4669b - std::panicking::default_hook::h18c9417d36edc8b3
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:286:9
  10:     0x7efecee47069 - std::panicking::rust_panic_with_hook::h372e37e60d4a59db
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:688:13
  11:     0x7efecee46e09 - std::panicking::begin_panic_handler::{{closure}}::ha85b34d847b9b8d4
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:579:13
  12:     0x7efecee4585c - std::sys_common::backtrace::__rust_end_short_backtrace::hd1dcbca66cbef135
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/sys_common/backtrace.rs:137:18
  13:     0x7efecee46b12 - rust_begin_unwind
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:575:5
  14:     0x7efeceabf523 - core::panicking::panic_fmt::hba5737630c5872d2
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/panicking.rs:64:14
  15:     0x7efeceabf9d3 - core::result::unwrap_failed::ha370c8c0925ccfea
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/result.rs:1790:5
  16:     0x7efecead01ad - librarian::main::hb5a1b9351bb98bef
  17:     0x7efecead8b83 - std::sys_common::backtrace::__rust_begin_short_backtrace::h45eac8e9e91b0391
  18:     0x7efeceac6529 - std::rt::lang_start::{{closure}}::h5dd4b1dbd57a7778
  19:     0x7efecee3b272 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h380d506f5c7e95b1
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/ops/function.rs:287:13
  20:     0x7efecee3b272 - std::panicking::try::do_call::ha6deefa4ce06476d
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:483:40
  21:     0x7efecee3b272 - std::panicking::try::ha2fc470c4501edef
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:447:19
  22:     0x7efecee3b272 - std::panic::catch_unwind::h2925049db6d1396f
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panic.rs:140:14
  23:     0x7efecee3b272 - std::rt::lang_start_internal::{{closure}}::h0ec8de9ef1d08d9e
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/rt.rs:148:48
  24:     0x7efecee3b272 - std::panicking::try::do_call::h6697fdb6f589d110
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:483:40
  25:     0x7efecee3b272 - std::panicking::try::h9d85f0456d7a332f
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:447:19
  26:     0x7efecee3b272 - std::panic::catch_unwind::h93152b3771a1d7bf
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panic.rs:140:14
  27:     0x7efecee3b272 - std::rt::lang_start_internal::hf92c8c2f2839001b
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/rt.rs:148:20
  28:     0x7efecead0aa5 - main

Any help would be appreciated! Thank you!

kartva commented 1 year ago

> Is librarian supposed to be able to handle gzipped FastQs?

It is. This is mentioned in the CLI's --help output:

Librarian CLI 1.1.0 A tool to predict the sequencing library type from the base composition of a supplied FastQ file. Uncompresses .gz files when reading.

The error you've encountered mentions RExit. The log messages also state that:

ERROR [server] Rscript failed with status exit status: 126

which leads me to believe the R script failed somewhere. I believe the reads were read correctly from the gzipped file, since the log messages also state the following:

WARN [librarian] Fewer valid reads (197) in sample ...

I currently do not have much of an idea of why the R script is failing. Can you try downloading the example files located in frontend/example_inputs.zip and using those to test the CLI? If the example files pass, would it be possible to share the temp file you are using to test the CLI?

The note about Docker registries is much appreciated. I'll also have a look at Bioconda.
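A side note on the status code: in POSIX shells, an exit status of 126 conventionally means that a command was found but could not be executed (for example, a script without execute permission), whereas 127 means the command was not found. Whether that is what happened here is an assumption, since the CLI log only reports the status itself. A minimal sketch that reproduces both codes with a throwaway temporary script (the function names are hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: print the status the shell reports when asked to
# run a file that exists but has no execute permission.
status_for_non_executable() {
    tmp=$(mktemp) || return 1
    printf '#!/bin/sh\nexit 0\n' > "$tmp"
    chmod 644 "$tmp"                 # readable, but no execute bit
    rc=0
    "$tmp" 2>/dev/null || rc=$?
    rm -f "$tmp"
    echo "$rc"
}

# Hypothetical helper: print the status for a command that does not exist.
status_for_missing_command() {
    rc=0
    /nonexistent/dir/no-such-command 2>/dev/null || rc=$?
    echo "$rc"
}
```

If the first case applies, checking the permissions of the scripts unpacked from the release archive would be a reasonable next step.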

kartva commented 1 year ago

Oh, and you can also set RUST_LOG=trace as an environment variable to turn on debug logs in the CLI.

RUST_LOG=trace librarian [files] ...

a-frantz commented 1 year ago

No dice on the example files either.

$ docker run -v `pwd`:`pwd` -w `pwd` -t -e RUST_BACKTRACE=full librarian/1.1.0 librarian --local example_inputs/*
INFO [librarian] Processing "example_inputs/ATAC.example.fastq"
INFO [librarian] Processing "example_inputs/RNA.example.fastq"
INFO [librarian] Processing "example_inputs/bisulfite.example.fastq"
ERROR [server] Rscript failed with status exit status: 126
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RExit', cli/src/bin/librarian.rs:167:30
stack backtrace:
   0:     0x7f509cbe83aa - std::backtrace_rs::backtrace::libunwind::trace::h9aa05f1e3ca324a0
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x7f509cbe83aa - std::backtrace_rs::backtrace::trace_unsynchronized::ha0db50eab803ada8
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7f509cbe83aa - std::sys_common::backtrace::_print_fmt::h3afc4c540c5a7d95
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x7f509cbe83aa - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h28fbb246bcdfddef
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x7f509cc2866e - core::fmt::write::h40b82b205703350d
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/fmt/mod.rs:1213:17
   5:     0x7f509cbe2dd5 - std::io::Write::write_fmt::h080da795e1c1d497
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/io/mod.rs:1682:15
   6:     0x7f509cbe8175 - std::sys_common::backtrace::_print::h7eaf5317453f8f87
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/sys_common/backtrace.rs:47:5
   7:     0x7f509cbe8175 - std::sys_common::backtrace::print::hff3c1cc68ba32752
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/sys_common/backtrace.rs:34:9
   8:     0x7f509cbe995f - std::panicking::default_hook::{{closure}}::hbd2ec87f6f318469
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:267:22
   9:     0x7f509cbe969b - std::panicking::default_hook::h18c9417d36edc8b3
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:286:9
  10:     0x7f509cbea069 - std::panicking::rust_panic_with_hook::h372e37e60d4a59db
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:688:13
  11:     0x7f509cbe9e09 - std::panicking::begin_panic_handler::{{closure}}::ha85b34d847b9b8d4
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:579:13
  12:     0x7f509cbe885c - std::sys_common::backtrace::__rust_end_short_backtrace::hd1dcbca66cbef135
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/sys_common/backtrace.rs:137:18
  13:     0x7f509cbe9b12 - rust_begin_unwind
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:575:5
  14:     0x7f509c862523 - core::panicking::panic_fmt::hba5737630c5872d2
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/panicking.rs:64:14
  15:     0x7f509c8629d3 - core::result::unwrap_failed::ha370c8c0925ccfea
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/result.rs:1790:5
  16:     0x7f509c8731ad - librarian::main::hb5a1b9351bb98bef
  17:     0x7f509c87bb83 - std::sys_common::backtrace::__rust_begin_short_backtrace::h45eac8e9e91b0391
  18:     0x7f509c869529 - std::rt::lang_start::{{closure}}::h5dd4b1dbd57a7778
  19:     0x7f509cbde272 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h380d506f5c7e95b1
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/core/src/ops/function.rs:287:13
  20:     0x7f509cbde272 - std::panicking::try::do_call::ha6deefa4ce06476d
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:483:40
  21:     0x7f509cbde272 - std::panicking::try::ha2fc470c4501edef
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:447:19
  22:     0x7f509cbde272 - std::panic::catch_unwind::h2925049db6d1396f
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panic.rs:140:14
  23:     0x7f509cbde272 - std::rt::lang_start_internal::{{closure}}::h0ec8de9ef1d08d9e
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/rt.rs:148:48
  24:     0x7f509cbde272 - std::panicking::try::do_call::h6697fdb6f589d110
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:483:40
  25:     0x7f509cbde272 - std::panicking::try::h9d85f0456d7a332f
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panicking.rs:447:19
  26:     0x7f509cbde272 - std::panic::catch_unwind::h93152b3771a1d7bf
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/panic.rs:140:14
  27:     0x7f509cbde272 - std::rt::lang_start_internal::hf92c8c2f2839001b
                               at /rustc/8460ca823e8367a30dda430efda790588b8c84d3/library/std/src/rt.rs:148:20
  28:     0x7f509c873aa5 - main

kartva commented 1 year ago

OK, it seems that the R script is failing in some way. The next debugging step would be running the R script yourself.

I recommend you try running the following command in the librarian folder:

echo -en "sample_01\tsample_name_01\t23\t25\t38\t13\t0\t22\t23\t16\t36\t0\t25\t25\t17\t30\t0\t18\t28\t12\t39\t0\t37\t12\t12\t36\t0\t39\t12\t28\t18\t0\t31\t17\t26\t24\t0\t37\t17\t22\t22\t0\t17\t34\t23\t24\t0\t29\t20\t15\t35\t0\t25\t18\t31\t23\t0\t28\t21\t19\t30\t0\t23\t22\t27\t27\t0\t24\t27\t19\t27\t0\t33\t21\t21\t23\t0\t30\t20\t20\t27\t0\t31\t19\t19\t29\t0\t30\t19\t19\t30\t0\t29\t19\t20\t30\t0\t30\t19\t19\t29\t0\t30\t20\t20\t29\t0\t30\t19\t20\t29\t0\t30\t19\t20\t29\t0\t30\t19\t20\t29\t0\t30\t19\t20\t29\t0\t30\t19\t19\t30\t0\t30\t20\t20\t29\t0\t29\t20\t19\t30\t0\t29\t20\t19\t29\t0\t30\t20\t19\t29\t0\t28\t20\t20\t30\t0\t30\t20\t19\t29\t0\t30\t19\t19\t30\t0\t29\t19\t20\t30\t0\t30\t20\t20\t29\t0\t30\t20\t19\t29\t0\t29\t20\t20\t30\t0\t29\t20\t20\t29\t0\t30\t20\t19\t29\t0\t29\t20\t19\t30\t0\t29\t20\t19\t30\t0\t28\t21\t19\t30\t0\t28\t20\t19\t31\t0\t28\t21\t19\t31\t0\t27\t21\t19\t31\t0\t27\t21\t19\t31\t0\t28\t21\t18\t31\t0\t28\t21\t18\t31\t0\t27\t21\t18\t32\t0\t28\t21\t18\t31\t0\n" | bash scripts/exec_analysis.sh ..

If successful, it should create a number of files in your librarian folder that represent the analysis of the input compositions. (The echo command just supplies a valid composition that can be analyzed.)

You can control the output path by changing the argument to the exec_analysis.sh script. (.. writes the output one level above the scripts directory.)

a-frantz commented 1 year ago

root@09874c61cc1e:/usr/local/bin# echo -en "sample_01\tsample_name_01\t23\t25\t38\t13\t0\t22\t23\t16\t36\t0\t25\t25\t17\t30\t0\t18\t28\t12\t39\t0\t37\t12\t12\t36\t0\t39\t12\t28\t18\t0\t31\t17\t26\t24\t0\t37\t17\t22\t22\t0\t17\t34\t23\t24\t0\t29\t20\t15\t35\t0\t25\t18\t31\t23\t0\t28\t21\t19\t30\t0\t23\t22\t27\t27\t0\t24\t27\t19\t27\t0\t33\t21\t21\t23\t0\t30\t20\t20\t27\t0\t31\t19\t19\t29\t0\t30\t19\t19\t30\t0\t29\t19\t20\t30\t0\t30\t19\t19\t29\t0\t30\t20\t20\t29\t0\t30\t19\t20\t29\t0\t30\t19\t20\t29\t0\t30\t19\t20\t29\t0\t30\t19\t20\t29\t0\t30\t19\t19\t30\t0\t30\t20\t20\t29\t0\t29\t20\t19\t30\t0\t29\t20\t19\t29\t0\t30\t20\t19\t29\t0\t28\t20\t20\t30\t0\t30\t20\t19\t29\t0\t30\t19\t19\t30\t0\t29\t19\t20\t30\t0\t30\t20\t20\t29\t0\t30\t20\t19\t29\t0\t29\t20\t20\t30\t0\t29\t20\t20\t29\t0\t30\t20\t19\t29\t0\t29\t20\t19\t30\t0\t29\t20\t19\t30\t0\t28\t21\t19\t30\t0\t28\t20\t19\t31\t0\t28\t21\t19\t31\t0\t27\t21\t19\t31\t0\t27\t21\t19\t31\t0\t28\t21\t18\t31\t0\t28\t21\t18\t31\t0\t27\t21\t18\t32\t0\t28\t21\t18\t31\t0\n" | bash scripts/exec_analysis.sh ..
Error: pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available).
Execution halted

After seeing that, I tried adding pandoc to my Docker image. However, the way I did it (adding "pandoc" to the install.packages call: && Rscript -e 'install.packages(c("tidyverse", "umap", "ggrastr", "remotes", "rmarkdown", "pandoc"))' \) didn't appear to work. I got the same error. Now I don't know how to move forward.
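The error above comes from rmarkdown looking for the pandoc executable on the system, and, as this thread shows, adding "pandoc" to install.packages() did not provide it, whereas installing it with apt-get (see kartva's later Dockerfile snippet) did. A minimal sketch of a pre-flight check that could be run inside the image; the function name missing_tools is hypothetical, and the tool list in the example is an assumption based on the errors reported here:

```shell
#!/bin/sh
# Print the name of every requested tool that is not on PATH.
# Returns 0 when all tools are present, 1 otherwise.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example (inside the container): missing_tools Rscript pandoc
```

Running such a check as the last step of the image build would make a missing system dependency fail at build time rather than at analysis time.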

kartva commented 1 year ago

I tried searching my terminal history to find the packages I had installed, and the only one I seem to have installed that the above command doesn't include is svglite. That, however, seems unlikely to be the issue (plus I have no idea why I installed it).

Can you share the Dockerfile? Maybe I can have a go at debugging the issue myself.

Edit: I saw that you've already shared the Dockerfile. Is this the same as the one you are using, except the addition of pandoc?

a-frantz commented 1 year ago

> I tried searching my terminal history to find the packages I had installed, and the only one I seem to have installed that the above command doesn't include is svglite. That, however, seems unlikely to be the issue (plus I have no idea why I installed it).
>
> Can you share the Dockerfile? Maybe I can have a go at debugging the issue myself.
>
> Edit: I saw that you've already shared the Dockerfile. Is this the same as the one you are using, except the addition of pandoc?

Yes, but just to be redundant I've copied and pasted the current Dockerfile that's failing:

FROM ubuntu:22.04

ENV DEBIAN_FRONTEND noninteractive

WORKDIR /usr/local/bin

RUN apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y wget r-base-core r-base-dev libssl-dev libcurl4-openssl-dev libxml2-dev \
    && Rscript -e 'install.packages(c("tidyverse", "umap", "ggrastr", "remotes", "rmarkdown", "pandoc"))' \
    && Rscript -e 'remotes::install_github("rstudio/pins")' \
    && wget https://github.com/DesmondWillowbrook/Librarian/releases/download/v1.1.0/librarian.tar.gz \
    && tar -xzf ./librarian.tar.gz

s-andrews commented 1 year ago

See also #18 which means that even if pandoc is installed this may not work until the next release comes out.

a-frantz commented 1 year ago

@DesmondWillowbrook any progress here? I'd love to be able to incorporate librarian data in the QC paper we're working on, but we can't hold up our pipeline release on one external tool.

Just to reiterate our needs: A Docker image (not hosted on a registry with pull limits) that can run librarian in --local mode.

I'm willing to write/host the Docker image on our ghcr.io packages page. As you can see, we've wrapped other tools for our own use (and of course anyone else can use them as well).

But I think you'd reach the most users by publishing to the Bioconda repo, which should automatically create Biocontainers recipes for you, and then a Docker image will be available on quay.io under the biocontainers namespace. That's how we've been publishing, distributing, and using most of our CLI tools, like our ngsderive package. The Bioconda project is phenomenal, I highly recommend hosting librarian there.

Please get back to me if I can help out in any way 👍

kartva commented 1 year ago

@a-frantz I have pinged @s-andrews and @ChristelKrueger to assist on the issue. We are currently testing a release which fixes #18, and I am having a look at the Dockerfile again.

There doesn't seem to be much of a problem with hosting the image on either ghcr.io or quay.io, so once we confirm that the Dockerfile works I'll upload the image to one of those two.

s-andrews commented 1 year ago

The new release seems to fix all of the outstanding issues I'm aware of. @DesmondWillowbrook if you can try a docker build with this then we could be good to go.

kartva commented 1 year ago

I have rewritten the Dockerfile and gotten this error:

DEBUG [server] Rscript stderr: Error: pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available).
Execution halted

Hopefully we can now resolve the problem of pandoc not being installed as well.

kartva commented 1 year ago

This leads to some positive results:

RUN apt-get update && \
    apt-get install -y pandoc libssl-dev libcurl4-openssl-dev libxml2-dev 
RUN Rscript -e 'install.packages(c("tidyverse", "umap", "ggrastr", "pins", "rmarkdown"))'

kartva commented 1 year ago

@a-frantz I've written a new Dockerfile and pushed it to the repo root. I have tested it on my system and it seems to work. I'm looking into uploading it to ghcr.io now.

kartva commented 1 year ago

@a-frantz Sorry to ping you again, but do try the uploaded Docker image.

kartva commented 1 year ago

It reads from /app/in and puts the output in /app/out. Here's the command I used in the repo root:

docker run \
 -v `pwd`/frontend/example_inputs/example_inputs/:/app/in \
 -v `pwd`/out:/app/out \
 -e RUST_LOG='trace' \
 -t ghcr.io/desmondwillowbrook/librarian \
 /app/in/RNA.example.fastq

I suspect that some improvements can be made to make the /app/in/RNA.example.fastq part amenable to glob expansion.
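One way to approach the glob problem from the host side, sketched under the assumptions that all inputs live in the single directory mounted at /app/in (as in the command above) and that the image passes extra arguments through to librarian. The function name container_paths is hypothetical:

```shell
#!/bin/sh
# Map host file paths to the paths they would have under /app/in,
# assuming their parent directory is the one mounted there.
container_paths() {
    for f in "$@"; do
        printf '/app/in/%s\n' "$(basename "$f")"
    done
}

# Example (not executed here): expand the glob on the host, then pass the
# container-side paths to the image.
#   docker run \
#     -v "$PWD/frontend/example_inputs/example_inputs/":/app/in \
#     -v "$PWD/out":/app/out \
#     -t ghcr.io/desmondwillowbrook/librarian \
#     $(container_paths frontend/example_inputs/example_inputs/*.fastq)
```

The unquoted command substitution in the example splits on whitespace, so this sketch only suits file names without spaces.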

s-andrews commented 1 year ago

> @a-frantz Sorry to ping you again, but do try the uploaded Docker image.

That URL gives a 404 error

kartva commented 1 year ago

Fixed now!

a-frantz commented 1 year ago

@DesmondWillowbrook I've got a WDL prototype but it doesn't seem to like your Docker image. The first problem I ran into is that librarian is not on the PATH. Easy to solve, I just had to specify the location of the binary (/app/librarian). But it would be preferable to just have the binary on the PATH so that it doesn't matter what directory the user is in.

The second problem is fatal. Not really sure what's happening, but I'll share STDOUT and STDERR.

STDOUT:

INFO  [librarian] Processing "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/SRR23034262.R1.fastq.gz"
INFO  [librarian] Processing "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/SRR23034262.R2.fastq.gz"
ERROR [server] Rscript failed with status exit status: 1

STDERR:

thread 'main' panicked at 'Error plotting compositions: R script exited unsuccessfully', cli/src/bin/librarian.rs:167:30
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
   2: core::result::unwrap_failed
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/result.rs:1651:5
   3: librarian::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Not sure what other information I have that would be helpful, but feel free to ask Qs and I'll see what I can dig up.

kartva commented 1 year ago

Try setting the environment variable RUST_LOG=trace.

RUST_LOG=trace /app/librarian --local -o ~{prefix} ~{read_one_fastq} ~{read_two_fastq}

This will enable debug logging and make finding the reason for the failure much easier.

a-frantz commented 1 year ago

New to Rust so not sure the difference, but RUST_LOG=trace didn't enable any extra logging. Less actually. But here's the output from RUST_BACKTRACE=full.

thread 'main' panicked at 'Error plotting compositions: R script exited unsuccessfully', cli/src/bin/librarian.rs:167:30
stack backtrace:
   0:     0x2aaaaaee7c41 - std::backtrace_rs::backtrace::libunwind::trace::hd4ea3ee8a54906d1
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x2aaaaaee7c41 - std::backtrace_rs::backtrace::trace_unsynchronized::hb1a2d1da57c55e75
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x2aaaaaee7c41 - std::sys_common::backtrace::_print_fmt::ha020ae1a8a6e7652
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x2aaaaaee7c41 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h9f8b5092c5aa701c
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x2aaaaaf3172f - core::fmt::rt::Argument::fmt::h08453fb0e29bdaee
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/rt.rs:138:9
   5:     0x2aaaaaf3172f - core::fmt::write::hf3858e39e16479b1
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:1094:21
   6:     0x2aaaaaee3c17 - std::io::Write::write_fmt::h15da7eaa151b6970
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/io/mod.rs:1714:15
   7:     0x2aaaaaee7a55 - std::sys_common::backtrace::_print::ha534c3c7329a338a
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x2aaaaaee7a55 - std::sys_common::backtrace::print::h61e8291cc38f15df
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x2aaaaaee8f83 - std::panicking::default_hook::{{closure}}::h00e02ec987560e96
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:269:22
  10:     0x2aaaaaee8d14 - std::panicking::default_hook::ha1e4b3a7bcabe548
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:288:9
  11:     0x2aaaaaee9509 - std::panicking::rust_panic_with_hook::hd0af6192a7f5c8c0
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:705:13
  12:     0x2aaaaaee9407 - std::panicking::begin_panic_handler::{{closure}}::h142dcdc9390ef36c
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:597:13
  13:     0x2aaaaaee80a6 - std::sys_common::backtrace::__rust_end_short_backtrace::hf4e07a5e30bef7f3
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:151:18
  14:     0x2aaaaaee9152 - rust_begin_unwind
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
  15:     0x2aaaaab5ec03 - core::panicking::panic_fmt::ha8f81b88fc5ce542
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
  16:     0x2aaaaab5f0a3 - core::result::unwrap_failed::hbbea5bb339eab1b8
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/result.rs:1651:5
  17:     0x2aaaaab69aa7 - librarian::main::h50f2e7966fec73df
  18:     0x2aaaaab63a93 - std::sys_common::backtrace::__rust_begin_short_backtrace::h4abc478de1e8c239
  19:     0x2aaaaab7b489 - std::rt::lang_start::{{closure}}::h69cea5011335628f
  20:     0x2aaaaaedf0f2 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h2565476b49fcd7dc
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/ops/function.rs:284:13
  21:     0x2aaaaaedf0f2 - std::panicking::try::do_call::ha1014e3bb8270d94
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:500:40
  22:     0x2aaaaaedf0f2 - std::panicking::try::h47374bd91e24ad1c
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:464:19
  23:     0x2aaaaaedf0f2 - std::panic::catch_unwind::h8a54a4989bd6bf92
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panic.rs:142:14
  24:     0x2aaaaaedf0f2 - std::rt::lang_start_internal::{{closure}}::hb8b1292915cb49e2
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/rt.rs:148:48
  25:     0x2aaaaaedf0f2 - std::panicking::try::do_call::h46799a8db12eba03
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:500:40
  26:     0x2aaaaaedf0f2 - std::panicking::try::h902bd23c128df220
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:464:19
  27:     0x2aaaaaedf0f2 - std::panic::catch_unwind::h5775836cd784f100
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panic.rs:142:14
  28:     0x2aaaaaedf0f2 - std::rt::lang_start_internal::hbfa2424132dddc00
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/rt.rs:148:20
  29:     0x2aaaaab6a3e5 - main

kartva commented 1 year ago

Easy to solve, I just had to specify the location of the binary (/app/librarian). But it would be preferable to just have the binary on the PATH so that it doesn't matter what directory the user is in.

I hadn't anticipated someone wanting to use the binary with the command line inside the Docker container. Unfortunately, Docker has broken on my machine so I am unable to build and push a new image right now.

kartva commented 1 year ago

RUST_LOG=trace didn't enable any extra logging. Less actually.

Are you sure RUST_LOG=trace was set as an environment variable and that the binary was able to read it? The logs are emitted on stdout, not stderr.

The logs should look something like this:

$ RUST_LOG=trace ./librarian --local ../../frontend/example_inputs/example_inputs/RNA.example.fastq
INFO [librarian] Processing "../../frontend/example_inputs/example_inputs/RNA.example.fastq"
...etc

a-frantz commented 1 year ago

Ah I didn't check STDOUT logs, only STDERR. You are correct, there's much more helpful debug info here with RUST_LOG=trace.

$ cat 20231002_164152_librarian/stdout.txt
INFO  [librarian] Processing "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/SRR23034262.R1.fastq.gz"
INFO  [librarian] Processing "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/SRR23034262.R2.fastq.gz"
DEBUG [librarian] Compositions: [
    BaseComp {
        lib: [
            BaseCompCol {
                pos: 1,
                bases: BaseCompColBases {
                    A: 20,
                    T: 15,
                    G: 30,
                    C: 33,
                    N: 0,
                },
            }, ...
        ],
        reads_read: 100000,
    },
    BaseComp {
        lib: [
            BaseCompCol {
                pos: 1,
                bases: BaseCompColBases {
                    A: 9,
                    T: 19,
                    G: 33,
                    C: 36,
                    N: 0,
                },
            }, ...
        ],
        reads_read: 100000,
    },
]
DEBUG [server] Input: "sample_01\tsample_name_01\t20\t33\t30\t15\t0\t18\t26\t29\t26\t0\t23\t26\t22\t27\t0\t30\t24\t22\t22\t0\t31\t20\t26\t21\t0\t35\t21\t22\t20\t0\t20\t24\t21\t33\t0\t22\t29\t23\t24\t0\t21\t28\t23\t25\t0\t33\t21\t23\t21\t0\t25\t28\t27\t18\t0\t23\t28\t25\t23\t0\t25\t25\t23\t25\t0\t25\t25\t24\t24\t0\t25\t25\t24\t24\t0\t25\t26\t24\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t25\t23\t0\t26\t25\t24\t23\t0\t26\t25\t23\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t23\t0\t25\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t26\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t26\t24\t23\t0\t26\t26\t24\t23\t0\t26\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t26\t24\t23\t0\t25\t25\t24\t23\t0\nsample_02\tsample_name_02\t9\t36\t33\t19\t0\t7\t29\t32\t30\t0\t11\t33\t31\t23\t0\t15\t31\t30\t22\t0\t20\t23\t30\t25\t0\t21\t23\t31\t23\t0\t23\t15\t22\t38\t0\t21\t20\t24\t33\t0\t19\t21\t24\t33\t0\t28\t17\t23\t29\t0\t25\t22\t27\t24\t0\t22\t22\t26\t27\t0\t24\t21\t25\t28\t0\t24\t21\t25\t28\t0\t24\t22\t26\t27\t0\t24\t22\t25\t27\t0\t24\t22\t25\t27\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t22\t25\t27\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\n"
DEBUG [server::tempdir] Tempdir: "/tmp/tmp.MeQVAdylZO"
DEBUG [server] Accessing script directory at path "/app/scripts"
DEBUG [server] Rscript stdout: 1/11
2/11 [original script]
3/11
4/11 [sample names]
5/11
6/11 [compositions plot]
7/11
8/11 [probability maps]
9/11
10/11 [heatmap]
11/11

DEBUG [server] Rscript stderr:

processing file: Librarian_analysis.Rmd
Error in file(con, "w") : cannot open the connection
Calls: <Anonymous> -> <Anonymous> -> write_utf8 -> writeLines -> file
In addition: Warning message:
In file(con, "w") :
  cannot open file 'Librarian_analysis.knit.md': Read-only file system
Execution halted

ERROR [server] Rscript failed with status exit status: 1
TRACE [server::tempdir] Deleting files.
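When stdout is this verbose, it can help to pull out just the ERROR entry and whatever preceded it. Below is a minimal Python sketch, assuming the `LEVEL [target] message` layout seen in the excerpt above; the helper names `parse_log` and `errors_with_context` are made up for illustration and are not part of Librarian.

```python
import re

# Assumed layout, inferred from the excerpt: LEVEL, spaces, [target], message.
LOG_LINE = re.compile(r"^(TRACE|DEBUG|INFO|WARN|ERROR)\s+\[([^\]]+)\]\s?(.*)$")


def parse_log(text):
    """Group raw stdout into (level, target, message) records.

    Lines that do not start with a level (for example the multi-line
    Rscript output) are appended to the preceding record's message.
    """
    records = []
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            records.append([match.group(1), match.group(2), match.group(3)])
        elif records:
            records[-1][2] += "\n" + line
    return [tuple(record) for record in records]


def errors_with_context(records):
    """Return each ERROR record together with the record just before it."""
    selected = []
    for index, record in enumerate(records):
        if record[0] == "ERROR":
            if index > 0:
                selected.append(records[index - 1])
            selected.append(record)
    return selected


sample = """INFO  [librarian] Processing "a.fastq.gz"
DEBUG [server] Rscript stderr:

processing file: Librarian_analysis.Rmd
Execution halted

ERROR [server] Rscript failed with status exit status: 1
TRACE [server::tempdir] Deleting files."""

records = parse_log(sample)
for level, target, message in errors_with_context(records):
    print(level, target, message)
```

The continuation handling matters here because the actual R error text is attached to a DEBUG record, not to the ERROR line itself.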

Do you know where it's trying to write this Librarian_analysis.knit.md file? Looks like somewhere in /tmp. This example is actually a SIF image (Singularity), and I've run into odd permissions with these before. I'm running this on our institution's cluster, which is why your Docker image is being converted to SIF. I'll try running it locally on my Mac instead and see if the permissions issue persists.

Are any of these intermediate files large? I'd suspect not. It might be easiest to do all file writing in the CWD instead of /tmp dirs which can act funny from environment to environment. Just my 2 cents, but not entirely sure what's happening here, so maybe my advice doesn't address the actual issue. I've just had problems with certain environments (mainly Singularity containers) and /tmp directories. They aren't always as portable as they should be.

kartva commented 1 year ago

Excellent observation! Currently, the --local mode essentially runs the server code to interact with the R script; the server code uses /tmp/ since it will be processing many requests concurrently. I can modify the librarian executable to directly work in the current directory instead of going through temp.

s-andrews commented 1 year ago

I'd suggest using whichever directory has been set for the output of the program to generate the temp files. That way you're guaranteed to have a location which the user has write permission to.
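As a rough illustration of that suggestion (in Python rather than the Rust used by the CLI, and not the project's actual code), a scratch directory can be created inside the user-supplied output directory instead of under /tmp. The function name `scratch_dir_in` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def scratch_dir_in(output_dir):
    """Create output_dir if needed and return a TemporaryDirectory inside it.

    Intermediate files then live somewhere the user is known to be able to
    write to, and are removed when the context manager exits.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    return tempfile.TemporaryDirectory(prefix="librarian_tmp_", dir=out)


# Stand-in for a user-supplied output directory.
base = Path(tempfile.mkdtemp())
output_dir = base / "out"

with scratch_dir_in(output_dir) as scratch:
    intermediate = Path(scratch) / "intermediate.md"
    intermediate.write_text("scratch data\n")
    scratch_parent = Path(scratch).parent
    print(scratch_parent == output_dir)

# The scratch directory is gone; the output directory remains.
print(Path(scratch).exists(), output_dir.is_dir())
```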

kartva commented 11 months ago

Status update: I'll probably get around to getting this fixed by this weekend.

kartva commented 11 months ago

@a-frantz sorry for taking longer to fix this issue than planned. I've modified the CLI so that it no longer uses /tmp as an intermediate directory and instead writes the results straight to the supplied output directory. Note that I've renamed the prefix flag to output_dir.

Right now I'm trying to rebuild the Docker image so I can reupload it.

kartva commented 11 months ago

@a-frantz Everything should be in order. Let me know if it works.

a-frantz commented 11 months ago

@DesmondWillowbrook Still facing failure

INFO  [librarian] Processing "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/SRR23034262.R1.fastq.gz"
INFO  [librarian] Processing "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/SRR23034262.R2.fastq.gz"
DEBUG [librarian] Compositions: [
    BaseComp {
        lib: [
            BaseCompCol {
                pos: 1,
                bases: BaseCompColBases {
                    A: 20,
                    T: 15,
                    G: 30,
                    C: 33,
                    N: 0,
                },
            }, ...
        ],
        reads_read: 100000,
    },
    BaseComp {
        lib: [
            BaseCompCol {
                pos: 1,
                bases: BaseCompColBases {
                    A: 9,
                    T: 19,
                    G: 33,
                    C: 36,
                    N: 0,
                },
            }, ...
        ],
        reads_read: 100000,
    },
]
DEBUG [server] Input: "sample_01\tsample_name_01\t20\t33\t30\t15\t0\t18\t26\t29\t26\t0\t23\t26\t22\t27\t0\t30\t24\t22\t22\t0\t31\t20\t26\t21\t0\t35\t21\t22\t20\t0\t20\t24\t21\t33\t0\t22\t29\t23\t24\t0\t21\t28\t23\t25\t0\t33\t21\t23\t21\t0\t25\t28\t27\t19\t0\t23\t28\t25\t23\t0\t25\t25\t23\t25\t0\t25\t25\t24\t24\t0\t25\t25\t24\t24\t0\t25\t26\t24\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t25\t23\t0\t26\t25\t24\t23\t0\t26\t25\t23\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t23\t0\t25\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t26\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t26\t24\t23\t0\t25\t26\t24\t23\t0\t26\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t26\t24\t23\t0\t25\t25\t24\t23\t0\nsample_02\tsample_name_02\t9\t36\t33\t19\t0\t7\t29\t32\t30\t0\t11\t33\t31\t23\t0\t15\t31\t30\t22\t0\t20\t23\t30\t25\t0\t21\t23\t31\t23\t0\t23\t15\t22\t37\t0\t21\t20\t24\t33\t0\t19\t21\t24\t33\t0\t28\t17\t23\t29\t0\t25\t22\t27\t24\t0\t22\t22\t26\t27\t0\t24\t21\t25\t28\t0\t24\t21\t25\t28\t0\t24\t22\t26\t27\t0\t24\t22\t25\t27\t0\t24\t22\t25\t27\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t22\t25\t27\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\n"
DEBUG [server::tempdir] Tempdir: "/tmp/tmp.dLxJ7CIw7w"
DEBUG [server] Accessing script directory at path "/app/scripts"
DEBUG [server] Rscript stderr:

processing file: Librarian_analysis.Rmd
Error in file(con, "w") : cannot open the connection
Calls: <Anonymous> -> <Anonymous> -> write_utf8 -> writeLines -> file
In addition: Warning message:
In file(con, "w") :
  cannot open file 'Librarian_analysis.knit.md': Read-only file system
Execution halted

DEBUG [server] Rscript stdout: 1/11
2/11 [original script]
3/11
4/11 [sample names]
5/11
6/11 [compositions plot]
7/11
8/11 [probability maps]
9/11
10/11 [heatmap]
11/11

ERROR [server] Rscript failed with status exit status: 1
TRACE [server::tempdir] Deleting files.

kartva commented 11 months ago

@a-frantz are you sure that you are using the latest container? This output shouldn't be possible with the latest image. I have reuploaded it in case that helps.

a-frantz commented 11 months ago

@a-frantz are you sure that you are using the latest container? This output shouldn't be possible with the latest image. I have reuploaded it in case that helps.

You are correct! My mistake. I forgot to clear my Singularity cache, so it picked up the "old" latest tag. However something is going on with Singularity on our cluster and I can't pull the new image. I'll comment again once I get this sorted out and can test the latest image. Thank you for your work!

You were probably already planning on doing this, but just in case you weren't: please release a "statically versioned" container (once we're done testing+developing here). As per the best practices guide I've written, our code isn't allowed to pull latest containers. They are frequently overwritten, which breaks reproducibility. Once we get this down and working, I'd appreciate a tag like 1.1.1 for the container so my pipeline can make use of it.
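A pipeline can enforce that rule mechanically. Here is a small Python sketch of such a check; it only inspects the reference string (it cannot tell whether a fixed-looking tag has been overwritten upstream), the helper names are hypothetical, and the 1.1.1 tag is the example from the paragraph above, not a published release.

```python
def image_tag(reference):
    """Return the tag of a container image reference, or None if absent.

    For a digest-pinned reference (name@sha256:...) the digest is returned.
    """
    if "@" in reference:
        return reference.split("@", 1)[1]
    # Only look at the last path segment so a registry port is not
    # mistaken for a tag (e.g. localhost:5000/librarian).
    last_segment = reference.rsplit("/", 1)[-1]
    if ":" in last_segment:
        return last_segment.split(":", 1)[1]
    return None


def is_pinned(reference):
    """True when the reference names a fixed tag or digest, not 'latest'."""
    tag = image_tag(reference)
    return tag is not None and tag != "latest"


for ref in (
    "ghcr.io/desmondwillowbrook/librarian:latest",
    "ghcr.io/desmondwillowbrook/librarian:1.1.1",  # hypothetical tag
    "ghcr.io/desmondwillowbrook/librarian",        # implicit latest
):
    print(ref, is_pinned(ref))
```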

Thanks again for helping me through this! I'm really excited to play around with the results of librarian for our samples!

a-frantz commented 11 months ago

I can see result files in work/SRR23034262/. From what I can tell, it successfully completed the analysis (filled in work/SRR23034262/librarian_heatmap.txt), and then crashed while making some of the "fancy" outputs. Here's the TRACE, and a question after:

INFO  [librarian] Processing "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/SRR23034262.R1.fastq.gz"
INFO  [librarian] Processing "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/SRR23034262.R2.fastq.gz"
DEBUG [librarian] Compositions: [
    BaseComp {
        lib: [
            BaseCompCol {
                pos: 1,
                bases: BaseCompColBases {
                    A: 20,
                    T: 15,
                    G: 30,
                    C: 33,
                    N: 0,
                },
            }, ...
        ],
        reads_read: 100000,
    },
    BaseComp {
        lib: [
            BaseCompCol {
                pos: 1,
                bases: BaseCompColBases {
                    A: 9,
                    T: 19,
                    G: 33,
                    C: 36,
                    N: 0,
                },
            }, ...
        ],
        reads_read: 100000,
    },
]
INFO  [librarian] Running locally, using workdir "/mnt/miniwdl_task_container/work/SRR23034262"
DEBUG [server] Input: "sample_01\tsample_name_01\t20\t33\t30\t15\t0\t18\t26\t29\t26\t0\t23\t26\t22\t27\t0\t30\t24\t22\t22\t0\t31\t20\t26\t21\t0\t35\t21\t22\t20\t0\t20\t24\t21\t33\t0\t22\t29\t23\t24\t0\t21\t28\t23\t25\t0\t33\t21\t23\t21\t0\t25\t28\t27\t18\t0\t23\t28\t25\t23\t0\t25\t25\t23\t25\t0\t25\t25\t24\t24\t0\t25\t25\t24\t24\t0\t25\t26\t24\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t25\t23\t0\t26\t25\t24\t23\t0\t26\t25\t23\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t23\t0\t25\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t26\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t26\t24\t23\t0\t25\t26\t24\t23\t0\t26\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t26\t24\t23\t0\t25\t25\t24\t23\t0\nsample_02\tsample_name_02\t9\t36\t33\t19\t0\t7\t29\t32\t30\t0\t11\t33\t31\t23\t0\t15\t31\t30\t22\t0\t20\t23\t30\t25\t0\t21\t23\t31\t23\t0\t23\t15\t22\t38\t0\t21\t20\t24\t33\t0\t19\t21\t24\t33\t0\t28\t17\t23\t29\t0\t25\t22\t27\t24\t0\t22\t22\t26\t27\t0\t24\t21\t25\t28\t0\t24\t21\t25\t28\t0\t24\t22\t26\t27\t0\t24\t22\t25\t27\t0\t24\t22\t25\t27\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t22\t25\t27\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\t24\t23\t25\t26\t0\n"
DEBUG [server] Accessing script directory at path "/app/scripts"
DEBUG [server] Rscript stderr:

processing file: Librarian_analysis.Rmd
Error in file(con, "w") : cannot open the connection
Calls: <Anonymous> -> <Anonymous> -> write_utf8 -> writeLines -> file
In addition: Warning message:
In file(con, "w") :
  cannot open file 'Librarian_analysis.knit.md': Read-only file system
Execution halted

DEBUG [server] Rscript stdout: 1/11
2/11 [original script]
3/11
4/11 [sample names]
5/11
6/11 [compositions plot]
7/11
8/11 [probability maps]
9/11
10/11 [heatmap]
11/11

ERROR [server] Rscript failed with status exit status: 1

What's the proper way to analyze a Paired-End sample? I see in the heatmap file, sample_name_01 and sample_name_02. 1) Does librarian not parse the sample name from the FASTQ filenames? If not, how do I configure the sample name(s) correctly? 2) This example is Read1 and Read2 files, not separate samples. I'm guessing that librarian will ALWAYS make a 1-to-1 relation for Files-to-Samples. So should my Paired-End samples write interleaved FASTQs?

I'll share the heatmap results here:

sample_name ChIP-Seq    RIP-Seq RNA-Seq ssRNA-Seq   ncRNA-Seq   Hi-C    MBD-Seq MeDIP-Seq   Bisulfite-Seq   ATAC-Seq    DNase-HS    miRNA-Seq   MNase-Seq   ChIA-PET
sample_name_01  0   0   100 0   0   0   0   0   0   0   0   0   0   0
sample_name_02  0   20.386548238278067  23.97975253691262   28.152852329050663  10.538500871837146  7.0888477087537645  0   0   0   0   0   9.853498315167732   0   0

Looks to me that the Read1 file was pegged as RNA-Seq (with 100% confidence). I would've expected Read2 to get similar results, but maybe the issue is that the reads are reverse complemented. Should I only analyze Read1s?
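For reading these results programmatically, here is a minimal Python sketch that picks the highest-scoring library type per sample. It assumes librarian_heatmap.txt is tab-separated with a header row, as the table above suggests; the function name `top_predictions` is hypothetical, and the second sample's values are rounded from the table with its last two columns assumed to be 0.

```python
import csv
import io

LIBRARY_TYPES = [
    "ChIP-Seq", "RIP-Seq", "RNA-Seq", "ssRNA-Seq", "ncRNA-Seq", "Hi-C",
    "MBD-Seq", "MeDIP-Seq", "Bisulfite-Seq", "ATAC-Seq", "DNase-HS",
    "miRNA-Seq", "MNase-Seq", "ChIA-PET",
]


def top_predictions(heatmap_text):
    """Map each sample name to (library_type, percentage) of its top score."""
    reader = csv.reader(io.StringIO(heatmap_text), delimiter="\t")
    header = next(reader)
    library_types = header[1:]
    best = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = [float(value) for value in row[1:]]
        top = max(range(len(values)), key=values.__getitem__)
        best[row[0]] = (library_types[top], values[top])
    return best


# Rebuild a small tab-separated table from the values quoted above.
rows = [
    ["sample_name"] + LIBRARY_TYPES,
    ["sample_name_01", 0, 0, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ["sample_name_02", 0, 20.39, 23.98, 28.15, 10.54, 7.09,
     0, 0, 0, 0, 0, 9.85, 0, 0],
]
heatmap_text = "\n".join("\t".join(str(v) for v in row) for row in rows) + "\n"

predictions = top_predictions(heatmap_text)
for sample, (library_type, percentage) in predictions.items():
    print(sample, library_type, percentage)
```

For these two rows it reports RNA-Seq for the first sample and ssRNA-Seq for the second, which makes the disagreement between the two files easy to spot.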

kartva commented 11 months ago

@ChristelKrueger may be better able to answer your queries about analyzing a paired-end sample.

I suspect that RMarkdown tries to use the directory of the script alongside writing the result to the provided path. Is it possible for you to set the scripts directory to read-write? (As far as I recall, the scripts directory is located in the same directory as the executable.)

ChristelKrueger commented 11 months ago

Hi @a-frantz, thank you for your continued interest in Librarian, your testing is very helpful for us!

The training data for Librarian consisted exclusively of R1 for paired-end data. The reason for this is that, as you say, R1 and R2 can be very different, bisulfite libraries being a particularly stark example. Thank you for asking this; we need to be much more explicit about it in the new version of the manuscript and the How To's that will go with it (hopefully resubmitted very soon). For your tests please only use R1. Indeed, all input files are treated as separate samples.

ChristelKrueger commented 11 months ago

Regarding your other question: Librarian should be parsing the FASTQ filenames, I'm a bit puzzled how you ended up with sample_name_01 as these are simply the fictional filenames we had used for our tests. Can you confirm that this librarian_heatmap.txt file was indeed generated from your samples and is not our example output file (which is on github)?

a-frantz commented 11 months ago

I suspect that RMarkdown tries to use the directory of the script alongside writing the result to the provided path. Is it possible for you to set the scripts directory to read-write? (As far as I recall, the scripts directory is located in the same directory as the executable.)

@DesmondWillowbrook Yes, it is technically possible, but it goes against how Singularity is "meant" to be run. By default, Singularity file-systems are Read Only. There is a flag to make the file-system writeable, but 1) it's considered dangerous to use and 2) unfortunately WDL has not yet released a feature for enabling this kind of setting at a per-task level. i.e. If I wanted my pipeline to call librarian with the writeable flag, I would also have to enable writeable status in all my Singularity images. So long story short, I do not have the ability to set /app/scripts/ to R-W 😮‍💨

@ChristelKrueger Thanks for the explanation! That makes perfect sense! I'll rework my implementation to only accept one file and document your requirements. I could not find any example files in your repo that match my "result" file. It appeared in the output directory, so I think it's legit. Maybe the issue is that I supplied similarly named files (only differed on R1 vs R2), and so whatever filename cleaning you have caused a clash, and it fell back on the generic sample names? Just a guess. I can test what happens if I just pass one FASTQ, and see if it gets the right name 👍

ChristelKrueger commented 11 months ago

Thanks for checking. We've done a bit of digging ourselves and found where the sample_name_1 came from - it's us. We'll fix it, it should be the file name.

a-frantz commented 10 months ago

@DesmondWillowbrook Any progress here? Really appreciate the work y'all are putting in here!

kartva commented 10 months ago

I have a patch written that I will be pushing soon to carry over sample names to the final report.

As for the read-only filesystem issue, I will investigate that as well. If you have any tips or insights into how RMarkdown can be prevented from doing that, please feel free to share.

kartva commented 10 months ago

I have pushed said patch (1f419f05ddce215a34f86b756919cc05f37ec74b) and the corresponding Dockerfile.

a-frantz commented 10 months ago

@DesmondWillowbrook I've never used RMarkdown before, so I'm not sure I can be much help with that query.

However I do have an idea. How does the raw data report, rendered in MultiQC, compare to what is produced by RMarkdown?

Because of legalities with data sharing etc. it is not clear at this point in time if our typical end user will have access to your "fancy" output files. They may be limited to interacting with the data through generated MultiQC reports. Regardless of whether the higher-ups in my department clear sharing of your other files, we intend for the main interaction to exist entirely within MultiQC reports, and potentially MultiQC "data directories".

This is to say, for our use-case, maybe we can just disable the RMarkdown output generation? It appears to me that "the final report" is being produced correctly, and I believe that is all that's needed by MultiQC. Then we just don't need any other outputs.

Is my understanding correct?

Maybe you can add a --raw-only parameter that will allow us to skip the problematic RMarkdown script?

ChristelKrueger commented 10 months ago

To chip in here, MultiQC picks up the file librarian_heatmap.txt produced by the R script which contains the percentage values displayed as a heatmap.

s-andrews commented 10 months ago

I did some testing on the docker container and I think you can fix the problem with the markdown rendering by simply setting the TMPDIR variable to point to /app/out so it uses that rather than the internal /tmp which caused the problems.

Running:

docker run  -v `pwd`/frontend/example_inputs/example_inputs/:/app/in  -v `pwd`/out:/app/out  -e RUST_LOG='trace' -e TMPDIR='/app/out'  -t ghcr.io/desmondwillowbrook/librarian /app/in/RNA.example.fastq

Generates

/usr/bin/pandoc +RTS -K512m -RTS Librarian_analysis.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output /app/out/Librarian_analysis.html --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/latex-div.lua --self-contained --variable bs3=TRUE --section-divs --template /usr/local/lib/R/site-library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /app/out/Rtmp08P1YB/rmarkdown-strb44d5f1a.html

..so /app/out does seem to be being recognised as the location for markdown temp files.
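The same idea outside Docker, as a hedged Python sketch: launch a child process with TMPDIR pointing at a directory known to be writable. The child here is a stand-in Python one-liner rather than Librarian, and `run_with_tmpdir` is a hypothetical helper. Note this only redirects files a program creates through its temp-directory mechanism; anything written relative to another directory is unaffected.

```python
import os
import subprocess
import sys
import tempfile


def run_with_tmpdir(command, tmpdir):
    """Run command with TMPDIR set to tmpdir and return its stdout as text."""
    os.makedirs(tmpdir, exist_ok=True)
    env = dict(os.environ, TMPDIR=tmpdir)
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout


# Stand-in for a writable output directory such as /app/out.
writable = os.path.join(tempfile.mkdtemp(), "out")

# Stand-in child process: reports which temp directory it would use.
child = [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"]
reported = run_with_tmpdir(child, writable).strip()
print(reported)
```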

a-frantz commented 10 months ago

This is strange. In my test run, TMPDIR didn't seem to be respected. First, the code used to run:

task librarian {
    input {
        File read_one_fastq
        String prefix = sub(
            basename(read_one_fastq),
            "([_\.][rR][12])?(\.subsampled)?\.(fastq|fq)(\.gz)?$",
            ""
        )
        Int modify_disk_size_gb = 0
        Int max_retries = 1
    }

    Float read1_size = size(read_one_fastq, "GiB")
    Int disk_size_gb = (
        ceil(read1_size) + 10 + modify_disk_size_gb
    )

    command <<<
        set -euo pipefail

        mkdir tmp
        export TMPDIR=$(pwd)/tmp
        RUST_LOG=trace /app/librarian --local -o ~{prefix} ~{read_one_fastq}
    >>>

    # output {
    #     File report =
    # }

    runtime {
        memory: "4 GB"
        disk: "~{disk_size_gb} GB"
        docker: 'ghcr.io/desmondwillowbrook/librarian:latest'
        maxRetries: max_retries
    }
}

stderr:

2023-11-22 11:12:52.665 wdl.t:librarian.stderr thread 'main' panicked at cli/src/bin/librarian.rs:185:51:
2023-11-22 11:12:52.665 wdl.t:librarian.stderr R script should be successful: R script exited unsuccessfully
2023-11-22 11:12:52.665 wdl.t:librarian.stderr note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

STDOUT (skipping beginning of logs as they appear successful and are verbose):

...
INFO  [librarian] Running locally, using workdir "/mnt/miniwdl_task_container/work/SRR23034262"
DEBUG [server] Input: "sample_01\tsample_SRR23034262.R1.fastq.gz\t20\t33\t30\t15\t0\t18\t26\t29\t26\t0\t23\t26\t22\t27\t0\t30\t24\t22\t22\t0\t31\t20\t26\t21\t0\t35\t21\t22\t20\t0\t20\t24\t21\t33\t0\t22\t29\t23\t24\t0\t21\t28\t23\t25\t0\t33\t21\t23\t21\t0\t25\t28\t27\t18\t0\t23\t28\t25\t23\t0\t25\t25\t23\t25\t0\t25\t25\t24\t24\t0\t25\t25\t24\t24\t0\t25\t26\t24\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t25\t23\t0\t26\t25\t24\t23\t0\t26\t25\t23\t24\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t23\t0\t25\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t26\t24\t23\t0\t26\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t25\t24\t23\t0\t26\t26\t24\t23\t0\t26\t26\t24\t23\t0\t26\t25\t24\t24\t0\t25\t25\t24\t23\t0\t25\t25\t24\t23\t0\t26\t25\t24\t23\t0\t25\t26\t24\t23\t0\t25\t25\t24\t23\t0\n"
DEBUG [server] Accessing script directory at path "/app/scripts"
DEBUG [server] Rscript stderr:

processing file: Librarian_analysis.Rmd
Error in file(con, "w") : cannot open the connection
Calls: <Anonymous> -> <Anonymous> -> write_utf8 -> writeLines -> file
In addition: Warning message:
In file(con, "w") :
  cannot open file 'Librarian_analysis.knit.md': Read-only file system
Execution halted

DEBUG [server] Rscript stdout: 1/11
2/11 [original script]
3/11
4/11 [sample names]
5/11
6/11 [compositions plot]
7/11
8/11 [probability maps]
9/11
10/11 [heatmap]
11/11

ERROR [server] Rscript failed with status exit status: 1

Work dir:

$ tree /research_jude/rgs01_jude/groups/zhanggrp/projects/SJCloud/common/workflows_test/librarian/20231122_111113_librarian/work/
/research_jude/rgs01_jude/groups/zhanggrp/projects/SJCloud/common/workflows_test/librarian/20231122_111113_librarian/work/
|-- SRR23034262
|   |-- Librarian_analysis_files
|   |   `-- figure-html
|   |       |-- compositions\ plot-1.png
|   |       |-- heatmap-1.png
|   |       `-- probability\ maps-1.png
|   |-- compositions_map.png
|   |-- compositions_map.svg
|   |-- librarian_heatmap.txt
|   |-- prediction_plot.png
|   |-- prediction_plot.svg
|   |-- probability_maps.png
|   `-- probability_maps.svg
|-- _miniwdl_inputs
|   `-- 0
|       `-- SRR23034262.R1.fastq.gz
`-- tmp

6 directories, 11 files
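For reference, the default prefix in the WDL task above can be reproduced in Python to see why the output directory is named SRR23034262. This is a sketch that uses Python's re module as a stand-in for WDL's sub(); the two regex dialects may differ in edge cases, and `derive_prefix` is a hypothetical name.

```python
import os
import re

# Same pattern as the WDL task: optional _R1/.R2 style read marker,
# optional ".subsampled", then a FASTQ extension with optional ".gz".
PREFIX_PATTERN = re.compile(
    r"([_\.][rR][12])?(\.subsampled)?\.(fastq|fq)(\.gz)?$"
)


def derive_prefix(fastq_path):
    """Strip the directory and the read/extension suffix from a FASTQ path."""
    name = os.path.basename(fastq_path)
    return PREFIX_PATTERN.sub("", name)


for path in (
    "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/SRR23034262.R1.fastq.gz",
    "sample_R2.subsampled.fq",
    "reads.fastq",
):
    print(path, "->", derive_prefix(path))
```

Because the read marker is stripped, R1 and R2 files of the same sample map to the same prefix, which is worth keeping in mind if both are ever passed in one run.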

kartva commented 10 months ago

To rule out the obvious problem: are you sure that you are running the latest build?

a-frantz commented 10 months ago

To rule out the obvious problem: are you sure that you are running the latest build?

Didn't hurt to check, but no dice. Getting the same error. I just did a fresh pull of ghcr.io/desmondwillowbrook/librarian:latest. I can't seem to find a reference to Librarian_analysis.knit.md in your source code. Does this file matter?

From @ChristelKrueger :

To chip in here, MultiQC picks up the file librarian_heatmap.txt produced by the R script which contains the percentage values displayed as a heatmap.

This file is being generated without problems. So I'll reiterate my question from before:

This is to say, for our use-case, maybe we can just disable the RMarkdown output generation? It appears to me that "the final report" is being produced correctly, and I believe that is all that's needed by MultiQC. Then we just don't need any other outputs.

Is my understanding correct?

Maybe you can add a --raw-only parameter that will allow us to skip the problematic RMarkdown script?

How comprehensive is your integration into MultiQC? We use it extensively. I've found some/most tools have fully integrated into MultiQC, and there is never a need to review the "raw" "single sample" output files of the tool itself. Every piece of data is present in the MultiQC data directories. Other tools only end up parsing certain bits of their results, meaning you may still need to review the "raw single sample" files. This is usually a function of how much data is being produced, as MultiQC enforces limits on runtime+memory+etc.

So I'm wondering if we can simply skip all these additional outputs, and just rely on what is parsed by MultiQC.