umccr / umccrise

:snake: DRAGEN Tumor/Normal workflow post-processing
https://umccr.github.io/umccrise/
MIT License

Various bugs #34

Closed kokyriakidis closed 3 years ago

kokyriakidis commented 4 years ago

Hi! I noticed some bugs trying to run the pipeline!

1) The cpsr toml file should have

vep_pick_order = "canonical,appris,tsl,biotype,ccds,rank,length,mane"

otherwise it throws an error.

2) The pcgr toml file should have

vep_pick_order = "canonical,appris,tsl,biotype,ccds,rank,length,mane"

otherwise it throws the same error.

3) Running the rmd steps gives the following error:

processing file: cancer_report.Rmd
Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/RED/umccrise/miniconda/envs/umccrise/lib/R/library/stringi/libs/stringi.so':
  libicui18n.so.64: cannot open shared object file: No such file or directory
Calls: <Anonymous> ... namespaceImport -> loadNamespace -> library.dynam -> dyn.load
Execution halted
MissingOutputException in line 116 of /RED/umccrise/umccrise/umccrise/rmd.smk:
Missing files after 5 seconds:
2016_249_18_WH_P017_2__CCR180159_VPT-WH017A/2016_249_18_WH_P017_2__CCR180159_VPT-WH017A_cancer_report.html
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /RED/umccrise/umccrised/.snakemake/log/2019-10-14T070152.787983.snakemake.log
--------
Error running Umccrise: snakemake returned a non-zero status. Working directory: /RED/umccrise/umccrised

The bug was solved just by running

conda activate umccrise
conda install -c r r-stringi

Then another error appeared:

Quitting from lines 102-120 (cancer_report.Rmd) 
Error in library(tx_ref_genome, character.only = TRUE) : 
  there is no package called 'TxDb.Hsapiens.UCSC.hg19.knownGene'
Calls: <Anonymous> ... withCallingHandlers -> withVisible -> eval -> eval -> library

Installing

conda install -c bioconda bioconductor-txdb.hsapiens.ucsc.hg19.knowngene

DID NOT solve the problem

I had to modify the code in cancer_report.Rmd as follows:

{r load_pkgs}
# report dependencies
library(BSgenome)
library(devtools)
library(DT)
library(dplyr)
library(glue)
library(ggplot2)
library(knitr)
library(kableExtra)
library(MutationalPatterns)
library(readr)
library(rmarkdown)
library(stringr)
library(tidyr)
library(purrr)

# load the BSgenome package matching the genome build of the run
ref_genome <- paste0("BSgenome.Hsapiens.UCSC.", params$genome_build)
library(ref_genome, character.only = TRUE)

# set a CRAN mirror explicitly, otherwise install.packages() fails in a non-interactive run
r <- getOption("repos")
r["CRAN"] <- "http://cran.us.r-project.org"
options(repos = r)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# install the missing TxDb annotation package from Bioconductor, then load it
tx_ref_genome <- paste0("TxDb.Hsapiens.UCSC.", params$genome_build, ".knownGene")
BiocManager::install(tx_ref_genome)

library(tx_ref_genome, character.only = TRUE)

I had to set the CRAN mirror first, because otherwise it throws an error. Now it works fine.

4) The coverage step has bugs regarding cacao. A --ref-fasta parameter is passed that cacao does not recognise, and it throws an error. I could not run this step, so I excluded it from the run.

vladsavelyev commented 4 years ago

Hi Konstantinos! Thanks so much for trying the pipeline and writing down the issues you had.

  1. I selected the vep_pick_order I'm using on purpose: to prioritize transcripts first based on the functional impact of the variants on them ("rank"), then based on the significance of the transcript according to the well-curated APPRIS database ("appris" goes second). In fact, the newer PCGR defaults to this order. See https://github.com/sigven/pcgr/issues/79#issuecomment-509999130 for the discussion.

  2. Same as above.

  3. I had this issue when installing into the Docker container; that's why I have these lines that install the genome packages separately on top of the conda environment. There is apparently a problem with the conda packages for those libraries, so only installing them directly from Bioconductor solves this. I should probably document this in the readme.

  4. I'm using a modified CACAO package which has the --ref-fasta option: https://github.com/vladsaveliev/cacao/commit/650c5ba5295a3fbd325350ba58177dab5ecb5a96. It is needed for processing CRAM files. The CACAO that is installed via conda should point to the modified package.
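
A quick way to check which cacao ends up on the PATH and whether it accepts the option (assuming the umccrise conda environment is active; this is just a sanity check, not part of the pipeline):

conda activate umccrise
which cacao.py
cacao.py --help | grep -e '--ref-fasta' || echo "this cacao build does not accept --ref-fasta"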

vladsavelyev commented 4 years ago

Just wondering, how did you install PCGR, CPSR and CACAO? Per the readme, through conda?

kokyriakidis commented 4 years ago

First of all, I had the following issue when trying the installation:

source <(curl -s https://github.com/umccr/umccrise/blob/master/install.sh)
-bash: /dev/fd/63: line 7: syntax error near unexpected token `newline'
-bash: /dev/fd/63: line 7: `<!DOCTYPE html>'

So I just opened the file and ran the commands line by line.

So I guess PCGR, CPSR and CACAO were installed via conda.


Also, I had a lot of trouble getting the reference data, because some of the scripts do not work as they should, so I had to modify them a bit.


One other big problem was the structure of the files. I had to dig into the paths.yaml in hpc_utils in order to understand the expected files and their directory layout.


COVERAGE does not work. It throws this error:

2019-10-14 08:18:17 - cacao-run - INFO - Start
2019-10-14 08:18:17 - cacao-run - INFO - Validating input files and command-line parameters
2019-10-14 08:18:17 - cacao-run - INFO - Running cacao workflow - assessment of coverage at actionable and pathogenic loci
usage: cacao.py [options] <BAM-or-CRAM> <BED_TARGET> <ALN_FNAME_HOST> <TARGET_FNAME_HOST> <BED_TRACK_DIRECTORY> OUTPUT_DIR> <GENOME_ASSEMBLY> <CANCER_MODE> <SAMPLE_ID> <MAPQ> <THREADS> <CALLABILITY_LEVELS_GERMLINE> <CALLABILITY_LEVELS_SOMATIC>
cacao.py: error: unrecognized arguments: --ref-fasta /workdir/ref.fa

MissingOutputException in line 80 of /RED/umccrise/umccrise/umccrise/coverage.smk:
Missing files after 5 seconds:
2016_249_18_WH_P017_2__CCR180159_VPT-WH017A/coverage/cacao_tumor/2016_249_18_WH_P017_2__CCR180159_VPT-WH017A_grch37_coverage_cacao.html
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /RED/umccrise/umccrised/.snakemake/log/2019-10-14T081805.138857.snakemake.log
--------
Error running Umccrise: snakemake returned a non-zero status. Working directory: /RED/umccrise/umccrised

The command used is:

umccrise '/RED/umccrise/umccrise_test_data/data/bcbio_test_project' pcgr structural small_variants rmd multiqc purple coverage

Also, I found the panel of normals (PON), but I think it can only be used for GRCh37. Is there a PON for hg38?

vladsavelyev commented 4 years ago

Fixed the install.sh issue now; it should be

source <(curl -s https://raw.githubusercontent.com/umccr/umccrise/master/install.sh)

Instead of:

source <(curl -s https://github.com/umccr/umccrise/blob/master/install.sh)

Regarding the reference data, I honestly just documented the commands I used to prepare the data, but never tried to reproduce them. We run this pipeline internally and never tried to make it usable outside. I prepared the reference data once on our HPC, then archived it and uploaded it to S3, from where the AWS version just grabs it. Perhaps it's time I go through and try to install it from scratch, assuming I don't have the reference files tarball. Or just share the prepared S3 tarball with the public.

Regarding the panel of normals, it's built using this script, which iterates over the input list of bcbio runs and normal sample names (e.g. https://github.com/vladsaveliev/vcf_stuff/blob/master/vcf_stuff/panel_of_normals/normals.tsv) in order to find the corresponding BAM files. Internally, it runs this snakemake workflow on a list of BAM files to generate the panel-of-normals VCF. The workflow runs Mutect2 in panel-of-normals mode. Because the samples we use are GRCh37-aligned, we had to produce the hg38 version by lifting over the resulting VCF file using CrossMap. However, we plan to re-align our normal samples and properly re-build the hg38 panel from scratch.
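
For reference, the liftover step looks roughly like this (the chain file, target reference and file names below are placeholders, not the exact paths we use):

CrossMap.py vcf GRCh37_to_GRCh38.chain.gz pon.grch37.vcf.gz hg38.fa pon.hg38.vcf
bgzip pon.hg38.vcf
tabix -p vcf pon.hg38.vcf.gz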

Also keep in mind that our panel is built using normals sequenced in our lab, aimed at handling common artefacts specific to our chemistry; it will not necessarily be effective for other labs, at least not much better than just population filtering with gnomAD. However, in my tests on the COLO829 public benchmark it was still useful. But you might want to build your own if you have enough normal samples. You can reuse this workflow, or just run Mutect2 directly on your BAM files.
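
If you go the direct Mutect2 route, the generic GATK4 panel-of-normals recipe is roughly the following (reference, intervals and sample names are placeholders; this is not the exact snakemake workflow linked above):

# 1) call each normal in panel-of-normals mode
gatk Mutect2 -R ref.fa -I normal1.bam --max-mnp-distance 0 -O normal1.vcf.gz
gatk Mutect2 -R ref.fa -I normal2.bam --max-mnp-distance 0 -O normal2.vcf.gz
# 2) combine the per-sample calls into a GenomicsDB workspace
gatk GenomicsDBImport -R ref.fa -L intervals.bed --genomicsdb-workspace-path pon_db -V normal1.vcf.gz -V normal2.vcf.gz
# 3) create the panel-of-normals VCF
gatk CreateSomaticPanelOfNormals -R ref.fa -V gendb://pon_db -O pon.vcf.gz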

kokyriakidis commented 4 years ago

Thanks so much for the info and the good work!

———————————-

The conf files for PCGR and CPSR are in a folder in the umccrise GitHub repository (https://github.com/umccr/umccrise/tree/master/umccrise/pcgr), so they have to be changed manually in order to work. These are the files umccrise uses when run without Docker.
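
For anyone hitting the same vep_pick_order error, something like the following applies the fix from the first comment to both conf files (the exact file names under umccrise/pcgr/ are a guess here; check the repository for the actual names):

sed -i 's/^vep_pick_order = .*/vep_pick_order = "canonical,appris,tsl,biotype,ccds,rank,length,mane"/' umccrise/pcgr/cpsr.toml umccrise/pcgr/pcgr.toml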

————————————

A reference file tarball would be perfect, so that the installation is quick and easy!

——————————

Regarding coverage, did I install something from another source, and is that why I have these issues?

——————————-

Do you at UMCCR follow the same protocol as Hartwig Medical Foundation regarding sample extraction? E.g. a 30x tumor sample plus a sample for RNA-seq, and 90x germline such as blood? Is your protocol open to share with other institutions, such as the medical department of my university?

kokyriakidis commented 4 years ago

Also, CNVs have much more info in PCGR, like targeted drugs, KEGG pathways, and whether they are proto-oncogenes. In the UMCCR report it is just a list. Can I enable CNVs to be reported in PCGR? I am talking about those reported in the UMCCR example report but not in PCGR.

ohofmann commented 4 years ago

umccrise is currently meant as an in-house framework to help us post-process our patient samples. It's open source for others to explore and check, but it is unlikely to work for your needs, or allow you the customisation you might be after.

kokyriakidis commented 4 years ago

@ohofmann Thanks for the info!