mritchielab / FLAMES

A framework for performing single-cell and bulk read full-length analysis of mutations and splicing.
https://mritchielab.github.io/FLAMES/
GNU General Public License v3.0

Could not locate k8 and/or paftools.js in the minimap2 folder, they are required for converting annotation to bed12 files #26

Closed nick-youngblut closed 7 months ago

nick-youngblut commented 7 months ago

If minimap2 is installed on Linux via apt-get, the executable is located at /usr/bin/minimap2, and k8 and paftools.js are not present in that directory.

It appears that the Linux package includes neither paftools.js nor k8:

$ dpkg -L minimap2
/.
/usr
/usr/bin
/usr/bin/minimap2
/usr/share
/usr/share/doc
/usr/share/doc/minimap2
/usr/share/doc/minimap2/changelog.Debian.gz
/usr/share/doc/minimap2/copyright
/usr/share/doc/minimap2/minimap2.pdf
/usr/share/doc/minimap2/run-unit-test
/usr/share/doc/minimap2/test
/usr/share/doc/minimap2/test/MT-human.fa.gz
/usr/share/doc/minimap2/test/MT-orang.fa.gz
/usr/share/doc/minimap2/test/q-inv.fa.gz
/usr/share/doc/minimap2/test/q2.fa
/usr/share/doc/minimap2/test/t-inv.fa.gz
/usr/share/doc/minimap2/test/t2.fa
/usr/share/doc/minimap2/test_script
/usr/share/doc-base
/usr/share/doc-base/minimap2.minimap2
/usr/share/man
/usr/share/man/man1
/usr/share/man/man1/minimap2.1.gz

So it would be helpful to warn users that minimap2 installed via apt-get does not work with FLAMES, at least unless k8 and paftools.js are installed separately.
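The check described above can be sketched in shell. This is a minimal illustration, not FLAMES's own lookup logic; the function name check_minimap2_helpers is hypothetical, and it only assumes that the helpers are expected in the same folder as the minimap2 binary, as the error message states.

```shell
# Report whether k8 and paftools.js sit next to a given minimap2 binary.
# Usage: check_minimap2_helpers /path/to/minimap2
# Returns 0 if both helpers are present, 1 if either is missing.
check_minimap2_helpers() {
    mm2_dir=$(dirname "$1")
    missing=0
    for helper in k8 paftools.js; do
        if [ -e "$mm2_dir/$helper" ]; then
            echo "found:   $mm2_dir/$helper"
        else
            echo "missing: $mm2_dir/$helper"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_minimap2_helpers "$(command -v minimap2)" inspects the folder of whichever minimap2 is first on the PATH.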

I'm using FLAMES 1.8.0.

ChangqingW commented 7 months ago

The pipeline functions now take minimap2 and k8 arguments, and the paftools.js script is included in FLAMES. Hopefully this will make it easier to run on Ubuntu.

If you would like to install the latest commit, you will need to either have Bioconductor in development mode (which requires R 4.4) or apply the following patch with git apply patch.txt to remove the changes specific to the next Bioconductor release:

diff --git a/DESCRIPTION b/DESCRIPTION
index fcb769f..cfc1578 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -39,7 +39,6 @@ Imports:
     DropletUtils,
     GenomicRanges,
     GenomicFeatures,
-    txdbmaker,
     GenomicAlignments,
     GenomeInfoDb,
     ggplot2,
diff --git a/NAMESPACE b/NAMESPACE
index 561d2c6..2fc6eec 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -46,6 +46,7 @@ importFrom(GenomeInfoDb,seqlengths)
 importFrom(GenomicAlignments,readGAlignments)
 importFrom(GenomicAlignments,seqnames)
 importFrom(GenomicFeatures,extractTranscriptSeqs)
+importFrom(GenomicFeatures,makeTxDbFromGFF)
 importFrom(GenomicFeatures,transcripts)
 importFrom(GenomicRanges,GRanges)
 importFrom(GenomicRanges,GRangesList)
@@ -167,8 +168,6 @@ importFrom(tidyr,as_tibble)
 importFrom(tidyr,gather)
 importFrom(tidyr,pivot_longer)
 importFrom(tidyr,pivot_wider)
-importFrom(txdbmaker,makeTxDbFromGFF)
-importFrom(txdbmaker,makeTxDbFromGRanges)
 importFrom(utils,file_test)
 importFrom(utils,modifyList)
 importFrom(utils,read.csv)
diff --git a/R/find_isoform.R b/R/find_isoform.R
index bc5db79..9f7f877 100644
--- a/R/find_isoform.R
+++ b/R/find_isoform.R
@@ -141,8 +141,7 @@ find_isoform_flames <- function(annotation, genome_fa, genome_bam, outdir, confi
 #' @return Path to the outputted transcriptome assembly
 #'
 #' @importFrom Biostrings readDNAStringSet writeXStringSet
-#' @importFrom GenomicFeatures extractTranscriptSeqs
-#' @importFrom txdbmaker makeTxDbFromGFF
+#' @importFrom GenomicFeatures extractTranscriptSeqs makeTxDbFromGFF
 #' @importFrom Rsamtools indexFa
 #' @importFrom utils write.table
 #'
@@ -172,7 +171,7 @@ annotation_to_fasta <- function(isoform_annotation, genome_fa, outdir, extract_f

   dna_string_set <- Biostrings::readDNAStringSet(genome_fa)
   names(dna_string_set) <- gsub(" .*$", "", names(dna_string_set))
-  txdb <- txdbmaker::makeTxDbFromGFF(isoform_annotation)
+  txdb <- GenomicFeatures::makeTxDbFromGFF(isoform_annotation)
   if (missing(extract_fn)) {
     tr_string_set <- GenomicFeatures::extractTranscriptSeqs(dna_string_set, txdb,
       use.names = TRUE)
diff --git a/R/model_decay.R b/R/model_decay.R
index 075cee0..d797339 100644
--- a/R/model_decay.R
+++ b/R/model_decay.R
@@ -4,7 +4,6 @@
 #' that only differ by the 5' / 3' end. This could be useful for plotting average
 #' coverage plots.
 #' 
-#' @importFrom txdbmaker makeTxDbFromGFF makeTxDbFromGRanges
 #' @importFrom rtracklayer import
 #' @importFrom S4Vectors split
 #' @importFrom GenomicRanges strand
@@ -27,11 +26,11 @@
 filter_annotation <- function(annotation, keep = "tss_differ") {
   if (is.character(annotation)) {
     annotation <- annotation |>
-      txdbmaker::makeTxDbFromGFF() |>
+      GenomicFeatures::makeTxDbFromGFF() |>
       GenomicFeatures::transcripts()
   } else {
     annotation <- annotation |>
-      txdbmaker::makeTxDbFromGRanges() |>
+      GenomicFeatures::makeTxDbFromGRanges() |>
       GenomicFeatures::transcripts()
   }

@@ -55,7 +54,7 @@ filter_annotation <- function(annotation, keep = "tss_differ") {
 #' @description Plot the average read coverages for each length bin or a 
 #' perticular isoform
 #' 
-#' @importFrom GenomicFeatures transcripts
+#' @importFrom GenomicFeatures makeTxDbFromGFF transcripts
 #' @importFrom GenomicAlignments readGAlignments seqnames 
 #' @importFrom GenomicRanges width strand granges coverage
 #' @importFrom Rsamtools ScanBamParam

nick-youngblut commented 7 months ago

Thanks @ChangqingW for making the updates!

> bioconductor in development mode (which requires R 4.4)

This is a good example of how Bioconductor just adds unneeded complexity to package management in R.

I'm using R 4.3.1, and I have no plans to recreate my Docker environment with R 4.4 (the build for rocker/rstudio + Seurat + FLAMES takes nearly an hour). When do you plan on submitting a new release to Bioconductor?

ChangqingW commented 7 months ago

Bioconductor's next release is scheduled for May 1st: https://bioconductor.org/developers/release-schedule/

nick-youngblut commented 7 months ago

Thanks for letting me know. I'm guessing that I'll have to update R just to use the updated version of Bioconductor.

I tried to just use BiocManager::install("mritchielab/FLAMES") with R 4.3.1, which failed while installing txdbmaker, a package that is only available for Bioconductor 3.19. I'm guessing this is why you state that R 4.4 is needed.

The convoluted dependency trees and the release cycle separate from CRAN can really make Bioconductor a pain (what's wrong with good old CRAN?).

ChangqingW commented 7 months ago

You can try cloning to a local folder, applying the diff I posted to remove the txdbmaker changes, and installing from the local folder:

git clone https://github.com/mritchielab/FLAMES.git && cd FLAMES
(save the patch file somewhere)
git apply path/to/patch/file

Then, in R:

remotes::install_local("path/to/cloned/FLAMES", force = T)
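Because the patch was written against a specific commit, a dry run can confirm that it still matches the checkout before anything is modified. A minimal sketch, assuming git is installed; the function name apply_patch_checked is hypothetical, and git apply --check is the standard dry-run option of git apply.

```shell
# Apply a patch only if it would apply cleanly.
# Usage: apply_patch_checked /path/to/repo /absolute/path/to/patch.txt
# The patch path should be absolute, because git -C changes directory first.
apply_patch_checked() {
    repo=$1
    patch_file=$2
    if git -C "$repo" apply --check "$patch_file"; then
        git -C "$repo" apply "$patch_file"
    else
        echo "patch does not apply cleanly to $repo" >&2
        return 1
    fi
}
```

If the dry run fails, the working tree is left untouched, which makes it easier to tell a stale patch from a build problem later on.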

nick-youngblut commented 7 months ago

Thanks @ChangqingW for the suggestion! remotes doesn't always work well for Bioconductor packages, but I'll give it a try.

nick-youngblut commented 7 months ago

Does txdbmaker really have to be a required dependency? It appears to be available only for Bioconductor 3.19, but that version hasn't even been fully released yet. The same goes for R 4.4. Not everyone is on the bleeding edge of R & Bioconductor. For instance, rocker doesn't even have any Docker containers for R 4.4.

Would it be possible to remove txdbmaker from Imports:? Note: including txdbmaker in Imports: negates the inclusion of txdbmaker in Suggests:.

ChangqingW commented 7 months ago

You can indeed; that is what the git patch file is doing. We can't do it for this branch as it is the devel branch, which is built and checked with Bioc devel. As for release branches, we cannot introduce API changes in minor version bumps as per Bioc guidelines, and the next major version bump will be on May 1st according to Bioc's schedule, when the devel updates will be merged.

You can also try my own fork BiocManager::install("ChangqingW/FLAMES-R", ref = "devel_R_4_3", force = T), which has the patch applied and is not synced with any Bioc branches.

nick-youngblut commented 7 months ago

Patching and running remotes::install_local("path/to/cloned/FLAMES", force = T) worked. Thanks 👍

nick-youngblut commented 6 months ago

I've switched to a different computing infrastructure in which software must be installed in conda environments.

Installing FLAMES via:

Results in the following error:

── R CMD build ────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/tmp/Rtmp3pht5z/file15a958461b6c75/FLAMES/DESCRIPTION’ ...
─  preparing ‘FLAMES’:
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘FLAMES_1.9.2.tar.gz’

ERROR: dependency ‘scater’ is not available for package ‘FLAMES’

Even with the patch, scater is still a dependency:

Imports:
    basilisk,
    bambu,
    Biostrings,
    BiocGenerics,
    circlize,
    ComplexHeatmap,
    cowplot,
    dplyr,
    DropletUtils,
    GenomicRanges,
    GenomicFeatures,
    GenomicAlignments,
    GenomeInfoDb,
    ggplot2,
    ggbio,
    grid,
    gridExtra,
    igraph,
    jsonlite,
    magrittr,
    Matrix,
    parallel,
    reticulate,
    Rsamtools,
    rtracklayer,
    RColorBrewer,
    SingleCellExperiment,
    SummarizedExperiment,
    scater,

nick-youngblut commented 6 months ago

In another attempt with less usage of conda:

The error:

ERROR: dependencies ‘rtracklayer’, ‘biomaRt’ are not available for package ‘GenomicFeatures’
* removing ‘/home/nickyoungblut/miniforge3/envs/flames/lib/R/library/GenomicFeatures’
ERROR: dependencies ‘ggplot2’, ‘patchwork’ are not available for package ‘ggstats’
* removing ‘/home/nickyoungblut/miniforge3/envs/flames/lib/R/library/ggstats’
ERROR: dependencies ‘GenomicFeatures’, ‘rtracklayer’ are not available for package ‘ensembldb’
* removing ‘/home/nickyoungblut/miniforge3/envs/flames/lib/R/library/ensembldb’
ERROR: dependencies ‘SummarizedExperiment’, ‘rtracklayer’, ‘BSgenome’, ‘GenomicFeatures’ are not available for package ‘VariantAnnotation’
* removing ‘/home/nickyoungblut/miniforge3/envs/flames/lib/R/library/VariantAnnotation’
ERROR: dependencies ‘ggplot2’, ‘ggstats’ are not available for package ‘GGally’
* removing ‘/home/nickyoungblut/miniforge3/envs/flames/lib/R/library/GGally’
ERROR: dependency ‘GenomicFeatures’ is not available for package ‘OrganismDbi’
* removing ‘/home/nickyoungblut/miniforge3/envs/flames/lib/R/library/OrganismDbi’
ERROR: dependencies ‘SummarizedExperiment’, ‘BSgenome’, ‘GenomicAlignments’, ‘GenomicFeatures’, ‘xgboost’ are not available for package ‘bambu’
* removing ‘/home/nickyoungblut/miniforge3/envs/flames/lib/R/library/bambu’
ERROR: dependencies ‘Hmisc’, ‘SummarizedExperiment’, ‘GenomicAlignments’, ‘GenomicFeatures’, ‘VariantAnnotation’, ‘ensembldb’ are not available for package ‘biovizBase’
* removing ‘/home/nickyoungblut/miniforge3/envs/flames/lib/R/library/biovizBase’
ERROR: dependencies ‘ggplot2’, ‘Hmisc’, ‘biovizBase’, ‘SummarizedExperiment’, ‘GenomicAlignments’, ‘BSgenome’, ‘VariantAnnotation’, ‘rtracklayer’, ‘GenomicFeatures’, ‘OrganismDbi’, ‘GGally’, ‘ensembldb’ are not available for package ‘ggbio’
* removing ‘/home/nickyoungblut/miniforge3/envs/flames/lib/R/library/ggbio’

The downloaded source packages are in
    ‘/tmp/RtmpsIf7lY/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
── R CMD build ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/tmp/RtmpsIf7lY/file15bbd932199be9/FLAMES/DESCRIPTION’ ...
─  preparing ‘FLAMES’:
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘FLAMES_1.9.2.tar.gz’

ERROR: dependencies ‘basilisk’, ‘bambu’, ‘cowplot’, ‘DropletUtils’, ‘GenomicFeatures’, ‘GenomicAlignments’, ‘ggplot2’, ‘ggbio’, ‘igraph’, ‘Matrix’, ‘reticulate’, ‘rtracklayer’, ‘SingleCellExperiment’, ‘SummarizedExperiment’, ‘scater’, ‘scuttle’, ‘scran’, ‘MultiAssayExperiment’ are not available for package ‘FLAMES’
* removing ‘/home/nickyoungblut/miniforge3/envs/flames/lib/R/library/FLAMES’
There were 50 or more warnings (use warnings() to see the first 50)

Given that there are nearly 250 R packages in total as dependencies of FLAMES, I can see why it can be tricky to install all of them without any errors.

nick-youngblut commented 6 months ago

I hadn't seen that there is a FLAMES bioconda recipe. With it, the install becomes quite simple and much faster:

mamba create -n flames bioconductor-flames minimap2 k8

nick-youngblut commented 5 months ago

Even with the patch, I'm getting:

07:09:56 PM Thu Jun 06 2024 Start running
Running BLAZE to generate barcode list from long reads...
$`output-prefix`
[1] "/large_experiments/multiomics/SspArc0008_10x_cDNA_longRead//FLAMES/sc_3end//"

$`output-fastq`
[1] "matched_reads.fastq"

$threads
[1] 8

$`max-edit-distance`
[1] 2

$overwrite
[1] TRUE

+ /home/nickyoungblut/.cache/R/basilisk/1.14.1/0/bin/conda create --yes --prefix /home/nickyoungblut/.cache/R/basilisk/1.14.1/FLAMES/1.9.2/flames_env 'python=3.10' --quiet -c conda-forge -c bioconda -c defaults

+ /home/nickyoungblut/.cache/R/basilisk/1.14.1/0/bin/conda install --yes --prefix /home/nickyoungblut/.cache/R/basilisk/1.14.1/FLAMES/1.9.2/flames_env 'python=3.10' -c conda-forge -c bioconda -c defaults

+ /home/nickyoungblut/.cache/R/basilisk/1.14.1/0/bin/conda install --yes --prefix /home/nickyoungblut/.cache/R/basilisk/1.14.1/FLAMES/1.9.2/flames_env -c conda-forge -c bioconda -c defaults 'python=3.10' 'python=3.10' 'numpy=1.25.0' 'scipy=1.11.1' 'pysam=0.21.0' 'cutadapt=4.4' 'tqdm=4.64.1' 'pandas=1.3.5'

Running BLAZE...
Argument:  --expect-cells  8000 --overwrite --minimal_stdout  --output-prefix /large_experiments/multiomics/SspArc0008_10x_cDNA_longRead//FLAMES/sc_3end// --output-fastq matched_reads.fastq --threads 8 --max-edit-distance 2 /large_experiments/multiomics/SspArc0008_10x_cDNA_longRead//ont-proc_output/final/fastq_test_10k 
07:15:12 PM Thu Jun 06 2024 Demultiplex done
Running FLAMES pipeline...
#### Input parameters:
{
  "pipeline_parameters": {
    "seed": [2022],
    "threads": [8],
    "do_barcode_demultiplex": [true],
    "do_gene_quantification": [true],
    "do_genome_alignment": [true],
    "do_isoform_identification": [true],
    "bambu_isoform_identification": [false],
    "multithread_isoform_identification": [true],
    "do_read_realignment": [true],
    "do_transcript_quantification": [true]
  },
  "barcode_parameters": {
    "max_bc_editdistance": [2],
    "max_flank_editdistance": [8],
    "pattern": {
      "primer": ["CTACACGACGCTCTTCCGATCT"],
      "BC": ["NNNNNNNNNNNNNNNN"],
      "UMI": ["NNNNNNNNNNNN"],
      "polyT": ["TTTTTTTTT"]
    },
    "TSO_seq": ["CCCATGTACTCTGCGTTGATACCACTGCTT"],
    "TSO_prime": [3],
    "full_length_only": [false]
  },
  "isoform_parameters": {
    "generate_raw_isoform": [false],
    "max_dist": [10],
    "max_ts_dist": [100],
    "max_splice_match_dist": [10],
    "min_fl_exon_len": [40],
    "max_site_per_splice": [3],
    "min_sup_cnt": [5],
    "min_cnt_pct": [0.001],
    "min_sup_pct": [0.2],
    "bambu_trust_reference": [true],
    "strand_specific": [0],
    "remove_incomp_reads": [4],
    "downsample_ratio": [1]
  },
  "alignment_parameters": {
    "use_junctions": [true],
    "no_flank": [false]
  },
  "realign_parameters": {
    "use_annotation": [true]
  },
  "transcript_counting": {
    "min_tr_coverage": [0.4],
    "min_read_coverage": [0.4]
  }
} 
gene annotation: /large_experiments/multiomics/references/FLAMES/refdata-gex-GRCm39-2024-A/genes/genes.gtf 
genome fasta: /large_experiments/multiomics/references/FLAMES/refdata-gex-GRCm39-2024-A/fasta/genome.fa 
input fastq: /large_experiments/multiomics/SspArc0008_10x_cDNA_longRead//FLAMES/sc_3end//matched_reads.fastq 
output directory: /large_experiments/multiomics/SspArc0008_10x_cDNA_longRead//FLAMES/sc_3end/ 
minimap2 path: 
k8 path: 
#### Aligning reads to genome using minimap2
07:15:12 PM Thu Jun 06 2024 minimap2_align
Error in minimap2_align(config, genome_fa, infq, annotation, outdir, minimap2, : k8 not found, please make sure it is installed and provide its path as the k8 argument
Traceback:

1. sc_long_pipeline(fastq = fastq_dir, annotation = ref_annot_file, 
 .     genome_fa = ref_genome_file, outdir = outdir, config_file = config_file, 
 .     expect_cell_number = 8000)
2. minimap2_align(config, genome_fa, infq, annotation, outdir, minimap2, 
 .     k8, prefix = NULL, threads = config$pipeline_parameters$threads)
3. stop("k8 not found, please make sure it is installed and provide its path as the k8 argument")

k8 is in my PATH for my conda env. See https://github.com/mritchielab/FLAMES/issues/34#issuecomment-2153219282 for how I'm currently conducting the install.
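The log above prints empty "minimap2 path:" and "k8 path:" entries, and the error message asks for the path to be provided as the k8 argument. One thing to try is resolving the absolute paths from the activated environment and passing them explicitly. A minimal shell sketch; resolve_tool is a hypothetical helper name, and whether this resolves the error is untested here.

```shell
# Print the absolute path of a tool found on PATH,
# or fail with a message on stderr if it is not there.
# Usage: resolve_tool k8
resolve_tool() {
    tool_path=$(command -v "$1") || {
        echo "$1 not found on PATH" >&2
        return 1
    }
    echo "$tool_path"
}
```

Running resolve_tool k8 and resolve_tool minimap2 inside the activated conda env prints the paths that could then be supplied as the k8 and minimap2 arguments of sc_long_pipeline, which the maintainer's earlier comment says the pipeline functions accept.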