mritchielab / FLAMES

A framework for performing single-cell and bulk read full-length analysis of mutations and splicing.
https://mritchielab.github.io/FLAMES/
GNU General Public License v3.0

not detecting .bai in directory #18

Closed m-noonan closed 8 months ago

m-noonan commented 1 year ago

Hello, I have several questions here, but first, I was running the sc_long_pipeline and got this message:

04:12:28 AM Mon Nov 06 2023 Demultiplex done
Running FLAMES pipeline...
Error in sc_long_pipeline(fastq = fastq, genome_bam = bam, outdir = output, :
  Please make sure the BAM file is indexed
Execution halted

Here is my code, where 'bam' points to a directory containing all my .bam and .bai files (all BAM files are indexed).

library(FLAMES)
setwd("/home/data2/megan/sclr_data/flames_out/flames_redo_bams/")
barcodes = "/home/data2/megan/sclr_data/mousekid/fastq_pass/fastq_pass.whitelist.tsv"
fastq = "/home/data2/megan/sclr_data/all_pass_fastq/merge_allpass.fastq.gz"
bam = "/home/data2/megan/sclr_data/mousekid/fastq_pass/bams/"
output = "/home/data2/megan/sclr_data/flames_out/flames_redo_bams/"
GTF = "/home/users/mnoonan/refdata-gex-mm10-2020-A/genes/genes.gtf"
genome = "/home/users/mnoonan/refdata-gex-mm10-2020-A/fasta/genome.fa"
minimap2_dir = "/home/data/Megan/anaconda3/envs/minimap2/bin/"
sce_flames_redo <- sc_long_pipeline(fastq = fastq, genome_bam = bam, outdir = output, annotation = GTF, genome_fa = genome, minimap2_dir = minimap2_dir, expect_cell_number = 10000, barcodes_file = barcodes)
save.image(file = "flames_redo_output.Rda")

Is it because it is a directory? Do I need to reference the individual .bam/.bai files? Or should I run without the bam files for now? Also, am I able to resume the run where it left off since it said "execution halted"? Just getting to this point took over a week. If not, how can I change the number of threads used or another way to make the pipeline run faster? Thanks for any help or advice you may have!

ChangqingW commented 1 year ago

> Is it because it is a directory?

Yes. This function cannot handle multiple BAM files. sc_long_multisample_pipeline can handle multiple samples, where each sample has one BAM file named specifically as sampleName_align2genome.bam. Do you have multiple BAM files because you have multiple samples?
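Before starting a long run, it may be worth checking that genome_bam points at one indexed BAM file rather than a directory. Below is a minimal sketch in base R. check_bam_input is a hypothetical helper, not part of FLAMES, and it assumes the index uses the usual samtools naming (name.bam.bai or name.bai next to the BAM); how FLAMES itself looks for the index may differ.

```r
# Hypothetical helper (not a FLAMES function): classify what `path` points at.
# Assumes the index sits next to the BAM as <name>.bam.bai or <name>.bai.
check_bam_input <- function(path) {
  if (dir.exists(path)) {
    return("directory")   # a folder of BAMs, which sc_long_pipeline rejects
  }
  if (!file.exists(path)) {
    return("missing")
  }
  if (!grepl("\\.bam$", path)) {
    return("not_bam")
  }
  index_candidates <- c(paste0(path, ".bai"), sub("\\.bam$", ".bai", path))
  if (any(file.exists(index_candidates))) "ok" else "no_index"
}

# With the path from the original post this would report "directory":
# check_bam_input("/home/data2/megan/sclr_data/mousekid/fastq_pass/bams/")
```

Anything other than "ok" means the path should be fixed (or genome_bam left out) before launching the pipeline.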

> Also, am I able to resume the run where it left off since it said "execution halted"? Just getting to this point took over a week. If not, how can I change the number of threads used or another way to make the pipeline run faster?

Did it take over a week just to finish demultiplexing? Yes, you can skip demultiplexing on the rerun. And yes, both demultiplexing and alignment can use multiple threads. Just change "do_barcode_demultiplex": true to "do_barcode_demultiplex": false in the config file (you can create one with create_config(outdir, type = "sc_3end")). You can also change "threads": 1 to a larger number to use more threads during alignment and demultiplexing.
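The config change described above can also be scripted instead of edited by hand. The sketch below assumes the config is a JSON file and that the jsonlite package is installed. set_key and update_config are hypothetical helpers, not FLAMES functions. Because the exact nesting of the config is not shown in this thread, set_key simply replaces the key wherever it appears.

```r
# Hypothetical helper: set `key` to `value` wherever it occurs in a nested list.
set_key <- function(x, key, value) {
  if (!is.list(x)) {
    return(x)
  }
  for (i in seq_along(x)) {
    if (identical(names(x)[i], key)) {
      x[i] <- list(value)
    } else {
      # x[i] <- list(...) keeps NULL entries in place instead of dropping them
      x[i] <- list(set_key(x[[i]], key, value))
    }
  }
  x
}

# Hypothetical helper: apply several changes to a JSON config file in place.
update_config <- function(config_path, changes) {
  cfg <- jsonlite::fromJSON(config_path, simplifyVector = FALSE)
  for (key in names(changes)) {
    cfg <- set_key(cfg, key, changes[[key]])
  }
  jsonlite::write_json(cfg, config_path,
                       auto_unbox = TRUE, pretty = TRUE, null = "null")
  invisible(cfg)
}

# Usage (config_path is the JSON file written by create_config):
# update_config(config_path,
#               list(do_barcode_demultiplex = FALSE, threads = 8))
```

The thread count of 8 is only an illustration. It is worth opening the file afterwards to confirm that only the intended entries changed.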

youyupei commented 1 year ago

Hi @m-noonan,

Just following up on @ChangqingW's response. For the error, you are right: at the moment the sc_long_pipeline function takes a single BAM file as input, which means it was expecting a file, not a directory. Note that the BAM file is optional. If the reads in your existing BAM have not been demultiplexed, I would suggest running without the BAM files and letting FLAMES redo the mapping (set "do_genome_alignment": true). Otherwise, you will have to set "do_gene_quantification": false, because the gene quantification step requires demultiplexed reads in the BAM.
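For reference, a rerun without the BAM input might look like the sketch below. It reuses the paths from the original post and leaves genome_bam out so that FLAMES does the mapping itself. config_path is a hypothetical location for the edited config, and passing it through a config_file argument is an assumption about the installed FLAMES version, so please check ?sc_long_pipeline first.

```r
output <- "/home/data2/megan/sclr_data/flames_out/flames_redo_bams/"
# Hypothetical path: wherever the edited config JSON was saved.
config_path <- file.path(output, "config.json")

# Arguments for the rerun; note that genome_bam is deliberately absent.
pipeline_args <- list(
  fastq = "/home/data2/megan/sclr_data/all_pass_fastq/merge_allpass.fastq.gz",
  outdir = output,
  annotation = "/home/users/mnoonan/refdata-gex-mm10-2020-A/genes/genes.gtf",
  genome_fa = "/home/users/mnoonan/refdata-gex-mm10-2020-A/fasta/genome.fa",
  minimap2_dir = "/home/data/Megan/anaconda3/envs/minimap2/bin/",
  expect_cell_number = 10000,
  barcodes_file = "/home/data2/megan/sclr_data/mousekid/fastq_pass/fastq_pass.whitelist.tsv",
  config_file = config_path  # assumed argument name
)

# Only run when FLAMES and the input files are actually available.
if (requireNamespace("FLAMES", quietly = TRUE) &&
    file.exists(pipeline_args$fastq) && file.exists(config_path)) {
  sce_flames_redo <- do.call(FLAMES::sc_long_pipeline, pipeline_args)
}
```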