pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

lr-kallisto dorado unaligned bam files #450

Open MustafaElshani opened 3 months ago

MustafaElshani commented 3 months ago

Hi

It is great to hear that long-read support is coming to kallisto. I would like to introduce it into our pipelines.

I have a few questions about running lr-kallisto on ONT reads basecalled with dorado.

  1. My first step is kallisto bus --long -x bulk -i "$INDEX_PATH" -o "$OUTPUT_DIR/$SAMPLE_NAME" --bam "$BAM_FILE" -t $SLURM_NTASKS which uses the dorado .bam output file. Is this correct, or should I use FASTQ files instead, such as the oriented full-length reads produced by pychopper? When I ran it with --bam, it threw an error:

    Error: in order to use BAM, must compile with BAM option enabled
    Threshold not in (0,1). Setting default threshold for unmapped kmers to 0.8
    Error: --bam not supported in this mode
  2. My second step is bustools sort -t $SLURM_NTASKS -o "$OUTPUT_DIR/$SAMPLE_NAME/sorted.bus" "$OUTPUT_DIR/$SAMPLE_NAME/output.bus" followed by bustools count -o "$OUTPUT_DIR/$SAMPLE_NAME/count" -g "$GTF_PATH" -e "$OUTPUT_DIR/$SAMPLE_NAME/matrix.ec" -t "$OUTPUT_DIR/$SAMPLE_NAME/transcripts.txt" --cm "$OUTPUT_DIR$ Is this correct?

  3. My third step is kallisto quant-tcc -i "$INDEX_PATH" -o "$OUTPUT_DIR/$SAMPLE_NAME" --long -P ONT --gtf "$GTF_PATH" --matrix-to-files -t $SLURM_NTASKS "$OUTPUT_DIR/$SAMPLE_NAME/count.mtx"

Is this the correct approach? Additionally, if I had 10x Genomics Visium ONT reads, could I process them using -x Visium?

Yenaled commented 3 months ago

You need to compile with -DUSE_BAM=ON in cmake. I haven't thoroughly tested whether BAM works, though, so FASTQ is the safer option in terms of avoiding bugs.

Your other commands seem fine.

I don’t think -x VISIUM works because that requires barcodes and UMIs to be at fixed positions in R1 with the sequence to be mapped being in R2 — ONT data doesn’t look like that.
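To make the fixed-position requirement concrete: kallisto bus also accepts a custom -x string with three colon-separated parts (barcode, UMI, sequence), each written as file,start,stop. The kallisto manual gives 0,0,16:0,16,26:1,0,0 as the 10x v2 layout. The sketch below only decodes such a string into words; describe_tech is a hypothetical helper, not part of kallisto, and it assumes that a stop of 0 means "to the end of the read".

```shell
# Hypothetical helper: describe a kallisto bus custom technology string.
# $1 has the form "bc:umi:seq", where each part is "file,start,stop".
# Assumption: stop == 0 is read as "to the end of the read".
describe_tech() {
  printf '%s\n' "$1" | awk -F: '{
    split("barcode UMI sequence", name, " ")
    for (i = 1; i <= 3; i++) {
      split($i, f, ",")
      if (f[3] == 0)
        printf "%s: file %s, base %s to end of read\n", name[i], f[1], f[2]
      else
        printf "%s: file %s, bases %s-%s\n", name[i], f[1], f[2], f[3]
    }
  }'
}

# 10x v2 layout: barcode and UMI in the first file, cDNA in the second.
describe_tech "0,0,16:0,16,26:1,0,0"
```

A single ONT read has no separate first file with the barcode and UMI at known offsets, which is why a layout like this does not apply to it directly.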

MustafaElshani commented 3 months ago

Thank you, that's great. I think I will use the pychopper FASTQ files, as I have already run into a bug while building:

 make
[  2%] Performing configure step for 'htslib'
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
configure: error: cannot find install-sh, install.sh, or shtool in "." "./.." "./../.."
make[2]: *** [CMakeFiles/htslib.dir/build.make:92: /home/m/scratch/bioinformatic_tools/kallisto/ext/htslib/src/htslib-stamp/htslib-configure] Error 1

It would have been good if Visium long reads were supported; maybe something for the future.

Regards

Mustafa

Yenaled commented 3 months ago

You might be able to get it to work — I can compile htslib just fine but I think I needed to use the right C compiler or build in a docker. It’s a bit tricky.

For VISIUM, you can probably get it to work — I’m just not familiar with the read structure.

bound-to-love commented 3 months ago

Hi! lr-kallisto is designed to run directly with fastq files. Please let me know if you run into any other issues!
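If the reads only exist as an unaligned dorado BAM, one way to get FASTQ is samtools fastq. This is a minimal sketch, assuming samtools is installed and the input file name ends in .bam; bam_to_fastq and the DRY_RUN switch are hypothetical names used for illustration, and the pychopper orientation step is not included.

```shell
# Sketch: convert an unaligned BAM to gzipped FASTQ with samtools.
# Assumption: samtools is on PATH. Set DRY_RUN=1 to print the command
# instead of running it.
bam_to_fastq() {
  bam="$1"
  fastq="${bam%.bam}.fastq.gz"   # reads.bam -> reads.fastq.gz
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "samtools fastq $bam | gzip > $fastq"
  else
    samtools fastq "$bam" | gzip > "$fastq"
  fi
}

# Example (placeholder file name):
#   bam_to_fastq sample.bam
```

The resulting FASTQ can then be given to kallisto bus as a positional input in place of --bam.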

bound-to-love commented 3 months ago

Apologies, I hadn't seen your other questions! I think you are missing the genemap for the -g flag, which can be generated by kb ref. As for Visium ONT, -x visium is for paired-end reads, so it won't work directly. I'd be interested to hear more about your Visium ONT data, though, and in adding processing steps for it to our pipelines!
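The genemap passed to bustools count -g is a tab-separated transcript-to-gene table (transcript ID in the first column, gene ID in the second), not a GTF. A quick sanity check before running the count step could look like the sketch below; guess_map_type and its heuristic (nine tab-separated columns with numeric start and end means GTF) are assumptions for illustration, not part of bustools or kb.

```shell
# Hypothetical heuristic: classify a file as "gtf", "t2g", or "unknown"
# based on its first non-comment, non-empty line.
guess_map_type() {
  awk -F'\t' '
    /^#/ || NF == 0 { next }   # skip GTF header comments and blank lines
    {
      if (NF == 9 && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/) print "gtf"
      else if (NF >= 2) print "t2g"
      else print "unknown"
      found = 1
      exit
    }
    END { if (!found) print "unknown" }
  ' "$1"
}

# Example (placeholder variable from the commands in this thread):
#   guess_map_type "$GTF_PATH"
```

If the check reports gtf for the file given to -g, that file is the annotation rather than the transcript-to-gene map.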

hyeon9 commented 2 months ago

Hi! I'm wondering if the '-x 10xv2 or 10xv3' options are also for paired end reads.