pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
656 stars 172 forks source link

Using lr-kallisto for quantifying long-read bulk RNA-seq #456

Closed alanlamsiu closed 2 months ago

alanlamsiu commented 2 months ago

Hi kallisto team,

Could you point me to the documentation showing steps for running lr-kallisto for long-read bulk RNA-seq?

bound-to-love commented 2 months ago

See the —long options in the manual https://pachterlab.github.io/kallisto/manual . To run lr-kallisto on bulk reads use -x bulk —long. Note there is also a —long option for quant-tcc for quantifying at the transcript level. Also, note that you will need to build kallisto from source in a new directory with cmake -DMAX_KMER_SIZE=64 ..

bound-to-love commented 2 months ago

Hi, just wanted to add the note that kallisto index will need run with -k 63. Let me know if you run into any issues!

alanlamsiu commented 2 months ago

Hi @bound-to-love, thanks for the advice. I built kallisto 0.51.0 from source by setting DMAX_KMER_SIZE=64. When I checked the help manual of quant-tcc using command line, I got the following,

Quantifies abundance from pre-computed transcript-compatibility counts

Usage: kallisto quant-tcc [arguments] transcript-compatibility-counts-file

Required arguments:
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-i, --index=STRING            Filename for the kallisto index to be used
                              (required if file with names of transcripts not supplied)
-T, --txnames=STRING          File with names of transcripts
                              (required if index file not supplied)
-e, --ec-file=FILE            File containing equivalence classes
                              (default: equivalence classes are taken from the index)
-f, --fragment-file=FILE      File containing fragment length distribution
                              (default: effective length normalization is not performed)
--long                        Use version of EM for long reads
-P, --platform.               [PacBio or ONT] used for sequencing
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (note: -l, -s values only should be supplied when
                               effective length normalization needs to be performed
                               but --fragment-file is not specified)
-p, --priors                  Priors for the EM algorithm, either as raw counts or as
                              probabilities. Pseudocounts are added to raw reads to
                              prevent zero valued priors. Supplied in the same order
                              as the transcripts in the transcriptome
-t, --threads=INT             Number of threads to use (default: 1)
-g, --genemap                 File for mapping transcripts to genes
                              (required for obtaining gene-level abundances)
-G, --gtf=FILE                GTF file for transcriptome information
                              (can be used instead of --genemap for obtaining gene-level abundances)
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --matrix-to-files         Reorganize matrix output into abundance tsv files
    --matrix-to-directories   Reorganize matrix output into abundance tsv files across multiple directories
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext only, not HDF5

which looks different than the one in https://pachterlab.github.io/kallisto/manual. I have both the index and fastq files. What would be the steps in order to generate the transcript-compatibility-counts-file required by quant-tcc?

bound-to-love commented 2 months ago

Hello, as described in the manual "The necessary files can all be generated from bustools count in bustools." So the commands would be kallisto bus, bustools sort, bustools correct (optional), bustools count, kallisto quant-tcc with all the necessary inputs/outputs.

alanlamsiu commented 2 months ago

Hi @bound-to-love, I learned the steps from a kallisto issue posted by another user and here are the steps.

kallisto bus -i index_file -o outdir --long -x bulk sample.fastq.gz
bustools sort -o outdir outdir/output.bus
bustools count -o outdir/out -g sample.gtf -e outdir/matrix.ec -t outdir/transcripts.txt outdir/output.bus

I notice that the outdir/output.bus file is not empty. The beginning few lines of outdir/matrix.ec shows below. And the outdir/transcripts.txt file shows a list of transcript IDs as in the transcriptome fasta file used for making the index.

0       188087
1       128482,133273
2       129500
3       128482
4       128482,129500,130163,133273

However, directory outdir/out is empty. Could you advice on this issue? And could you give an example on how to run the quant-tcc command as I don't seem to find one in the manual?

Thanks.

icemduru commented 2 months ago

Hi @bound-to-love, I learned the steps from a kallisto issue posted by another user and here are the steps.

kallisto bus -i index_file -o outdir --long -x bulk sample.fastq.gz
bustools sort -o outdir outdir/output.bus
bustools count -o outdir/out -g sample.gtf -e outdir/matrix.ec -t outdir/transcripts.txt outdir/output.bus

I notice that the outdir/output.bus file is not empty. The beginning few lines of outdir/matrix.ec shows below. And the outdir/transcripts.txt file shows a list of transcript IDs as in the transcriptome fasta file used for making the index.

0       188087
1       128482,133273
2       129500
3       128482
4       128482,129500,130163,133273

However, directory outdir/out is empty. Could you advice on this issue? And could you give an example on how to run the quant-tcc command as I don't seem to find one in the manual?

Thanks.

I have the same problem. I created genemap file using kb ref. A help would be appreciated.

bound-to-love commented 2 months ago

Hi! The error may be twofold. First, note according to the bustools manual -g, --genemap File for mapping transcripts to genes whereas a gtf was supplied. kb ref can be used to generate the appropriate transcript to gene mapping. Second, bustools sort is being supplied a directory as the output file; I would suggest supplying outdir/sorted.bus and then bustools count should be given outdir/sorted.bus (the sorted bus file instead of the direct output of kallisto bus). Apologies for my delay in responding, but fixing these two errors should fix the issue.

Finally, if you want to have transcript quantifications you can use

${path_to_lr_kallisto} quant-tcc -t 32 \ ${output}/count.mtx \ -i ${ref}_k-63.idx \ -f ${output}/flens.txt \ -e ${output}/count.ec.txt \ -o ${output};

bound-to-love commented 2 months ago

Hi, I just wanted to give a full example to, hopefully, clear up any remaining confusion. Note that the flags --cm and -m are used with bustools count to count multiplicity (since it is bulk) and include multimapping:

${path_to_lr_kallisto} bus -t 32 --long --threshold 0.8 -x bulk -i ${ref}_k-63.idx -o ${output} ${reads}

${path_to_bustools} sort -t 32 ${output}/output.bus \
 -o ${output}/sorted.bus; \
 ${path_to_bustools} count ${output}/sorted.bus \
 -t ${output}/transcripts.txt \
 -e ${output}/matrix.ec \
 -o ${output}/count --cm -m \
 -g ${ref}.t2g; \

${path_to_lr_kallisto} quant-tcc -t 32 \
    --long -P ONT -f ${output}/flens.txt \
    ${output}/count.mtx -i ${ref}_k-63.idx \
    -e ${output}/count.ec.txt \
    -o ${output}; 

Please let me know if you run into any more difficulties!

icemduru commented 2 months ago

Hi, I just wanted to give a full example to, hopefully, clear up any remaining confusion. Note that the flags --cm and -m are used with bustools count to count multiplicity (since it is bulk) and include multimapping:

${path_to_lr_kallisto} bus -t 32 --long --threshold 0.8 -x bulk -i ${ref}_k-63.idx -o ${output} ${reads}

${path_to_bustools} sort -t 32 ${output}/output.bus \
 -o ${output}/sorted.bus; \
 ${path_to_bustools} count ${output}/sorted.bus \
 -t ${output}/transcripts.txt \
 -e ${output}/matrix.ec \
 -o ${output}/count --cm -m \
 -g ${ref}.t2g; \

${path_to_lr_kallisto} quant-tcc -t 32 \
  --long -P ONT -f ${output}/flens.txt \
  ${output}/count.mtx -i ${ref}_k-63.idx \
  -e ${output}/count.ec.txt \
  -o ${output}; 

Please let me know if you run into any more difficulties!

Hi Rebekah,

For me it helped to include --cm and -m options. Thanks a lot for the help. I think it is also useful to have --matrix-to-files option in quant-tcc command.

Regards

bound-to-love commented 2 months ago

You're welcome! Thanks for the feedback!

sbresnahan commented 1 month ago

Hi @bound-to-love - this issue thread was very helpful for learning how to use lr-kallisto for bulk lrRNAseq! I had to find this in the issues rather than the manual page though to figure it out. It would be great if the manual could be updated to more clearly demonstrate this use case as you have done here.