Closed alanlamsiu closed 2 months ago
See the —long options in the manual https://pachterlab.github.io/kallisto/manual . To run lr-kallisto on bulk reads use -x bulk —long. Note there is also a —long option for quant-tcc for quantifying at the transcript level. Also, note that you will need to build kallisto from source in a new directory with cmake -DMAX_KMER_SIZE=64 ..
Hi, just wanted to add the note that kallisto index will need run with -k 63. Let me know if you run into any issues!
Hi @bound-to-love, thanks for the advice. I built kallisto 0.51.0 from source by setting DMAX_KMER_SIZE=64. When I checked the help manual of quant-tcc using command line, I got the following,
Quantifies abundance from pre-computed transcript-compatibility counts
Usage: kallisto quant-tcc [arguments] transcript-compatibility-counts-file
Required arguments:
-o, --output-dir=STRING Directory to write output to
Optional arguments:
-i, --index=STRING Filename for the kallisto index to be used
(required if file with names of transcripts not supplied)
-T, --txnames=STRING File with names of transcripts
(required if index file not supplied)
-e, --ec-file=FILE File containing equivalence classes
(default: equivalence classes are taken from the index)
-f, --fragment-file=FILE File containing fragment length distribution
(default: effective length normalization is not performed)
--long Use version of EM for long reads
-P, --platform. [PacBio or ONT] used for sequencing
-l, --fragment-length=DOUBLE Estimated average fragment length
-s, --sd=DOUBLE Estimated standard deviation of fragment length
(note: -l, -s values only should be supplied when
effective length normalization needs to be performed
but --fragment-file is not specified)
-p, --priors Priors for the EM algorithm, either as raw counts or as
probabilities. Pseudocounts are added to raw reads to
prevent zero valued priors. Supplied in the same order
as the transcripts in the transcriptome
-t, --threads=INT Number of threads to use (default: 1)
-g, --genemap File for mapping transcripts to genes
(required for obtaining gene-level abundances)
-G, --gtf=FILE GTF file for transcriptome information
(can be used instead of --genemap for obtaining gene-level abundances)
-b, --bootstrap-samples=INT Number of bootstrap samples (default: 0)
--matrix-to-files Reorganize matrix output into abundance tsv files
--matrix-to-directories Reorganize matrix output into abundance tsv files across multiple directories
--seed=INT Seed for the bootstrap sampling (default: 42)
--plaintext Output plaintext only, not HDF5
which looks different than the one in https://pachterlab.github.io/kallisto/manual. I have both the index and fastq files. What would be the steps in order to generate the transcript-compatibility-counts-file
required by quant-tcc?
Hello, as described in the manual "The necessary files can all be generated from bustools count in bustools." So the commands would be kallisto bus, bustools sort, bustools correct (optional), bustools count, kallisto quant-tcc with all the necessary inputs/outputs.
Hi @bound-to-love, I learned the steps from a kallisto issue posted by another user and here are the steps.
kallisto bus -i index_file -o outdir --long -x bulk sample.fastq.gz
bustools sort -o outdir outdir/output.bus
bustools count -o outdir/out -g sample.gtf -e outdir/matrix.ec -t outdir/transcripts.txt outdir/output.bus
I notice that the outdir/output.bus
file is not empty. The beginning few lines of outdir/matrix.ec
shows below. And the outdir/transcripts.txt
file shows a list of transcript IDs as in the transcriptome fasta file used for making the index.
0 188087
1 128482,133273
2 129500
3 128482
4 128482,129500,130163,133273
However, directory outdir/out
is empty. Could you advice on this issue? And could you give an example on how to run the quant-tcc command as I don't seem to find one in the manual?
Thanks.
Hi @bound-to-love, I learned the steps from a kallisto issue posted by another user and here are the steps.
kallisto bus -i index_file -o outdir --long -x bulk sample.fastq.gz bustools sort -o outdir outdir/output.bus bustools count -o outdir/out -g sample.gtf -e outdir/matrix.ec -t outdir/transcripts.txt outdir/output.bus
I notice that the
outdir/output.bus
file is not empty. The beginning few lines ofoutdir/matrix.ec
shows below. And theoutdir/transcripts.txt
file shows a list of transcript IDs as in the transcriptome fasta file used for making the index.0 188087 1 128482,133273 2 129500 3 128482 4 128482,129500,130163,133273
However, directory
outdir/out
is empty. Could you advice on this issue? And could you give an example on how to run the quant-tcc command as I don't seem to find one in the manual?Thanks.
I have the same problem. I created genemap file using kb ref. A help would be appreciated.
Hi! The error may be twofold. First, note according to the bustools manual -g, --genemap File for mapping transcripts to genes
whereas a gtf was supplied. kb ref can be used to generate the appropriate transcript to gene mapping. Second, bustools sort is being supplied a directory as the output file; I would suggest supplying outdir/sorted.bus and then bustools count should be given outdir/sorted.bus (the sorted bus file instead of the direct output of kallisto bus). Apologies for my delay in responding, but fixing these two errors should fix the issue.
Finally, if you want to have transcript quantifications you can use
${path_to_lr_kallisto} quant-tcc -t 32 \ ${output}/count.mtx \ -i ${ref}_k-63.idx \ -f ${output}/flens.txt \ -e ${output}/count.ec.txt \ -o ${output};
Hi, I just wanted to give a full example to, hopefully, clear up any remaining confusion. Note that the flags --cm and -m are used with bustools count to count multiplicity (since it is bulk) and include multimapping:
${path_to_lr_kallisto} bus -t 32 --long --threshold 0.8 -x bulk -i ${ref}_k-63.idx -o ${output} ${reads}
${path_to_bustools} sort -t 32 ${output}/output.bus \
-o ${output}/sorted.bus; \
${path_to_bustools} count ${output}/sorted.bus \
-t ${output}/transcripts.txt \
-e ${output}/matrix.ec \
-o ${output}/count --cm -m \
-g ${ref}.t2g; \
${path_to_lr_kallisto} quant-tcc -t 32 \
--long -P ONT -f ${output}/flens.txt \
${output}/count.mtx -i ${ref}_k-63.idx \
-e ${output}/count.ec.txt \
-o ${output};
Please let me know if you run into any more difficulties!
Hi, I just wanted to give a full example to, hopefully, clear up any remaining confusion. Note that the flags --cm and -m are used with bustools count to count multiplicity (since it is bulk) and include multimapping:
${path_to_lr_kallisto} bus -t 32 --long --threshold 0.8 -x bulk -i ${ref}_k-63.idx -o ${output} ${reads} ${path_to_bustools} sort -t 32 ${output}/output.bus \ -o ${output}/sorted.bus; \ ${path_to_bustools} count ${output}/sorted.bus \ -t ${output}/transcripts.txt \ -e ${output}/matrix.ec \ -o ${output}/count --cm -m \ -g ${ref}.t2g; \ ${path_to_lr_kallisto} quant-tcc -t 32 \ --long -P ONT -f ${output}/flens.txt \ ${output}/count.mtx -i ${ref}_k-63.idx \ -e ${output}/count.ec.txt \ -o ${output};
Please let me know if you run into any more difficulties!
Hi Rebekah,
For me it helped to include --cm and -m options. Thanks a lot for the help. I think it is also useful to have --matrix-to-files option in quant-tcc command.
Regards
You're welcome! Thanks for the feedback!
Hi @bound-to-love - this issue thread was very helpful for learning how to use lr-kallisto for bulk lrRNAseq! I had to find this in the issues rather than the manual page though to figure it out. It would be great if the manual could be updated to more clearly demonstrate this use case as you have done here.
Hi kallisto team,
Could you point me to the documentation showing steps for running lr-kallisto for long-read bulk RNA-seq?