pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License

Error in generating pseudobam in new dev version #196

Closed jkniehaus closed 1 year ago

jkniehaus commented 1 year ago

Hello,

Great tool! I was really excited to see that the latest dev version added options to generate a BAM file. However, I'm running into an error during alignment with SMARTSEQ2 data. It looks as though the kallisto bus command doesn't register the technology.

Any guidance would be a big help.

Thanks! Jesse

Command:

kb count --verbose  \
-i kb_genome/transcriptome_GTE8.idx \
-g kb_genome/t2g_GTE8.txt \
-x SMARTSEQ2 \
--parity paired \
-o kallisto_out_dnaBAM030323/ \
-t 60 -m 60G --genomebam \
--gtf kb_genome/mm39GTE8.gtf \
--chromosomes kb_genome/mm39sizes.genome \
--tcc --loom --overwrite seq_batch.tsv

output:

[2023-03-03 19:45:19,847]   DEBUG [main] Printing verbose output
[2023-03-03 19:45:21,968]   DEBUG [main] kallisto binary located at /nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/bins/linux/kallisto/kallisto
[2023-03-03 19:45:21,968]   DEBUG [main] bustools binary located at /nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/bins/linux/bustools/bustools
[2023-03-03 19:45:21,970]   DEBUG [main] Creating `kallisto_out_dnaBAM030323/tmp` directory
[2023-03-03 19:45:21,973]   DEBUG [main] Namespace(bustools='/nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/bins/linux/bustools/bustools', c1=None, c2=None, cellranger=False, chromosomes='kb_genome/mm39sizes.genome', command='count', dry_run=False, em=False, fastqs=['seq_batch.tsv'], filter=None, filter_threshold=None, fragment_l=None, fragment_s=None, g='kb_genome/t2g_GTE8.txt', gene_names=False, genomebam=True, gtf='kb_genome/mm39GTE8.gtf', h5ad=False, i='kb_genome/transcriptome_GTE8.idx', kallisto='/nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/bins/linux/kallisto/kallisto', keep_tmp=False, list=False, loom=True, m='60G', mm=False, no_inspect=False, no_validate=False, o='kallisto_out_dnaBAM030323/', overwrite=True, parity='paired', report=False, strand=None, t=60, tcc=True, tmp=None, umi_gene=False, verbose=True, w=None, workflow='standard', x='SMARTSEQ2')
[2023-03-03 19:45:24,953]    INFO [count] Using index kb_genome/transcriptome_GTE8.idx to generate BUS file to kallisto_out_dnaBAM030323/ from
[2023-03-03 19:45:24,953]    INFO [count]         /work/users/k/y/kylius0/ssv4/ACA/fq/kallisto_out_dnaBAM030323/tmp/tmpbxggm71v
[2023-03-03 19:45:24,953]   DEBUG [count] kallisto bus -i kb_genome/transcriptome_GTE8.idx -o kallisto_out_dnaBAM030323/ -t 60 --paired --genomebam -g kb_genome/mm39GTE8.gtf -c kb_genome/mm39sizes.genome --batch /work/users/k/y/kylius0/ssv4/ACA/fq/kallisto_out_dnaBAM030323/tmp/tmpbxggm71v
[2023-03-03 19:45:25,058]   DEBUG [count] 
[2023-03-03 19:45:25,059]   DEBUG [count] [bus] no technology specified; will try running read files supplied in batch file
[2023-03-03 19:45:25,059]   DEBUG [count] [bus] --paired ignored; single/paired-end is inferred from number of files supplied
[2023-03-03 19:45:33,874]   DEBUG [count] Error: Pseudobam not supported yet in this mode
[2023-03-03 19:45:33,875]   DEBUG [count] kallisto 0.48.0
[2023-03-03 19:45:33,875]   DEBUG [count] Generates BUS files for single-cell sequencing
[2023-03-03 19:45:33,875]   DEBUG [count] 
[2023-03-03 19:45:33,875]   DEBUG [count] Usage: kallisto bus [arguments] FASTQ-files
[2023-03-03 19:45:33,875]   DEBUG [count] 
[2023-03-03 19:45:33,875]   DEBUG [count] Required arguments:
[2023-03-03 19:45:33,875]   DEBUG [count] -i, --index=STRING            Filename for the kallisto index to be used for
[2023-03-03 19:45:33,875]   DEBUG [count] pseudoalignment
[2023-03-03 19:45:33,875]   DEBUG [count] -o, --output-dir=STRING       Directory to write output to
[2023-03-03 19:45:33,875]   DEBUG [count] 
[2023-03-03 19:45:33,875]   DEBUG [count] Optional arguments:
[2023-03-03 19:45:33,875]   DEBUG [count] -x, --technology=STRING       Single-cell technology used
[2023-03-03 19:45:33,875]   DEBUG [count] -l, --list                    List all single-cell technologies supported
[2023-03-03 19:45:33,875]   DEBUG [count] -B, --batch=FILE              Process files listed in FILE
[2023-03-03 19:45:33,875]   DEBUG [count] -t, --threads=INT             Number of threads to use (default: 1)
[2023-03-03 19:45:33,875]   DEBUG [count] -b, --bam                     Input file is a BAM file
[2023-03-03 19:45:33,875]   DEBUG [count] -n, --num                     Output number of read in flag column (incompatible with --bam)
[2023-03-03 19:45:33,875]   DEBUG [count] -T, --tag=STRING              5′ tag sequence to identify UMI reads for certain technologies
[2023-03-03 19:45:33,875]   DEBUG [count] --fr-stranded             Strand specific reads for UMI-tagged reads, first read forward
[2023-03-03 19:45:33,875]   DEBUG [count] --rf-stranded             Strand specific reads for UMI-tagged reads, first read reverse
[2023-03-03 19:45:33,875]   DEBUG [count] --unstranded              Treat all read as non-strand-specific
[2023-03-03 19:45:33,875]   DEBUG [count] --paired                  Treat reads as paired
[2023-03-03 19:45:33,875]   DEBUG [count] --genomebam               Project pseudoalignments to genome sorted BAM file
[2023-03-03 19:45:33,875]   DEBUG [count] -g, --gtf                     GTF file for transcriptome information
[2023-03-03 19:45:33,875]   DEBUG [count] (required for --genomebam)
[2023-03-03 19:45:33,875]   DEBUG [count] -c, --chromosomes             Tab separated file with chromosome names and lengths
[2023-03-03 19:45:33,875]   DEBUG [count] (optional for --genomebam, but recommended)
[2023-03-03 19:45:33,875]   DEBUG [count] --verbose                 Print out progress information every 1M proccessed reads
[2023-03-03 19:45:33,876]   ERROR [count] 
[bus] no technology specified; will try running read files supplied in batch file
[bus] --paired ignored; single/paired-end is inferred from number of files supplied
Error: Pseudobam not supported yet in this mode
kallisto 0.48.0
Generates BUS files for single-cell sequencing

Usage: kallisto bus [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
pseudoalignment
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-x, --technology=STRING       Single-cell technology used
-l, --list                    List all single-cell technologies supported
-B, --batch=FILE              Process files listed in FILE
-t, --threads=INT             Number of threads to use (default: 1)
-b, --bam                     Input file is a BAM file
-n, --num                     Output number of read in flag column (incompatible with --bam)
-T, --tag=STRING              5′ tag sequence to identify UMI reads for certain technologies
--fr-stranded             Strand specific reads for UMI-tagged reads, first read forward
--rf-stranded             Strand specific reads for UMI-tagged reads, first read reverse
--unstranded              Treat all read as non-strand-specific
--paired                  Treat reads as paired
--genomebam               Project pseudoalignments to genome sorted BAM file
-g, --gtf                     GTF file for transcriptome information
(required for --genomebam)
-c, --chromosomes             Tab separated file with chromosome names and lengths
(optional for --genomebam, but recommended)
--verbose                 Print out progress information every 1M proccessed reads
[2023-03-03 19:45:33,876]   ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/main.py", line 1347, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/main.py", line 566, in parse_count
    count(
  File "/nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/count.py", line 1068, in count
    bus_result = kallisto_bus(
  File "/nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/validate.py", line 116, in inner
    results = func(*args, **kwargs)
  File "/nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/count.py", line 171, in kallisto_bus
    run_executable(command)
  File "/nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/dry/__init__.py", line 25, in inner
    return func(*args, **kwargs)
  File "/nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/utils.py", line 203, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/nas/longleaf/home/kylius0/.local/lib/python3.8/site-packages/kb_python-0.27.3-py3.8.egg/kb_python/bins/linux/kallisto/kallisto bus -i kb_genome/transcriptome_GTE8.idx -o kallisto_out_dnaBAM030323/ -t 60 --paired --genomebam -g kb_genome/mm39GTE8.gtf -c kb_genome/mm39sizes.genome --batch /work/users/k/y/kylius0/ssv4/ACA/fq/kallisto_out_dnaBAM030323/tmp/tmpbxggm71v' returned non-zero exit status 1.
[2023-03-03 19:45:33,884]   DEBUG [main] Removing `kallisto_out_dnaBAM030323/tmp` directory

Yenaled commented 1 year ago

The technology works; it's just that pseudobam for Smart-seq2 isn't supported yet in kallisto 0.48.0. I'm currently re-engineering the technology to work with pseudobam in the next version of kallisto (I'm planning to release that version, an immense update, by June, so hopefully this will make it into that release).

The only options currently are to reformat your FASTQ files to have barcodes and UMIs so that pseudobam works, or to run each pair of FASTQ files individually through kallisto quant. Both take a bit of work and extra time, which is why I'm working on getting this into the next kallisto release.
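
For anyone landing here, a minimal sketch of the per-pair `kallisto quant` workaround, not an official kb_python feature. It assumes the batch file has one whitespace-separated line per cell (`cell_id read1 read2`, with `#` comment lines), and that `kallisto quant` in your install accepts `--genomebam`, `--gtf`, and `--chromosomes`; check `kallisto quant` usage for your version. The helper names (`parse_batch`, `build_quant_commands`), the `quant_out` directory, and the example cell names are made up for illustration. The script only builds and prints the commands; it does not run kallisto.

```python
import shlex
from pathlib import Path


def parse_batch(batch_text):
    """Yield (cell_id, [fastq, ...]) for each non-comment line.

    Assumed format: whitespace-separated 'cell_id read1 read2'.
    """
    for line in batch_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cell_id, *fastqs = line.split()
        yield cell_id, fastqs


def build_quant_commands(batch_text, index, gtf, chromosomes, out_root, threads=4):
    """Build one `kallisto quant --genomebam` command per paired-end cell."""
    commands = []
    for cell_id, fastqs in parse_batch(batch_text):
        if len(fastqs) != 2:
            raise ValueError(f"{cell_id}: expected 2 FASTQ files, got {len(fastqs)}")
        commands.append([
            "kallisto", "quant",
            "-i", index,
            "-o", str(Path(out_root) / cell_id),  # one output directory per cell
            "-t", str(threads),
            "--genomebam",
            "--gtf", gtf,
            "--chromosomes", chromosomes,
            *fastqs,
        ])
    return commands


# Hypothetical two-cell batch; replace with the contents of your own batch file.
example_batch = """#id read1 read2
cellA cellA_R1.fastq.gz cellA_R2.fastq.gz
cellB cellB_R1.fastq.gz cellB_R2.fastq.gz
"""

for cmd in build_quant_commands(
    example_batch,
    index="kb_genome/transcriptome_GTE8.idx",
    gtf="kb_genome/mm39GTE8.gtf",
    chromosomes="kb_genome/mm39sizes.genome",
    out_root="quant_out",
):
    print(shlex.join(cmd))
```

To actually run the commands, pass each list to `subprocess.run(cmd, check=True)`. Each cell then gets its own output directory with its own BAM, so you would still need to merge or combine the results yourself afterwards.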

jkniehaus commented 1 year ago

Gotcha. I will try the latter approach for simplicity (either run the FASTQs individually, or subset the FASTQs by cluster and combine). Looking forward to the release.