pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
141 stars 23 forks source link

Running kb_python for Smart-seq3 data where index files have not been provided. #257

Closed axelalmet closed 3 weeks ago

axelalmet commented 3 weeks ago

Hi kb-python team,

I am interested in processing some published Smart-seq3 data to get unspliced and spliced counts (mainly out of curiosity).

This was the command I tried to run with kb count for a single sample:

kb count --h5ad -i ../../kb_indices/mm10/index.idx -g ../../kb_indices/mm10/t2g.txt -x SmartSeq3 -o SRR22570661 \
            -c1 ../../kb_indices/mm10/cdna.txt -c2 ../../kb_indices/mm10/nascent.txt --workflow nac -t 8 \
            SRR22570661_1.fastq.gz SRR22570661_2.fastq.gz

This is the resulting error I get:

[2024-06-11 21:25:45,119]    INFO [count_nac] Using index ../../kb_indices/mm10/index.idx to generate BUS file to SRR22570661 from
[2024-06-11 21:25:45,119]    INFO [count_nac]         SRR22570661_1.fastq.gz
[2024-06-11 21:25:45,119]    INFO [count_nac]         SRR22570661_2.fastq.gz
[2024-06-11 21:25:46,220]   ERROR [count_nac] 
kallisto 0.50.1
Generates BUS files for single-cell sequencing

Usage: kallisto bus [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
pseudoalignment
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-x, --technology=STRING       Single-cell technology used
-l, --list                    List all single-cell technologies supported
-B, --batch=FILE              Process files listed in FILE
-t, --threads=INT             Number of threads to use (default: 1)
-b, --bam                     Input file is a BAM file
-n, --num                     Output number of read in flag column (incompatible with --bam)
-N, --numReads                Maximum number of reads to process from supplied input
-T, --tag=STRING              5′ tag sequence to identify UMI reads for certain technologies
--fr-stranded             Strand specific reads for UMI-tagged reads, first read forward
--rf-stranded             Strand specific reads for UMI-tagged reads, first read reverse
--unstranded              Treat all read as non-strand-specific
--paired                  Treat reads as paired
--aa                      Align to index generated from a FASTA-file containing amino acid sequences
--inleaved                Specifies that input is an interleaved FASTQ file
--batch-barcodes          Records both batch and extracted barcode in BUS file
--verbose                 Print out progress information every 1M proccessed reads
[bus] Using ATTGCGCAATG as UMI tag sequence
[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
Error: Number of files (2) does not match number of input files required by technology SMARTSEQ3 (4)
[2024-06-11 21:25:46,220]   ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/home/axel/miniconda3/lib/python3.12/site-packages/kb_python/main.py", line 1618, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/home/axel/miniconda3/lib/python3.12/site-packages/kb_python/main.py", line 592, in parse_count
    count_nac(
  File "/home/axel/miniconda3/lib/python3.12/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/axel/miniconda3/lib/python3.12/site-packages/kb_python/count.py", line 1791, in count_nac
    bus_result = kallisto_bus(
                 ^^^^^^^^^^^^^
  File "/home/axel/miniconda3/lib/python3.12/site-packages/kb_python/count.py", line 203, in kallisto_bus
    run_executable(command)
  File "/home/axel/miniconda3/lib/python3.12/site-packages/kb_python/dry/__init__.py", line 25, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/axel/miniconda3/lib/python3.12/site-packages/kb_python/utils.py", line 203, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/home/axel/miniconda3/lib/python3.12/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i ../../kb_indices/mm10/index.idx -o SRR22570661 -x SmartSeq3 -t 8 --paired SRR22570661_1.fastq.gz SRR22570661_2.fastq.gz' returned non-zero exit status 1.

which all makes sense, as I understand that one has to provide the index I1 and I2 files for SmartSeq3 analysis.

However, when I reexamined uploaded data at SRA run browser here, I see that the authors didn't upload the index files.

Therefore, is it possible to still realign the data without these index files?

Thank you for your time!

Best wishes, Axel.

Yenaled commented 3 weeks ago

I haven’t looked at that dataset, but without the index files, how do you correspond reads to cells? I.e. how do you know which read belongs to which cell?

axelalmet commented 3 weeks ago

Hi @Yenaled,

I'm definitely not an expert on Smart-seq3 data by any means, but from what I can tell from the GEO accession, there's an associated metadata file that tells you which cell corresponds to which run, which matches the processed barcodes provided by the authors.

I'm not sure if that fully answers your question, but that's the most I know right now.

Best wishes, Axel.

Yenaled commented 3 weeks ago

In that case, your data is "demultiplexed" (i.e. each FASTQ file represents an individual cell). This is identical to the issue here: https://github.com/pachterlab/kb_python/issues/235 where a solution is presented.

axelalmet commented 3 weeks ago

Solution #235 worked a treat, thank you so much!