pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License

Issue trying to use batch file for fastq pairs that are already demultiplexed #203

Closed vweigman closed 1 year ago

vweigman commented 1 year ago

Describe the issue

Running kb count fails when FASTQ files are provided directly, and the documented '--batch' flag for supplying a batch FASTQ file is not recognized. The scRNA-Seq technology used does not have UMIs and already produces FASTQ pairs (BOTH reads contain RNA insert sequence), so I just need to perform counting on either of the FASTQ reads.

What is the exact command that was run?

kb count --overwrite --verbose -t 4 -g /home/ubuntu/environment/RnD/DCIS-manuscript/TotalSeq/TotalSeqA_CocktailBarcodes.t2g -i /home/ubuntu/environment/RnD/DCIS-manuscript/TotalSeq/TotalSeqA_CocktailBarcodes.idx -o AACGTTWM5-TME20-026-F03-DNA-SC01_S137 -x BULK --parity paired --strand forward --workflow kite AACGTTWM5-TME20-026-F03-DNA-SC01_S137_L001_R1_001.fastq.gz AACGTTWM5-TME20-026-F03-DNA-SC01_S137_L001_R2_001.fastq.gz

Command output (with --verbose flag)

[2023-04-24 13:57:57,647]   DEBUG [main] Printing verbose output
[2023-04-24 13:57:59,861]   DEBUG [main] kallisto binary located at /home/ubuntu/.local/lib/python3.6/site-packages/kb_python/bins/compiled/kallisto/kallisto
[2023-04-24 13:57:59,862]   DEBUG [main] bustools binary located at /home/ubuntu/.local/lib/python3.6/site-packages/kb_python/bins/compiled/bustools/bustools
[2023-04-24 13:57:59,862]   DEBUG [main] Creating `AACGTTWM5-TME20-026-F03-DNA-SC01_S137/tmp` directory
[2023-04-24 13:57:59,862]   DEBUG [main] Namespace(bustools='/home/ubuntu/.local/lib/python3.6/site-packages/kb_python/bins/compiled/bustools/bustools', c1=None, c2=None, cellranger=False, command='count', dry_run=False, em=False, fastqs=['AACGTTWM5-TME20-026-F03-DNA-SC01_S137_L001_R1_001.fastq.gz', 'AACGTTWM5-TME20-026-F03-DNA-SC01_S137_L001_R2_001.fastq.gz'], filter=None, filter_threshold=None, fragment_l=None, fragment_s=None, g='/home/ubuntu/environment/RnD/DCIS-manuscript/TotalSeq/TotalSeqA_CocktailBarcodes.t2g', gene_names=False, h5ad=False, i='/home/ubuntu/environment/RnD/DCIS-manuscript/TotalSeq/TotalSeqA_CocktailBarcodes.idx', kallisto='/home/ubuntu/.local/lib/python3.6/site-packages/kb_python/bins/compiled/kallisto/kallisto', keep_tmp=False, list=False, loom=False, m='4G', mm=False, no_inspect=False, no_validate=False, o='AACGTTWM5-TME20-026-F03-DNA-SC01_S137', overwrite=True, parity='paired', report=False, strand='forward', t=4, tcc=False, tmp=None, umi_gene=False, verbose=True, w=None, workflow='kite', x='BULK')
[2023-04-24 13:57:59,862] WARNING [main] FASTQs were provided for technology `BULK`. Assuming multiplexed samples. For demultiplexed samples, provide a batch textfile.
[2023-04-24 13:58:02,970]    INFO [count] Using index /home/ubuntu/environment/RnD/DCIS-manuscript/TotalSeq/TotalSeqA_CocktailBarcodes.idx to generate BUS file to AACGTTWM5-TME20-026-F03-DNA-SC01_S137 from
[2023-04-24 13:58:02,970]    INFO [count]         AACGTTWM5-TME20-026-F03-DNA-SC01_S137_L001_R1_001.fastq.gz
[2023-04-24 13:58:02,970]    INFO [count]         AACGTTWM5-TME20-026-F03-DNA-SC01_S137_L001_R2_001.fastq.gz
[2023-04-24 13:58:02,970]   DEBUG [count] kallisto bus -i /home/ubuntu/environment/RnD/DCIS-manuscript/TotalSeq/TotalSeqA_CocktailBarcodes.idx -o AACGTTWM5-TME20-026-F03-DNA-SC01_S137 -x BULK -t 4 --paired --fr-stranded AACGTTWM5-TME20-026-F03-DNA-SC01_S137_L001_R1_001.fastq.gz AACGTTWM5-TME20-026-F03-DNA-SC01_S137_L001_R2_001.fastq.gz
[2023-04-24 13:58:04,079]   DEBUG [count] 
[2023-04-24 13:58:04,079]   DEBUG [count] Error: Number of files (2) does not match number of input files required by technology BULK (4)
[2023-04-24 13:58:04,079]   DEBUG [count] kallisto 0.48.0
[2023-04-24 13:58:04,079]   DEBUG [count] Generates BUS files for single-cell sequencing
[2023-04-24 13:58:04,079]   DEBUG [count] 
[2023-04-24 13:58:04,079]   DEBUG [count] Usage: kallisto bus [arguments] FASTQ-files
[2023-04-24 13:58:04,079]   DEBUG [count] 
[2023-04-24 13:58:04,079]   DEBUG [count] Required arguments:
[2023-04-24 13:58:04,079]   DEBUG [count] -i, --index=STRING            Filename for the kallisto index to be used for
[2023-04-24 13:58:04,079]   DEBUG [count] pseudoalignment
[2023-04-24 13:58:04,080]   DEBUG [count] -o, --output-dir=STRING       Directory to write output to
[2023-04-24 13:58:04,080]   DEBUG [count] 
[2023-04-24 13:58:04,080]   DEBUG [count] Optional arguments:
[2023-04-24 13:58:04,080]   DEBUG [count] -x, --technology=STRING       Single-cell technology used
[2023-04-24 13:58:04,080]   DEBUG [count] -l, --list                    List all single-cell technologies supported
[2023-04-24 13:58:04,080]   DEBUG [count] -B, --batch=FILE              Process files listed in FILE
[2023-04-24 13:58:04,080]   DEBUG [count] -t, --threads=INT             Number of threads to use (default: 1)
[2023-04-24 13:58:04,080]   DEBUG [count] -b, --bam                     Input file is a BAM file
[2023-04-24 13:58:04,080]   DEBUG [count] -n, --num                     Output number of read in flag column (incompatible with --bam)
[2023-04-24 13:58:04,080]   DEBUG [count] -T, --tag=STRING              5′ tag sequence to identify UMI reads for certain technologies
[2023-04-24 13:58:04,080]   DEBUG [count] --fr-stranded             Strand specific reads for UMI-tagged reads, first read forward
[2023-04-24 13:58:04,080]   DEBUG [count] --rf-stranded             Strand specific reads for UMI-tagged reads, first read reverse
[2023-04-24 13:58:04,080]   DEBUG [count] --unstranded              Treat all read as non-strand-specific
[2023-04-24 13:58:04,080]   DEBUG [count] --paired                  Treat reads as paired
[2023-04-24 13:58:04,080]   DEBUG [count] --genomebam               Project pseudoalignments to genome sorted BAM file
[2023-04-24 13:58:04,080]   DEBUG [count] -g, --gtf                     GTF file for transcriptome information
[2023-04-24 13:58:04,080]   DEBUG [count] (required for --genomebam)
[2023-04-24 13:58:04,080]   DEBUG [count] -c, --chromosomes             Tab separated file with chromosome names and lengths
[2023-04-24 13:58:04,080]   DEBUG [count] (optional for --genomebam, but recommended)
[2023-04-24 13:58:04,081]   DEBUG [count] --verbose                 Print out progress information every 1M proccessed reads
[2023-04-24 13:58:04,081]   ERROR [count] 
Error: Number of files (2) does not match number of input files required by technology BULK (4)
kallisto 0.48.0
Generates BUS files for single-cell sequencing

Usage: kallisto bus [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
pseudoalignment
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-x, --technology=STRING       Single-cell technology used
-l, --list                    List all single-cell technologies supported
-B, --batch=FILE              Process files listed in FILE
-t, --threads=INT             Number of threads to use (default: 1)
-b, --bam                     Input file is a BAM file
-n, --num                     Output number of read in flag column (incompatible with --bam)
-T, --tag=STRING              5′ tag sequence to identify UMI reads for certain technologies
--fr-stranded             Strand specific reads for UMI-tagged reads, first read forward
--rf-stranded             Strand specific reads for UMI-tagged reads, first read reverse
--unstranded              Treat all read as non-strand-specific
--paired                  Treat reads as paired
--genomebam               Project pseudoalignments to genome sorted BAM file
-g, --gtf                     GTF file for transcriptome information
(required for --genomebam)
-c, --chromosomes             Tab separated file with chromosome names and lengths
(optional for --genomebam, but recommended)
--verbose                 Print out progress information every 1M proccessed reads
[2023-04-24 13:58:04,081]   ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/kb_python/main.py", line 1305, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/kb_python/main.py", line 578, in parse_count
    by_name=args.gene_names
  File "/home/ubuntu/.local/lib/python3.6/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/kb_python/count.py", line 1045, in count
    strand=strand,
  File "/home/ubuntu/.local/lib/python3.6/site-packages/kb_python/validate.py", line 116, in inner
    results = func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/kb_python/count.py", line 150, in kallisto_bus
    run_executable(command)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/kb_python/dry/__init__.py", line 25, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/kb_python/utils.py", line 203, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/home/ubuntu/.local/lib/python3.6/site-packages/kb_python/bins/compiled/kallisto/kallisto bus -i /home/ubuntu/environment/RnD/DCIS-manuscript/TotalSeq/TotalSeqA_CocktailBarcodes.idx -o AACGTTWM5-TME20-026-F03-DNA-SC01_S137 -x BULK -t 4 --paired --fr-stranded AACGTTWM5-TME20-026-F03-DNA-SC01_S137_L001_R1_001.fastq.gz AACGTTWM5-TME20-026-F03-DNA-SC01_S137_L001_R2_001.fastq.gz' returned non-zero exit status 1.
[2023-04-24 13:58:04,082]   DEBUG [main] Removing `AACGTTWM5-TME20-026-F03-DNA-SC01_S137/tmp` directory

Yenaled commented 1 year ago

Try putting the files in a batch.txt file as follows:

batch1 R1.fastq.gz R2.fastq.gz

And then rerunning the command by supplying batch.txt in lieu of those files on the command line.
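The suggestion above can be sketched in Python as follows. This is a minimal illustration, not part of kb_python: `write_batch_file` and `build_kb_count_args` are hypothetical helper names, the single-space separator simply mirrors the example line in this comment, and the sketch assumes every other flag from the original command stays unchanged. The command is only assembled as a list, not executed.

```python
import tempfile
from pathlib import Path


def write_batch_file(path, samples):
    """Write one line per sample: '<sample_id> <read1> <read2>'.

    `samples` maps a sample ID to its (read1, read2) FASTQ paths.
    The space separator mirrors the example line in this thread.
    """
    path = Path(path)
    lines = [f"{sample_id} {r1} {r2}" for sample_id, (r1, r2) in samples.items()]
    path.write_text("\n".join(lines) + "\n")
    return path


def build_kb_count_args(index, t2g, out_dir, batch_file, threads=4):
    """Assemble the kb count call with the batch file replacing the FASTQs.

    Assumption: all flags are copied from the command in the original
    report; only the trailing FASTQ arguments are swapped for batch_file.
    """
    return [
        "kb", "count", "--overwrite", "--verbose",
        "-t", str(threads),
        "-g", str(t2g),
        "-i", str(index),
        "-o", str(out_dir),
        "-x", "BULK",
        "--parity", "paired",
        "--strand", "forward",
        "--workflow", "kite",
        str(batch_file),
    ]


# Demo in a temporary directory; file names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    batch = write_batch_file(
        Path(tmp) / "batch.txt",
        {"batch1": ("R1.fastq.gz", "R2.fastq.gz")},
    )
    print(batch.read_text(), end="")
    print(" ".join(build_kb_count_args("tags.idx", "tags.t2g", "out", batch)))
```

The list returned by `build_kb_count_args` could be passed to `subprocess.run` once the index, t2g, and FASTQ paths point at real files.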

vweigman commented 1 year ago

Yep, that worked. And I got an output in AACGTTWM5-TME20-026-F03-DNA-SC01_S137/counts_unfiltered/cells_x_features.mtx that looks like this:

%%MatrixMarket matrix coordinate real general
%
%
1 174 140
1 1 6
1 3 5
1 4 5
1 5 3
1 6 13
1 7 2

I assume I can ignore the first 4 rows, and then the columns are: sampleNumber (specified in the batch file), TagID (specified in my barcode file and in cells_x_features.genes.txt), and Counts (of that barcode in those FASTQs).

Is that the right way to interpret those counts? And is the counting done on both reads of each pair? Appreciate the help and the super fast response!

Yenaled commented 1 year ago

Yep, that is all correct!
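The interpretation confirmed above can be checked with a small parser. This is a minimal sketch using only the standard library, not the loader kb_python itself uses. `parse_mtx` and `label_entries` are hypothetical names, the embedded text is the excerpt quoted in this thread laid out one record per line (an assumed layout), and the `tag_N` feature names are placeholders standing in for the lines of cells_x_features.genes.txt. Matrix Market indices are 1-based, so they are shifted by one when looking up names.

```python
EXCERPT = """%%MatrixMarket matrix coordinate real general
%
%
1 174 140
1 1 6
1 3 5
1 4 5
1 5 3
1 6 13
1 7 2
"""


def parse_mtx(text):
    """Parse a Matrix Market coordinate file given as text.

    Lines starting with '%' are header/comment lines. The first remaining
    line holds (rows, columns, stored entries); every later line is a
    (row index, column index, value) triple.
    """
    body = [ln.strip() for ln in text.splitlines()]
    body = [ln for ln in body if ln and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(tok) for tok in body[0].split())
    entries = []
    for ln in body[1:]:
        row, col, value = ln.split()
        entries.append((int(row), int(col), float(value)))
    return (n_rows, n_cols, n_entries), entries


def label_entries(entries, sample_ids, feature_ids):
    """Replace 1-based indices with sample and feature names."""
    return [(sample_ids[r - 1], feature_ids[c - 1], v) for r, c, v in entries]


shape, entries = parse_mtx(EXCERPT)
# The excerpt is truncated, so it holds fewer triples than the header declares.
sample_ids = ["batch1"]                               # ID from the batch file
feature_ids = [f"tag_{i}" for i in range(1, shape[1] + 1)]  # placeholder names
for sample, feature, count in label_entries(entries, sample_ids, feature_ids):
    print(sample, feature, count)
```

Here the dimension line reads as 1 sample, 174 features, and 140 stored counts, and each later line as sample index, feature index, count.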

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days