pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License

Error: Number of files (9) does not match number of input files required by technology 10XV3 (2) #174

Closed monoplasty closed 1 year ago

monoplasty commented 1 year ago

Describe the issue kb_python 0.27.3

I get this error when trying to process the same batch of data, and I don't know why it is happening. Any advice would be appreciated, thank you!

What is the exact command that was run?

kb count -i /data/kallisto/refdata/human/transcriptome.idx -g /data/kallisto/refdata/human/transcripts_to_genes.txt -t 16 -m 8G --h5ad --verbose  --overwrite  -x 10XV3 -o /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/h5adoutput/ \
/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_R2.fastq.gz \
/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_R1.fastq.gz \
/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_I1.fastq.gz \
/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_I1.fastq.gz \
/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_R2.fastq.gz \
/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_R1.fastq.gz \
/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_R1.fastq.gz \
/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_I1.fastq.gz \
/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_R2.fastq.gz

Command output (with --verbose flag)

[2022-09-21 16:49:23,489]   DEBUG [main] Printing verbose output
[2022-09-21 16:49:25,608]   DEBUG [main] kallisto binary located at /usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/bins/linux/kallisto/kallisto
[2022-09-21 16:49:25,608]   DEBUG [main] bustools binary located at /usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/bins/linux/bustools/bustools
[2022-09-21 16:49:25,608]   DEBUG [main] Creating `/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/h5adoutput/tmp` directory
[2022-09-21 16:49:25,608]   DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='/data/kallisto/refdata/human/transcriptome.idx', g='/data/kallisto/refdata/human/transcripts_to_genes.txt', x='10XV3', o='/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/h5adoutput/', w='None', t=8, m='8G', strand=None, workflow='standard', em=False, umi_gene=False, mm=False, tcc=False, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=True, dry_run=False, loom=False, h5ad=True, cellranger=False, gene_names=False, report=False, no_inspect=False, kallisto='/usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/bins/linux/kallisto/kallisto', bustools='/usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/bins/linux/bustools/bustools', no_validate=False, parity=None, fragment_l=None, fragment_s=None, fastqs=['/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_R2.fastq.gz', '/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_R1.fastq.gz', '/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_I1.fastq.gz', '/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_I1.fastq.gz', '/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_R2.fastq.gz', '/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_R1.fastq.gz', '/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_R1.fastq.gz', '/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_I1.fastq.gz', '/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_R2.fastq.gz'])
[2022-09-21 16:49:28,004]    INFO [count] Using index /data/kallisto/refdata/human/transcriptome.idx to generate BUS file to /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/h5adoutput/ from
[2022-09-21 16:49:28,005]    INFO [count]         /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_R2.fastq.gz
[2022-09-21 16:49:28,005]    INFO [count]         /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_R1.fastq.gz
[2022-09-21 16:49:28,005]    INFO [count]         /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_I1.fastq.gz
[2022-09-21 16:49:28,005]    INFO [count]         /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_I1.fastq.gz
[2022-09-21 16:49:28,005]    INFO [count]         /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_R2.fastq.gz
[2022-09-21 16:49:28,005]    INFO [count]         /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_R1.fastq.gz
[2022-09-21 16:49:28,005]    INFO [count]         /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_R1.fastq.gz
[2022-09-21 16:49:28,005]    INFO [count]         /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_I1.fastq.gz
[2022-09-21 16:49:28,005]    INFO [count]         /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_R2.fastq.gz
[2022-09-21 16:49:28,005]   DEBUG [count] kallisto bus -i /data/kallisto/refdata/human/transcriptome.idx -o /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/h5adoutput/ -x 10XV3 -t 8 /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_R2.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_R1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_I1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_I1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_R2.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_R1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_R1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_I1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_R2.fastq.gz
[2022-09-21 16:49:29,113]   DEBUG [count] 
[2022-09-21 16:49:29,113]   DEBUG [count] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2022-09-21 16:49:29,113]   DEBUG [count] Error: Number of files (9) does not match number of input files required by technology 10XV3 (2)
[2022-09-21 16:49:29,113]   DEBUG [count] kallisto 0.48.0
[2022-09-21 16:49:29,113]   DEBUG [count] Generates BUS files for single-cell sequencing
[2022-09-21 16:49:29,113]   DEBUG [count] 
[2022-09-21 16:49:29,113]   DEBUG [count] Usage: kallisto bus [arguments] FASTQ-files
[2022-09-21 16:49:29,113]   DEBUG [count] 
[2022-09-21 16:49:29,113]   DEBUG [count] Required arguments:
[2022-09-21 16:49:29,113]   DEBUG [count] -i, --index=STRING            Filename for the kallisto index to be used for
[2022-09-21 16:49:29,114]   DEBUG [count] pseudoalignment
[2022-09-21 16:49:29,114]   DEBUG [count] -o, --output-dir=STRING       Directory to write output to
[2022-09-21 16:49:29,114]   DEBUG [count] 
[2022-09-21 16:49:29,114]   DEBUG [count] Optional arguments:
[2022-09-21 16:49:29,114]   DEBUG [count] -x, --technology=STRING       Single-cell technology used
[2022-09-21 16:49:29,114]   DEBUG [count] -l, --list                    List all single-cell technologies supported
[2022-09-21 16:49:29,114]   DEBUG [count] -B, --batch=FILE              Process files listed in FILE
[2022-09-21 16:49:29,114]   DEBUG [count] -t, --threads=INT             Number of threads to use (default: 1)
[2022-09-21 16:49:29,114]   DEBUG [count] -b, --bam                     Input file is a BAM file
[2022-09-21 16:49:29,114]   DEBUG [count] -n, --num                     Output number of read in flag column (incompatible with --bam)
[2022-09-21 16:49:29,114]   DEBUG [count] -T, --tag=STRING              5′ tag sequence to identify UMI reads for certain technologies
[2022-09-21 16:49:29,114]   DEBUG [count] --fr-stranded             Strand specific reads for UMI-tagged reads, first read forward
[2022-09-21 16:49:29,114]   DEBUG [count] --rf-stranded             Strand specific reads for UMI-tagged reads, first read reverse
[2022-09-21 16:49:29,114]   DEBUG [count] --unstranded              Treat all read as non-strand-specific
[2022-09-21 16:49:29,114]   DEBUG [count] --paired                  Treat reads as paired
[2022-09-21 16:49:29,114]   DEBUG [count] --genomebam               Project pseudoalignments to genome sorted BAM file
[2022-09-21 16:49:29,114]   DEBUG [count] -g, --gtf                     GTF file for transcriptome information
[2022-09-21 16:49:29,114]   DEBUG [count] (required for --genomebam)
[2022-09-21 16:49:29,114]   DEBUG [count] -c, --chromosomes             Tab separated file with chromosome names and lengths
[2022-09-21 16:49:29,114]   DEBUG [count] (optional for --genomebam, but recommended)
[2022-09-21 16:49:29,114]   DEBUG [count] --verbose                 Print out progress information every 1M proccessed reads
[2022-09-21 16:49:29,114]   ERROR [count] 
[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
Error: Number of files (9) does not match number of input files required by technology 10XV3 (2)
kallisto 0.48.0
Generates BUS files for single-cell sequencing

Usage: kallisto bus [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
pseudoalignment
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-x, --technology=STRING       Single-cell technology used
-l, --list                    List all single-cell technologies supported
-B, --batch=FILE              Process files listed in FILE
-t, --threads=INT             Number of threads to use (default: 1)
-b, --bam                     Input file is a BAM file
-n, --num                     Output number of read in flag column (incompatible with --bam)
-T, --tag=STRING              5′ tag sequence to identify UMI reads for certain technologies
--fr-stranded             Strand specific reads for UMI-tagged reads, first read forward
--rf-stranded             Strand specific reads for UMI-tagged reads, first read reverse
--unstranded              Treat all read as non-strand-specific
--paired                  Treat reads as paired
--genomebam               Project pseudoalignments to genome sorted BAM file
-g, --gtf                     GTF file for transcriptome information
(required for --genomebam)
-c, --chromosomes             Tab separated file with chromosome names and lengths
(optional for --genomebam, but recommended)
--verbose                 Print out progress information every 1M proccessed reads
[2022-09-21 16:49:29,114]   ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/main.py", line 1305, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/main.py", line 550, in parse_count
    count(
  File "/usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/count.py", line 1038, in count
    bus_result = kallisto_bus(
  File "/usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/validate.py", line 116, in inner
    results = func(*args, **kwargs)
  File "/usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/count.py", line 150, in kallisto_bus
    run_executable(command)
  File "/usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/dry/__init__.py", line 25, in inner
    return func(*args, **kwargs)
  File "/usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/utils.py", line 203, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/usr/local/miniconda3/envs/jupyter/lib/python3.9/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i /data/kallisto/refdata/human/transcriptome.idx -o /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/h5adoutput/ -x 10XV3 -t 8 /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_R2.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_R1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/50b441e7-4566-4fd8-a0c0-0c6ed64aa487/SRR12506861_I1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_I1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_R2.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/a0ccea38-0801-4df9-a7de-608030b758ed/SRR12506862_R1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_R1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_I1.fastq.gz /data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/96f6bdc3-38dc-470a-8f44-ea6462c01764/SRR12506863_R2.fastq.gz' returned non-zero exit status 1.
[2022-09-21 16:49:29,116]   DEBUG [main] Removing `/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma/h5adoutput/tmp` directory
Yenaled commented 1 year ago

10XV3 requires 2 files (the first file contains the barcode+UMI while the second file contains the actual biological read).

You should figure out what exactly is in your FASTQ files before blindly throwing them into the kb count command (i.e. I don't know why you have 9 files and why some are labeled R1 and others are labeled I1).
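
A minimal sketch of the point above, assuming the `_R1`/`_R2`/`_I1` suffixes in the file names really reflect the file contents (worth verifying first). The helper name `paired_fastqs` and the file-name pattern are hypothetical and not part of kb_python. It drops the index files and orders the rest as consecutive R1/R2 pairs, so the number of FASTQs is a multiple of the 2 files that 10XV3 expects. The `kb count` options are the ones from the command in the original post.

```python
import re
from collections import defaultdict

# Hypothetical pattern: relies only on the _R1/_R2/_I1/_I2 suffix in the file name.
READ_LABEL = re.compile(r"^(?P<sample>.+)_(?P<label>R1|R2|I1|I2)\.fastq(\.gz)?$")


def paired_fastqs(paths):
    """Return [R1, R2, R1, R2, ...] per sample, skipping index reads (I1/I2)."""
    by_sample = defaultdict(dict)
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        match = READ_LABEL.match(name)
        if match is None:
            raise ValueError(f"unrecognised FASTQ name: {name}")
        by_sample[match["sample"]][match["label"]] = path

    ordered = []
    for sample in sorted(by_sample):
        reads = by_sample[sample]
        if "R1" not in reads or "R2" not in reads:
            raise ValueError(f"sample {sample} is missing R1 or R2")
        # Barcode+UMI file first, biological read second.
        ordered += [reads["R1"], reads["R2"]]
    return ordered


base = "/data/kallisto/fastqs/AdaptiveNKCellsInMultipleMyeloma"
run = f"{base}/50b441e7-4566-4fd8-a0c0-0c6ed64aa487"
fastqs = [
    f"{run}/SRR12506861_R2.fastq.gz",
    f"{run}/SRR12506861_R1.fastq.gz",
    f"{run}/SRR12506861_I1.fastq.gz",
]

command = [
    "kb", "count",
    "-i", "/data/kallisto/refdata/human/transcriptome.idx",
    "-g", "/data/kallisto/refdata/human/transcripts_to_genes.txt",
    "-x", "10XV3",
    "-o", f"{base}/h5adoutput/",
    *paired_fastqs(fastqs),
]
print(" ".join(command))
```

The printed command lists only `SRR12506861_R1.fastq.gz` followed by `SRR12506861_R2.fastq.gz`; adding the other two runs to `fastqs` would append their R1/R2 pairs in the same order.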

monoplasty commented 1 year ago

Thank you very much for your prompt reply! I added a parameter (--strand unstranded) and it worked, although I don't know why.

monoplasty commented 1 year ago

I am using the data files from this link: https://data.humancellatlas.org/explore/projects/2eb4f5f8-42a5-4368-aa2d-337bacb96197 . I don't know why there is a file labeled I1. Can you help me analyze it? Thank you very much!

nrclaudio commented 1 year ago

If I were you, I'd do: zcat SRRID | head -n 10 to figure out what each file in your folder is.

For instance, index reads usually contain 8 bp, read 1 around 28 bp, and read 2 91 bp. For kallisto | bustools you only need Read 1 and Read 2; if you include the index (I1), it will throw this error.
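
The same check can be scripted instead of eyeballing `zcat` output. This is a rough sketch: the function names and the length cutoffs (10 and 30 bp) are assumptions chosen around the typical lengths mentioned above, so treat the guess as a hint rather than a definitive classification.

```python
import gzip
from itertools import islice


def read_lengths(fastq_gz, n_reads=1000):
    """Return the sequence lengths of the first n_reads records of a gzipped FASTQ."""
    lengths = []
    with gzip.open(fastq_gz, "rt") as handle:
        # A FASTQ record spans 4 lines; the sequence is the 2nd line of each record.
        for i, line in enumerate(islice(handle, 4 * n_reads)):
            if i % 4 == 1:
                lengths.append(len(line.rstrip("\n")))
    return lengths


def guess_role(lengths):
    """Guess what a file holds from its most common read length (assumed cutoffs)."""
    if not lengths:
        return "empty"
    typical = max(set(lengths), key=lengths.count)
    if typical <= 10:
        return "index (I1)"
    if typical <= 30:
        return "barcode+UMI (R1)"
    return "biological read (R2)"
```

For example, `guess_role(read_lengths("SRR12506861_I1.fastq.gz"))` could be run on each of the nine files to decide which ones belong in the `kb count` command.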

monoplasty commented 1 year ago

Thanks for your advice.
