pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
141 stars 23 forks source link

kb-python processing significantly less reads #241

Closed Celine-Serry closed 3 months ago

Celine-Serry commented 4 months ago

Describe the issue

When I was using kb-python to produce an index and t2g file and to align my reads to this index, then I find that I only processed 3.5 million reads, whereas the tutorial I am following processes 66 million reads, to get the data i've used: https://drive.google.com/drive/folders/1DbLRO4kv-y3W06adFR26RdSaDPmfB4UA

Ive used both reference genome and cDNA of v108 and v111, but with both I get this result.

I am using kb-python (v. 0.28.2) through Ubuntu, wls2, on a Windows11 machine.

What is the exact command that was run? To make the index/t2g:

kb ref -i Homo_sapiens.GRCh38.cdna.all.index --overwrite -g t2g.txt -f1 Homo_sapiens.GRCh38.cdna.all.fa Homo_sapiens.GRCh38.dna.primary_assembly.fa Homo_sapiens.GRCh38.111.gtf

To do read alignment:

kb count pbmc_1k_v3_S1_mergedLanes_R1.fastq.gz pbmc_1k_v3_S1_mergedLanes_R2.fastq.gz -i Homo_sapiens.GRCh38.cdna.all.index -x 10XV3 -g t2g.txt -t 8 --cellranger

Command output (with --verbose flag) No errors, the run worked fine, it just did not process all of the reads (which I find out when checking quality control stats in R):

after running kb-ref:

[2024-02-23 16:45:06,244]    INFO [ref] Preparing Homo_sapiens.GRCh38.dna.primary_assembly.fa, Homo_sapiens.GRCh38.111.gtf
[2024-02-23 16:46:20,650]    INFO [ref] Splitting genome Homo_sapiens.GRCh38.dna.primary_assembly.fa into cDNA at /mnt/e/test_2_23_2024_v111/tmp/tmpj2_ba91b
[2024-02-23 16:47:59,948]    INFO [ref] Concatenating 1 cDNAs to Homo_sapiens.GRCh38.cdna.all.fa
[2024-02-23 16:48:15,692]    INFO [ref] Creating transcript-to-gene mapping at t2g.txt
[2024-02-23 16:48:23,078]    INFO [ref] Indexing Homo_sapiens.GRCh38.cdna.all.fa to Homo_sapiens.GRCh38.cdna.all.index

after running kb-count:

[2024-02-23 17:25:46,459]    INFO [count] Using index Homo_sapiens.GRCh38.cdna.all.index to generate BUS file to . from
[2024-02-23 17:25:46,459]    INFO [count]         pbmc_1k_v3_S1_mergedLanes_R1.fastq.gz
[2024-02-23 17:25:46,459]    INFO [count]         pbmc_1k_v3_S1_mergedLanes_R2.fastq.gz
[2024-02-23 17:26:21,027]    INFO [count] Sorting BUS file ./output.bus to ./tmp/output.s.bus
[2024-02-23 17:26:24,137]    INFO [count] On-list not provided
[2024-02-23 17:26:24,137]    INFO [count] Copying pre-packaged 10XV3 on-list to .
[2024-02-23 17:26:27,417]    INFO [count] Inspecting BUS file ./tmp/output.s.bus
[2024-02-23 17:26:33,232]    INFO [count] Correcting BUS records in ./tmp/output.s.bus to ./tmp/output.s.c.bus with on-list ./10x_version3_whitelist.txt
[2024-02-23 17:26:49,066]    INFO [count] Sorting BUS file ./tmp/output.s.c.bus to ./output.unfiltered.bus
[2024-02-23 17:26:51,931]    INFO [count] Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file ./output.unfiltered.bus
[2024-02-23 17:26:59,200]    INFO [count] Writing gene names to file ./counts_unfiltered/cells_x_genes.genes.names.txt
[2024-02-23 17:26:59,815] WARNING [count] 22707 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2024-02-23 17:26:59,957]    INFO [count] Writing matrix in cellranger format to ./counts_unfiltered/cellranger

when my supervisor does these exact steps, with the same data and also reference genome v111 (on kb-python v 0.28.2) then she does get 66 million reads processed. Do you have any idea why I am not able to process as much reads on my console while performing the same steps, on the same data, with the same version?

Screenshot 2024-02-23 162238 Screenshot 2024-02-23 104458

Yenaled commented 4 months ago

This is not a kb-python issue; the number of reads processed (reported by kb-python) is just the number of reads in your FASTQ file. You should probably check your R1 and R2 FASTQ files by running:

zcat < name_of_fastq_file_R1.fastq.gz|wc -l
zcat < name_of_fastq_file_R2.fastq.gz|wc -l

to get the number of lines in your FASTQ files.

Celine-Serry commented 4 months ago

zcat < pbmc_1k_v3_S1_mergedLanes_R1.fastq.gz|wc -l returns: 266407548

zcat < pbmc_1k_v3_S1_mergedLanes_R2.fastq.gz|wc -l returns:

gzip: stdin: invalid compressed data--crc error

gzip: stdin: invalid compressed data--length error
14101600
Yenaled commented 4 months ago

Yup, as you can see, your R2 file is incomplete (either you downloaded a partial file, e.g. your internet connection broke as it was downloading, or it somehow got corrupted). Just download that file again.

Celine-Serry commented 4 months ago

Thanks alot! I will try that right now.

Celine-Serry commented 4 months ago

ah after re-downloading the data its 266407548, the same as for R1. Thanks alot for solving this for me!

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days