kb count ERROR for SMART-Seq2 data

DamienTan commented 2 years ago

Describe the issue Hi! I met an error when I run kb count to generate a gene count matrix, the output Error: Number of files (2) does not match number of input files required by technology BULK (4). I was confused that except the input pair-end FASTQs provited for SMART-Seq2, what other input files should I need?

kb-python info

kb_python 0.27.2
kallisto: 0.48.0 (/home/chenxianming/.local/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto)
bustools: 0.41.0 (/home/chenxianming/.local/lib/python3.10/site-packages/kb_python/bins/linux/bustools/bustools)

What is the exact command that was run?

kb count -i /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.cdna.all.release-106_k31.idx \
-g /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.release-106.t2g.txt \
-x SMARTSEQ2 \
--parity paired \
--strand unstranded \
-t 32 \
-o ./kb_genecount \
/data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_*1P.fq.gz /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_*2P.fq.gz

Command output (with --verbose flag)

[2022-06-05 23:36:58,869] WARNING [main] FASTQs were provided for technology `SMARTSEQ2`. Assuming multiplexed samples. For demultiplexed samples, provide a batch textfile.
[2022-06-05 23:37:00,982]    INFO [count] Using index /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.cdna.all.release-106_k31.idx to generate BUS file to ./kb_genecount from
[2022-06-05 23:37:00,983]    INFO [count]         /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_galore_trimmomatic_1P.fq.gz
[2022-06-05 23:37:00,983]    INFO [count]         /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_galore_trimmomatic_2P.fq.gz
[2022-06-05 23:37:02,088]   ERROR [count] 
Error: Number of files (2) does not match number of input files required by technology BULK (4)
kallisto 0.48.0
Generates BUS files for single-cell sequencing

Usage: kallisto bus [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
pseudoalignment
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-x, --technology=STRING       Single-cell technology used
-l, --list                    List all single-cell technologies supported
-B, --batch=FILE              Process files listed in FILE
-t, --threads=INT             Number of threads to use (default: 1)
-b, --bam                     Input file is a BAM file
-n, --num                     Output number of read in flag column (incompatible with --bam)
-T, --tag=STRING              5′ tag sequence to identify UMI reads for certain technologies
--fr-stranded             Strand specific reads for UMI-tagged reads, first read forward
--rf-stranded             Strand specific reads for UMI-tagged reads, first read reverse
--unstranded              Treat all read as non-strand-specific
--paired                  Treat reads as paired
--genomebam               Project pseudoalignments to genome sorted BAM file
-g, --gtf                     GTF file for transcriptome information
(required for --genomebam)
-c, --chromosomes             Tab separated file with chromosome names and lengths
(optional for --genomebam, but recommended)
--verbose                 Print out progress information every 1M proccessed reads
[2022-06-05 23:37:02,089]   ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/data/bio-tools/anaconda3/envs/lncrna/lib/python3.10/site-packages/kb_python/main.py", line 1301, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/data/bio-tools/anaconda3/envs/lncrna/lib/python3.10/site-packages/kb_python/main.py", line 538, in parse_count
    count(
  File "/data/bio-tools/anaconda3/envs/lncrna/lib/python3.10/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/data/bio-tools/anaconda3/envs/lncrna/lib/python3.10/site-packages/kb_python/count.py", line 1274, in count
    bus_result = kallisto_bus(
  File "/data/bio-tools/anaconda3/envs/lncrna/lib/python3.10/site-packages/kb_python/validate.py", line 116, in inner
    results = func(*args, **kwargs)
  File "/data/bio-tools/anaconda3/envs/lncrna/lib/python3.10/site-packages/kb_python/count.py", line 190, in kallisto_bus
    run_executable(command)
  File "/data/bio-tools/anaconda3/envs/lncrna/lib/python3.10/site-packages/kb_python/dry/__init__.py", line 25, in inner
    return func(*args, **kwargs)
  File "/data/bio-tools/anaconda3/envs/lncrna/lib/python3.10/site-packages/kb_python/utils.py", line 203, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/data/bio-tools/anaconda3/envs/lncrna/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.cdna.all.release-106_k31.idx -o ./kb_genecount -x BULK -t 32 --paired --unstranded /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_galore_trimmomatic_1P.fq.gz /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_galore_trimmomatic_2P.fq.gz' returned non-zero exit status 1.

Thank you!

Yenaled commented 2 years ago

Hi, for smart-seq2, you should provide a batch file instead of FASTQ files. The batch file should be formatted as:

cell1 cell1_1.fastq.gz cell1_1.fastq.gz
cell2 cell2_1.fastq.gz cell2_1.fastq.gz
cell3 cell3_1.fastq.gz cell3_1.fastq.gz

Without a batch file, it asks for four files because it's assuming everything is in multiplexed format (with the cell barcodes in an I1.fastq and I2.fastq file and your paired-end reads in an R1.fastq and R2.fastq file).

Yenaled commented 2 years ago

Basically, replace your /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_*1P.fq.gz /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_*2P.fq.gz with a batch.txt file.

DamienTan commented 2 years ago

Basically, replace your /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_*1P.fq.gz /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_*2P.fq.gz with a batch.txt file.

@Yenaled Very appreciate for your prompt reply! You means that I should generate a batch file which includes three columns, and my input files are not FASTQs, but a batch file, such as

kb count -i /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.cdna.all.release-106_k31.idx \
-g /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.release-106.t2g.txt \
-x SMARTSEQ2 \
--parity paired \
--strand unstranded \
-t 32 \
-o ./kb_genecount \
/data/results/cxm/lncrna/kb-python/kb_ref/batch.txt

Is that right?

Yenaled commented 2 years ago

Correct

DamienTan commented 2 years ago

Correct

Awesome! Very appreciate!!

DamienTan commented 2 years ago

@Yenaled Sir, I have another problem. When I finished running the kb count command, and the output files are here.

$ ls -lR
.:
total 459280
drwxrwxr-x. 2 cxm cxm      4096 Jun  6 13:41 counts_unfiltered
-rw-rw-r--. 1 cxm cxm    342595 Jun  6 13:41 flens.txt
-rw-rw-r--. 1 cxm cxm  79993433 Jun  6 13:41 index.saved
-rw-rw-r--. 1 cxm cxm       287 Jun  6 13:41 inspect.json
-rw-rw-r--. 1 cxm cxm      1967 Jun  6 13:41 kb_info.json
-rw-rw-r--. 1 cxm cxm      2550 Jun  6 13:41 matrix.barcodes
-rw-rw-r--. 1 cxm cxm      1365 Jun  6 13:41 matrix.cells
-rw-rw-r--. 1 cxm cxm 101495311 Jun  6 13:41 matrix.ec
-rw-rw-r--. 1 cxm cxm 283983313 Jun  6 13:41 output.bus
-rw-rw-r--. 1 cxm cxm       601 Jun  6 13:41 run_info.json
-rw-rw-r--. 1 cxm cxm   4442232 Jun  6 13:41 transcripts.txt

./counts_unfiltered:
total 15632
-rw-rw-r--. 1 cxm cxm     2550 Jun  6 13:41 cells_x_genes.barcodes.txt
-rw-rw-r--. 1 cxm cxm  1122200 Jun  6 13:41 cells_x_genes.genes.txt
-rw-rw-r--. 1 cxm cxm 14880367 Jun  6 13:41 cells_x_genes.mtx

But I didn't find a gene count expression matrix like gene x cell format in the directory counts_unfiltered. There is some details in file cells_x_genes.mtx

%%MatrixMarket matrix coordinate real general
%
%                                              
150 61552 1226714
1 6 296
1 12 36
1 20 4
1 57 2
1 63 3
1 70 37
1 72 1
1 74 1
1 77 4
1 78 1036
1 92 1
1 119 3
1 138 1
1 164 14
1 186 4
1 202 1
1 203 7
1 212 303
1 231 141
1 246 4160
1 259 5445

I don't know what was wrong with my commands, so frustrated about that.

Yenaled commented 2 years ago

That cells_x_genes.mtx file is your gene count expression matrix. There are 61552 genes x 150 cells. You can read that .mtx file into python or R following the tutorials on www.kallistobus.tools

Also, another tip: When using kb count on smart-seq2 data, I recommend adding the --tcc option. In that way, multimapping reads are handled properly (rather than discarded) and you'll also get transcript-level expression. (You obviously wouldn't use the --tcc option for 10X data because 10X data is UMI end-tagged so you can't get transcript-level abundances anyway).

DamienTan commented 2 years ago

That cells_x_genes.mtx file is your gene count expression matrix. There are 61552 genes x 150 cells. You can read that .mtx file into python or R following the tutorials on www.kallistobus.tools

Also, another tip: When using kb count on smart-seq2 data, I recommend adding the --tcc option. In that way, multimapping reads are handled properly (rather than discarded) and you'll also get transcript-level expression. (You obviously wouldn't use the --tcc option for 10X data because 10X data is UMI end-tagged so you can't get transcript-level abundances anyway).

Yep! I also want to get the transcript-level expression(or in other words, I just only want to get transcript-level expression for my downstream ananlysis such as differential expression transcripts). It is more meaningful for my analysis than gene-level expression. I'll try to add --tcc option immediately. The commands below is ok?

kb count --tcc \
-i /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.cdna.all.release-106_k31.idx \
-g /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.release-106.t2g.txt \
-x SMARTSEQ2 \
--parity paired \
--strand unstranded \
-t 32 \
-o ./kb_tcc \
/data/results/cxm/lncrna/kb-python/kb_ref/batch.txt

You are actually my deliverer!!!

Yenaled commented 2 years ago

Yup, that looks great to me :)

DamienTan commented 2 years ago

Yup, that looks great to me :)

@Yenaled Sir, sorry again to ask you for help. After I added --tcc option, I got the output files.

$ ls -lR
.:
total 459284
drwxrwxr-x. 2 cxm cxm      4096 Jun  6 18:14 counts_unfiltered
-rw-rw-r--. 1 cxm cxm    342595 Jun  6 18:14 flens.txt
-rw-rw-r--. 1 cxm cxm  79993433 Jun  6 18:14 index.saved
-rw-rw-r--. 1 cxm cxm       287 Jun  6 18:14 inspect.json
-rw-rw-r--. 1 cxm cxm      2180 Jun  6 18:18 kb_info.json
-rw-rw-r--. 1 cxm cxm      2550 Jun  6 18:14 matrix.barcodes
-rw-rw-r--. 1 cxm cxm      1365 Jun  6 18:14 matrix.cells
-rw-rw-r--. 1 cxm cxm 101495311 Jun  6 18:14 matrix.ec
-rw-rw-r--. 1 cxm cxm 283983313 Jun  6 18:14 output.bus
drwxrwxr-x. 2 cxm cxm      4096 Jun  6 18:18 quant_unfiltered
-rw-rw-r--. 1 cxm cxm       589 Jun  6 18:14 run_info.json
-rw-rw-r--. 1 cxm cxm   4442232 Jun  6 18:14 transcripts.txt

./counts_unfiltered:
total 215160
-rw-rw-r--. 1 cxm cxm      2550 Jun  6 18:14 cells_x_tcc.barcodes.txt
-rw-rw-r--. 1 cxm cxm 101495311 Jun  6 18:14 cells_x_tcc.ec.txt
-rw-rw-r--. 1 cxm cxm 118811177 Jun  6 18:14 cells_x_tcc.mtx

./quant_unfiltered:
total 123836
-rw-rw-r--. 1 cxm cxm  1122200 Jun  6 18:18 genes.txt
-rw-rw-r--. 1 cxm cxm 18939580 Jun  6 18:18 matrix.abundance.gene.mtx
-rw-rw-r--. 1 cxm cxm 23726189 Jun  6 18:18 matrix.abundance.gene.tpm.mtx
-rw-rw-r--. 1 cxm cxm 36708261 Jun  6 18:18 matrix.abundance.mtx
-rw-rw-r--. 1 cxm cxm 41854013 Jun  6 18:18 matrix.abundance.tpm.mtx
-rw-rw-r--. 1 cxm cxm     2860 Jun  6 18:18 matrix.fld.tsv
-rw-rw-r--. 1 cxm cxm  4442232 Jun  6 18:14 transcripts.txt

I am puzzled about the differences between counts_unfiltered and quant_unfiltered these two subdirectories. Subdirecory counts_unfiltered contains three files which I guess are the same as the output without add --tcc option but are focus on transcript-level.

here are some details about the cells_x_tcc.mtx

%%MatrixMarket matrix coordinate real general
%
%                                            
150 1440216 8874477
1 16 296
1 31 36
1 64 4
1 268 37
1 278 1
1 280 1
1 285 4
1 288 1036
1 315 1
1 424 3
1 504 1

About quant_unfiltered, It was generated after counts_unfiltered and included more files

for matrix.abundance.gene.mtx

%%MatrixMarket matrix coordinate real general
150     61552   1333276
1       6       324.207
1       12      46.3732
1       20      4.3029
1       42      6.58285
1       57      2
1       63      3
1       70      43.0189
1       72      1.00487
1       74      1.01721
1       77      4
1       78      1070.98
1       82      2.84005
1       92      1
1       103     2
1       119     3.10936
1       138     1.38036
1       143     0.616418
1       164     18.6205
1       186     4

for matrix.abundance.mtx

%%MatrixMarket matrix coordinate real general
150     246511  2248688
1       16      324.207
1       31      46.3732
1       64      4.3029
1       131     6.58285
1       211     2
1       250     0.422741
1       255     1.62599
1       257     0.951268
1       268     43.0189
1       278     1.00487
1       280     1.01721
1       285     4

These three .mtx files are the same in the first column(150 cells) but different in the second column (genes or transcripts). 246511 in matrix.abundance.mtx means transcripts and I checked in t2g.txt using wc -l t2g.txt. For 61552 in matrix.abundance.gene.mtx, it means genes including protein-coding genes and non-coding genes, etc. And for me, I would not use matrix.abundance.mtx for my downstream transcript-level analysis. But for 1440216 in cells_x_tcc.mtx, I don't know what it means. It is so large! I really don't know which transcript-level .mtx file I should use in the next differential expression analysis, cells_x_tcc.mtx or matrix.abundance.mtx?

Yenaled commented 2 years ago

Don't worry about the cells_x_tcc file -- those are transcript-compatibility counts (and are based off equivalence classes -- read the kallisto paper for more details).

Use matrix.abundance.gene.tpm.mtx (for gene-level) and matrix.abundance.tpm.mtx (for transcript-level).

You want to use the ones with the tpm because TPM is the proper way to length-normalize smart-seq2 data (you can read more about TPM in the many published RNA-seq papers online).

DamienTan commented 2 years ago

Don't worry about the cells_x_tcc file -- those are transcript-compatibility counts (and are based off equivalence classes -- read the kallisto paper for more details).

Use matrix.abundance.gene.tpm.mtx (for gene-level) and matrix.abundance.tpm.mtx (for transcript-level).

You want to use the ones with the tpm because TPM is the proper way to length-normalize smart-seq2 data (you can read more about TPM in the many published RNA-seq papers online).

Roger that! Sir, you said .tpm.mtx files are normalized with TPM. For matrix.abundance.gene.mtx and matrix.abundance.mtx, they are both raw read count matrix at the same time, am I right? Because I want to use the raw count expression matrix as input file for the DESeq2 or SCDE these differential expression packages which ask the input should be a raw count integer matrix. And I know I should use round() function to preprocess the kallisto output. unnormalized raw count data for single-cell RNA seq may obeys negative binomial distribution or zero-inflated negative binomial distribution, But I thouht normalized data would not obey these distributions, please correct my misunderstandings.

Yenaled commented 2 years ago

If you need the raw counts for downstream analysis (e.g. DESeq2), then use the other matrix files (not the TPM ones).

DamienTan commented 2 years ago

If you need the raw counts for downstream analysis (e.g. DESeq2), then use the other matrix files (not the TPM ones).

OK! Thank you sir, I know what I should do next. So grateful for you to give me a lot of help :-D

DamienTan commented 2 years ago

Dear @Yenaled I had another question when I read the matrix.abundance.mtx into R following the tutorial on https://www.kallistobus.tools/tutorials/kb_getting_started/r/kb_intro_2_r/ funtion read_count_output can return a 2D matrix if I input .mtx, .barcodes.txt and .genes.txt

read_count_output <- function(dir, name) {
  dir <- normalizePath(dir, mustWork = TRUE)
  m <- readMM(paste0(dir, "/", name, ".mtx"))
  m <- Matrix::t(m)
  m <- as(m, "dgCMatrix")
  # The matrix read has cells in rows
  ge <- ".genes.txt"
  genes <- readLines(file(paste0(dir, "/", name, ge)))
  barcodes <- readLines(file(paste0(dir, "/", name, ".barcodes.txt")))
  colnames(m) <- barcodes
  rownames(m) <- genes
  return(m)
}

However, under the folder quant_unfiltered, I can't find .barcodes.txt and .genes.txt. They are both under other folder counts_unfiltered. I notices that the tutorials are all for the gene counts mtx. Are there other tutorials aim at how read the transcript-level counts matrix into R and generate a primary 2D expression matrix like this?

                cell1  cell2  cell3 ...  cell150
transcript ID1
transcript ID2
transcript ID3
...
transcript ID246511

Look forward to your reply!

Yenaled commented 2 years ago

There is a transcripts.txt in the quant_unfiltered folder.

For the barcodes.txt, you can use the one in the counts_unfiltered folder (there are technically no barcodes in smart-seq2 data; so kallisto just generates a bunch of fake unique barcodes).

There are no tutorials for transcript analysis currently; this is because most single-cell technologies (e.g. 10X) are more suitable for gene-level analysis.

You can customize the R script from the tutorial to make that matrix though.

Just so you are aware, a dense matrix of size 150 x 246511 will be fairly large.

DamienTan commented 2 years ago

There is a transcripts.txt in the quant_unfiltered folder.

For the barcodes.txt, you can use the one in the counts_unfiltered folder (there are technically no barcodes in smart-seq2 data; so kallisto just generates a bunch of fake unique barcodes).

There are no tutorials for transcript analysis currently; this is because most single-cell technologies (e.g. 10X) are more suitable for gene-level analysis.

You can customize the R script from the tutorial to make that matrix though.

Just so you are aware, a dense matrix of size 150 x 246511 will be fairly large.

Thanks for your quick reply! I renamed matrix.abundance.mtx, cells_x_tcc.barcodes.txt and trasncripts.txt to cells_x_trans.mtx , cells_x_trans.barcodes.txt and cells_x_trans.transcripts.txt respectively. And I slightly modified the function read_count_output

read_count_output <- function(dir, name) {
  dir <- normalizePath(dir, mustWork = TRUE)
  m <- readMM(paste0(dir, "/", name, ".mtx"))
  m <- Matrix::t(m)
  m <- as(m, "dgCMatrix")
  # The matrix read has cells in rows
  tr <- ".transcripts.txt"
  transcripts <- readLines(file(paste0(dir, "/", name, tr)))
  barcodes <- readLines(file(paste0(dir, "/", name, ".barcodes.txt")))
  colnames(m) <- barcodes
  rownames(m) <- transcripts
  return(m)
}

Everything seems all rignt before I run the next step code res_mat <- read_count_output("quant_unfiltered", name = "cells_x_trans") the R terminal console return

Warning messages:
1: In for (i in seq_along(snames)) { :
  closing unused connection 5 (E:\Work\cxm_lnc\kb_tcc\quant_unfiltered/cells_x_trans.barcodes.txt)   
2: In for (i in seq_along(snames)) { :
  closing unused connection 4 (E:\Work\cxm_lnc\kb_tcc\quant_unfiltered/cells_x_trans.transcripts.txt)
3: In for (i in seq_along(snames)) { :
  closing unused connection 3 (E:\Work\cxm_lnc\kb_tcc\quant_unfiltered/cells_x_trans.transcripts.txt)

What is wrong with it? Could you give me some advice to find out the reason, sir?

Yenaled commented 2 years ago

I'm not sure -- those are just warning messages so see your function is returning the matrix by trying to print it out.

Also double check that those files in the warning messages actually exist.

DamienTan commented 2 years ago

I'm not sure -- those are just warning messages so see your function is returning the matrix by trying to print it out.

Also double check that those files in the warning messages actually exist.

I try to print this res_mat matrix out, it works. That is inspiring!

dim(res_mat)
[1] 246511    150

There is one thing I want to confirm -- the .barcodes.txt is not the real barcodes. But this file contains 150 rows (it means one row corresponds one cell). In other words, barcodes.txt match the first column in batch file which as one of the input files at kb count step, do I understand correctly?

AAAAAAAAAAAAAAAA    2C-1
AAAAAAAAAAAAAAAC    2C-10 
AAAAAAAAAAAAAAAG    2C-2
AAAAAAAAAAAAAAAT    2C-3
AAAAAAAAAAAAAACA    2C-4
AAAAAAAAAAAAAACC    2C-5
AAAAAAAAAAAAAACG    2C-6
AAAAAAAAAAAAAACT    2C-7
AAAAAAAAAAAAAAGA    2C-8
AAAAAAAAAAAAAAGC    2C-9
...                 ...
and so on

Yenaled commented 2 years ago

Yes, that is correct

DamienTan commented 2 years ago

Yes, that is correct

Very nice! Sir, about differential transcript expression (DTE) analysis, would you have some advice? the normal differential expression packages like DESeq2 or SCDE may ask a gene counts matrix as an input in their tutorials. However they do not talk about whether a transcript count matrix is suitable. It is hard for me to decide if I can use these packages or what packages should I use for DTE analysis. Could you please point me publications regarding this? Thanks a lot!

Yenaled commented 2 years ago

I'll be honest: I don't know too much about differential transcript expression. I can handle most kallisto issues to generate a count matrix, but downstream differential analysis (especially at the transcript-level) is a bit outside my field of expertise. I recommend asking elsewhere for advice on it.

DamienTan commented 2 years ago

I'll be honest: I don't know too much about differential transcript expression. I can handle most kallisto issues to generate a count matrix, but downstream differential analysis (especially at the transcript-level) is a bit outside my field of expertise. I recommend asking elsewhere for advice on it.

That's fine! I'll find some publications about DTE by myself or ask others who know this for some advice. Still appreciate you giving me so much help. \^o^/

x1han commented 2 years ago

@Yenaled hi, sir. thanks for directing me here and i've got some results with kb count -i macaca_fascicularis.idx -g t2g.txt -x SMARTSEQ2 --loom --parity paired -o kb/ samplemeta.txt and some information on the screen [2022-07-07 17:23:38,298] INFO [count] Using index /data1/luofc/hanx/database/kb-reference/macaca_fascicularis/macaca_fascicularis.idx to generate BUS file to /data1/luofc/202206embryo/HUAIXIAOMA/kb/ from [2022-07-07 17:23:38,298] INFO [count] /data1/luofc/202206embryo/HUAIXIAOMA/kb/tmp/tmpcj3k772m [2022-07-07 21:48:51,894] INFO [count] Sorting BUS file /data1/luofc/202206embryo/HUAIXIAOMA/kb/output.bus to /data1/luofc/202206embryo/HUAIXIAOMA/kb/tmp/output.s.bus [2022-07-07 21:48:56,936] INFO [count] Inspecting BUS file /data1/luofc/202206embryo/HUAIXIAOMA/kb/tmp/output.s.bus [2022-07-07 21:48:58,053] INFO [count] Generating count matrix /data1/luofc/202206embryo/HUAIXIAOMA/kb/counts_unfiltered/cells_x_genes from BUS file /data1/luofc/202206embryo/HUAIXIAOMA/kb/tmp/output.s.bus [2022-07-07 21:49:00,910] INFO [count] Reading matrix /data1/luofc/202206embryo/HUAIXIAOMA/kb/counts_unfiltered/cells_x_genes.mtx [2022-07-07 21:49:02,270] WARNING [count] 5908 gene IDs do not have corresponding gene names. These genes will use their gene IDs instead. [2022-07-07 21:49:02,276] INFO [count] Writing matrix to loom /data1/luofc/202206embryo/HUAIXIAOMA/kb/counts_unfiltered/adata.loom

my question is that i've aligned the fastqs with HISAT2 and got very low alignment rate(about 20%), and i wanna known the alignment rate of kb. i checked the files in the kb/ ├── counts_unfiltered │ ├── adata.loom │ ├── cells_x_genes.barcodes.txt │ ├── cells_x_genes.genes.txt │ └── cells_x_genes.mtx ├── flens.txt ├── index.saved ├── inspect.json ├── kb_info.json ├── matrix.barcodes ├── matrix.cells ├── matrix.ec ├── output.bus ├── run_info.json └── transcripts.txt it seems that no files can i get the information of alignment rate, would you please give some advice for me to get the alignment rate?

thanks!!!

Yenaled commented 2 years ago

^ look into the JSON files -- run_info.json gives you the alignment rate

pachterlab / kb_python

kb count ERROR for SMART-Seq2 data #167