Closed DamienTan closed 2 years ago
Hi, for smart-seq2, you should provide a batch file instead of FASTQ files. The batch file should be formatted as:
cell1 cell1_1.fastq.gz cell1_1.fastq.gz
cell2 cell2_1.fastq.gz cell2_1.fastq.gz
cell3 cell3_1.fastq.gz cell3_1.fastq.gz
Without a batch file, it asks for four files because it's assuming everything is in multiplexed format (with the cell barcodes in an I1.fastq and I2.fastq file and your paired-end reads in an R1.fastq and R2.fastq file).
Basically, replace your /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_*1P.fq.gz /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_*2P.fq.gz
with a batch.txt
file.
Basically, replace your
/data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_*1P.fq.gz /data/results/cxm/lncrna/data/trimgalore_trimmomatic_data/2C-1_*2P.fq.gz
with abatch.txt
file.
@Yenaled Very appreciate for your prompt reply! You means that I should generate a batch file which includes three columns, and my input files are not FASTQs, but a batch file, such as
kb count -i /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.cdna.all.release-106_k31.idx \
-g /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.release-106.t2g.txt \
-x SMARTSEQ2 \
--parity paired \
--strand unstranded \
-t 32 \
-o ./kb_genecount \
/data/results/cxm/lncrna/kb-python/kb_ref/batch.txt
Is that right?
Correct
Correct
Awesome! Very appreciate!!
@Yenaled Sir, I have another problem. When I finished running the kb count
command, and the output files are here.
$ ls -lR
.:
total 459280
drwxrwxr-x. 2 cxm cxm 4096 Jun 6 13:41 counts_unfiltered
-rw-rw-r--. 1 cxm cxm 342595 Jun 6 13:41 flens.txt
-rw-rw-r--. 1 cxm cxm 79993433 Jun 6 13:41 index.saved
-rw-rw-r--. 1 cxm cxm 287 Jun 6 13:41 inspect.json
-rw-rw-r--. 1 cxm cxm 1967 Jun 6 13:41 kb_info.json
-rw-rw-r--. 1 cxm cxm 2550 Jun 6 13:41 matrix.barcodes
-rw-rw-r--. 1 cxm cxm 1365 Jun 6 13:41 matrix.cells
-rw-rw-r--. 1 cxm cxm 101495311 Jun 6 13:41 matrix.ec
-rw-rw-r--. 1 cxm cxm 283983313 Jun 6 13:41 output.bus
-rw-rw-r--. 1 cxm cxm 601 Jun 6 13:41 run_info.json
-rw-rw-r--. 1 cxm cxm 4442232 Jun 6 13:41 transcripts.txt
./counts_unfiltered:
total 15632
-rw-rw-r--. 1 cxm cxm 2550 Jun 6 13:41 cells_x_genes.barcodes.txt
-rw-rw-r--. 1 cxm cxm 1122200 Jun 6 13:41 cells_x_genes.genes.txt
-rw-rw-r--. 1 cxm cxm 14880367 Jun 6 13:41 cells_x_genes.mtx
But I didn't find a gene count expression matrix like gene
x cell
format in the directory counts_unfiltered
. There is some details in file cells_x_genes.mtx
%%MatrixMarket matrix coordinate real general
%
%
150 61552 1226714
1 6 296
1 12 36
1 20 4
1 57 2
1 63 3
1 70 37
1 72 1
1 74 1
1 77 4
1 78 1036
1 92 1
1 119 3
1 138 1
1 164 14
1 186 4
1 202 1
1 203 7
1 212 303
1 231 141
1 246 4160
1 259 5445
I don't know what was wrong with my commands, so frustrated about that.
That cells_x_genes.mtx file is your gene count expression matrix. There are 61552 genes x 150 cells. You can read that .mtx file into python or R following the tutorials on www.kallistobus.tools
Also, another tip: When using kb count on smart-seq2 data, I recommend adding the --tcc option. In that way, multimapping reads are handled properly (rather than discarded) and you'll also get transcript-level expression. (You obviously wouldn't use the --tcc option for 10X data because 10X data is UMI end-tagged so you can't get transcript-level abundances anyway).
That cells_x_genes.mtx file is your gene count expression matrix. There are 61552 genes x 150 cells. You can read that .mtx file into python or R following the tutorials on www.kallistobus.tools
Also, another tip: When using kb count on smart-seq2 data, I recommend adding the --tcc option. In that way, multimapping reads are handled properly (rather than discarded) and you'll also get transcript-level expression. (You obviously wouldn't use the --tcc option for 10X data because 10X data is UMI end-tagged so you can't get transcript-level abundances anyway).
Yep! I also want to get the transcript-level expression(or in other words, I just only want to get transcript-level expression for my downstream ananlysis such as differential expression transcripts). It is more meaningful for my analysis than gene-level expression. I'll try to add --tcc
option immediately. The commands below is ok?
kb count --tcc \
-i /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.cdna.all.release-106_k31.idx \
-g /data/results/cxm/lncrna/kb-python/kb_ref/Homo_sapiens.GRCh38.release-106.t2g.txt \
-x SMARTSEQ2 \
--parity paired \
--strand unstranded \
-t 32 \
-o ./kb_tcc \
/data/results/cxm/lncrna/kb-python/kb_ref/batch.txt
You are actually my deliverer!!!
Yup, that looks great to me :)
Yup, that looks great to me :)
@Yenaled Sir, sorry again to ask you for help. After I added --tcc
option, I got the output files.
$ ls -lR
.:
total 459284
drwxrwxr-x. 2 cxm cxm 4096 Jun 6 18:14 counts_unfiltered
-rw-rw-r--. 1 cxm cxm 342595 Jun 6 18:14 flens.txt
-rw-rw-r--. 1 cxm cxm 79993433 Jun 6 18:14 index.saved
-rw-rw-r--. 1 cxm cxm 287 Jun 6 18:14 inspect.json
-rw-rw-r--. 1 cxm cxm 2180 Jun 6 18:18 kb_info.json
-rw-rw-r--. 1 cxm cxm 2550 Jun 6 18:14 matrix.barcodes
-rw-rw-r--. 1 cxm cxm 1365 Jun 6 18:14 matrix.cells
-rw-rw-r--. 1 cxm cxm 101495311 Jun 6 18:14 matrix.ec
-rw-rw-r--. 1 cxm cxm 283983313 Jun 6 18:14 output.bus
drwxrwxr-x. 2 cxm cxm 4096 Jun 6 18:18 quant_unfiltered
-rw-rw-r--. 1 cxm cxm 589 Jun 6 18:14 run_info.json
-rw-rw-r--. 1 cxm cxm 4442232 Jun 6 18:14 transcripts.txt
./counts_unfiltered:
total 215160
-rw-rw-r--. 1 cxm cxm 2550 Jun 6 18:14 cells_x_tcc.barcodes.txt
-rw-rw-r--. 1 cxm cxm 101495311 Jun 6 18:14 cells_x_tcc.ec.txt
-rw-rw-r--. 1 cxm cxm 118811177 Jun 6 18:14 cells_x_tcc.mtx
./quant_unfiltered:
total 123836
-rw-rw-r--. 1 cxm cxm 1122200 Jun 6 18:18 genes.txt
-rw-rw-r--. 1 cxm cxm 18939580 Jun 6 18:18 matrix.abundance.gene.mtx
-rw-rw-r--. 1 cxm cxm 23726189 Jun 6 18:18 matrix.abundance.gene.tpm.mtx
-rw-rw-r--. 1 cxm cxm 36708261 Jun 6 18:18 matrix.abundance.mtx
-rw-rw-r--. 1 cxm cxm 41854013 Jun 6 18:18 matrix.abundance.tpm.mtx
-rw-rw-r--. 1 cxm cxm 2860 Jun 6 18:18 matrix.fld.tsv
-rw-rw-r--. 1 cxm cxm 4442232 Jun 6 18:14 transcripts.txt
I am puzzled about the differences between counts_unfiltered
and quant_unfiltered
these two subdirectories. Subdirecory counts_unfiltered
contains three files which I guess are the same as the output without add --tcc
option but are focus on transcript-level.
here are some details about the cells_x_tcc.mtx
%%MatrixMarket matrix coordinate real general
%
%
150 1440216 8874477
1 16 296
1 31 36
1 64 4
1 268 37
1 278 1
1 280 1
1 285 4
1 288 1036
1 315 1
1 424 3
1 504 1
About quant_unfiltered
, It was generated after counts_unfiltered
and included more files
for matrix.abundance.gene.mtx
%%MatrixMarket matrix coordinate real general
150 61552 1333276
1 6 324.207
1 12 46.3732
1 20 4.3029
1 42 6.58285
1 57 2
1 63 3
1 70 43.0189
1 72 1.00487
1 74 1.01721
1 77 4
1 78 1070.98
1 82 2.84005
1 92 1
1 103 2
1 119 3.10936
1 138 1.38036
1 143 0.616418
1 164 18.6205
1 186 4
for matrix.abundance.mtx
%%MatrixMarket matrix coordinate real general
150 246511 2248688
1 16 324.207
1 31 46.3732
1 64 4.3029
1 131 6.58285
1 211 2
1 250 0.422741
1 255 1.62599
1 257 0.951268
1 268 43.0189
1 278 1.00487
1 280 1.01721
1 285 4
These three .mtx files are the same in the first column(150 cells) but different in the second column (genes or transcripts). 246511
in matrix.abundance.mtx
means transcripts and I checked in t2g.txt
using wc -l t2g.txt
. For 61552
in matrix.abundance.gene.mtx
, it means genes including protein-coding genes and non-coding genes, etc. And for me, I would not use matrix.abundance.mtx for my downstream transcript-level analysis. But for 1440216
in cells_x_tcc.mtx
, I don't know what it means. It is so large! I really don't know which transcript-level .mtx
file I should use in the next differential expression analysis, cells_x_tcc.mtx
or matrix.abundance.mtx
?
Don't worry about the cells_x_tcc file -- those are transcript-compatibility counts (and are based off equivalence classes -- read the kallisto paper for more details).
Use matrix.abundance.gene.tpm.mtx (for gene-level) and matrix.abundance.tpm.mtx (for transcript-level).
You want to use the ones with the tpm because TPM is the proper way to length-normalize smart-seq2 data (you can read more about TPM in the many published RNA-seq papers online).
Don't worry about the cells_x_tcc file -- those are transcript-compatibility counts (and are based off equivalence classes -- read the kallisto paper for more details).
Use matrix.abundance.gene.tpm.mtx (for gene-level) and matrix.abundance.tpm.mtx (for transcript-level).
You want to use the ones with the tpm because TPM is the proper way to length-normalize smart-seq2 data (you can read more about TPM in the many published RNA-seq papers online).
Roger that! Sir, you said .tpm.mtx
files are normalized with TPM. For matrix.abundance.gene.mtx
and matrix.abundance.mtx
, they are both raw read count matrix at the same time, am I right? Because I want to use the raw count expression matrix as input file for the DESeq2
or SCDE
these differential expression packages which ask the input should be a raw count integer matrix. And I know I should use round()
function to preprocess the kallisto output. unnormalized raw count data for single-cell RNA seq may obeys negative binomial distribution
or zero-inflated negative binomial distribution
, But I thouht normalized data would not obey these distributions, please correct my misunderstandings.
If you need the raw counts for downstream analysis (e.g. DESeq2), then use the other matrix files (not the TPM ones).
If you need the raw counts for downstream analysis (e.g. DESeq2), then use the other matrix files (not the TPM ones).
OK! Thank you sir, I know what I should do next. So grateful for you to give me a lot of help :-D
Dear @Yenaled
I had another question when I read the matrix.abundance.mtx
into R following the tutorial on https://www.kallistobus.tools/tutorials/kb_getting_started/r/kb_intro_2_r/
funtion read_count_output
can return a 2D matrix if I input .mtx
, .barcodes.txt
and .genes.txt
read_count_output <- function(dir, name) {
dir <- normalizePath(dir, mustWork = TRUE)
m <- readMM(paste0(dir, "/", name, ".mtx"))
m <- Matrix::t(m)
m <- as(m, "dgCMatrix")
# The matrix read has cells in rows
ge <- ".genes.txt"
genes <- readLines(file(paste0(dir, "/", name, ge)))
barcodes <- readLines(file(paste0(dir, "/", name, ".barcodes.txt")))
colnames(m) <- barcodes
rownames(m) <- genes
return(m)
}
However, under the folder quant_unfiltered
, I can't find .barcodes.txt
and .genes.txt
. They are both under other folder counts_unfiltered
. I notices that the tutorials are all for the gene counts mtx. Are there other tutorials aim at how read the transcript-level counts matrix into R and generate a primary 2D expression matrix like this?
cell1 cell2 cell3 ... cell150
transcript ID1
transcript ID2
transcript ID3
...
transcript ID246511
Look forward to your reply!
There is a transcripts.txt in the quant_unfiltered folder.
For the barcodes.txt, you can use the one in the counts_unfiltered folder (there are technically no barcodes in smart-seq2 data; so kallisto just generates a bunch of fake unique barcodes).
There are no tutorials for transcript analysis currently; this is because most single-cell technologies (e.g. 10X) are more suitable for gene-level analysis.
You can customize the R script from the tutorial to make that matrix though.
Just so you are aware, a dense matrix of size 150 x 246511 will be fairly large.
There is a transcripts.txt in the quant_unfiltered folder.
For the barcodes.txt, you can use the one in the counts_unfiltered folder (there are technically no barcodes in smart-seq2 data; so kallisto just generates a bunch of fake unique barcodes).
There are no tutorials for transcript analysis currently; this is because most single-cell technologies (e.g. 10X) are more suitable for gene-level analysis.
You can customize the R script from the tutorial to make that matrix though.
Just so you are aware, a dense matrix of size 150 x 246511 will be fairly large.
Thanks for your quick reply! I renamed matrix.abundance.mtx
, cells_x_tcc.barcodes.txt
and trasncripts.txt
to cells_x_trans.mtx
, cells_x_trans.barcodes.txt
and cells_x_trans.transcripts.txt
respectively. And I slightly modified the function read_count_output
read_count_output <- function(dir, name) {
dir <- normalizePath(dir, mustWork = TRUE)
m <- readMM(paste0(dir, "/", name, ".mtx"))
m <- Matrix::t(m)
m <- as(m, "dgCMatrix")
# The matrix read has cells in rows
tr <- ".transcripts.txt"
transcripts <- readLines(file(paste0(dir, "/", name, tr)))
barcodes <- readLines(file(paste0(dir, "/", name, ".barcodes.txt")))
colnames(m) <- barcodes
rownames(m) <- transcripts
return(m)
}
Everything seems all rignt before I run the next step code
res_mat <- read_count_output("quant_unfiltered", name = "cells_x_trans")
the R terminal console return
Warning messages:
1: In for (i in seq_along(snames)) { :
closing unused connection 5 (E:\Work\cxm_lnc\kb_tcc\quant_unfiltered/cells_x_trans.barcodes.txt)
2: In for (i in seq_along(snames)) { :
closing unused connection 4 (E:\Work\cxm_lnc\kb_tcc\quant_unfiltered/cells_x_trans.transcripts.txt)
3: In for (i in seq_along(snames)) { :
closing unused connection 3 (E:\Work\cxm_lnc\kb_tcc\quant_unfiltered/cells_x_trans.transcripts.txt)
What is wrong with it? Could you give me some advice to find out the reason, sir?
I'm not sure -- those are just warning messages so see your function is returning the matrix by trying to print it out.
Also double check that those files in the warning messages actually exist.
I'm not sure -- those are just warning messages so see your function is returning the matrix by trying to print it out.
Also double check that those files in the warning messages actually exist.
I try to print this res_mat
matrix out, it works. That is inspiring!
dim(res_mat)
[1] 246511 150
There is one thing I want to confirm -- the .barcodes.txt
is not the real barcodes. But this file contains 150 rows (it means one row corresponds one cell). In other words, barcodes.txt match the first column in batch file
which as one of the input files at kb count
step, do I understand correctly?
AAAAAAAAAAAAAAAA 2C-1
AAAAAAAAAAAAAAAC 2C-10
AAAAAAAAAAAAAAAG 2C-2
AAAAAAAAAAAAAAAT 2C-3
AAAAAAAAAAAAAACA 2C-4
AAAAAAAAAAAAAACC 2C-5
AAAAAAAAAAAAAACG 2C-6
AAAAAAAAAAAAAACT 2C-7
AAAAAAAAAAAAAAGA 2C-8
AAAAAAAAAAAAAAGC 2C-9
... ...
and so on
Yes, that is correct
Yes, that is correct
Very nice! Sir, about differential transcript expression (DTE) analysis, would you have some advice? the normal differential expression packages like DESeq2 or SCDE may ask a gene counts matrix as an input in their tutorials. However they do not talk about whether a transcript count matrix is suitable. It is hard for me to decide if I can use these packages or what packages should I use for DTE analysis. Could you please point me publications regarding this? Thanks a lot!
I'll be honest: I don't know too much about differential transcript expression. I can handle most kallisto issues to generate a count matrix, but downstream differential analysis (especially at the transcript-level) is a bit outside my field of expertise. I recommend asking elsewhere for advice on it.
I'll be honest: I don't know too much about differential transcript expression. I can handle most kallisto issues to generate a count matrix, but downstream differential analysis (especially at the transcript-level) is a bit outside my field of expertise. I recommend asking elsewhere for advice on it.
That's fine! I'll find some publications about DTE by myself or ask others who know this for some advice. Still appreciate you giving me so much help. \^o^/
@Yenaled hi, sir. thanks for directing me here and i've got some results with
kb count -i macaca_fascicularis.idx -g t2g.txt -x SMARTSEQ2 --loom --parity paired -o kb/ samplemeta.txt
and some information on the screen
[2022-07-07 17:23:38,298] INFO [count] Using index /data1/luofc/hanx/database/kb-reference/macaca_fascicularis/macaca_fascicularis.idx to generate BUS file to /data1/luofc/202206embryo/HUAIXIAOMA/kb/ from [2022-07-07 17:23:38,298] INFO [count] /data1/luofc/202206embryo/HUAIXIAOMA/kb/tmp/tmpcj3k772m [2022-07-07 21:48:51,894] INFO [count] Sorting BUS file /data1/luofc/202206embryo/HUAIXIAOMA/kb/output.bus to /data1/luofc/202206embryo/HUAIXIAOMA/kb/tmp/output.s.bus [2022-07-07 21:48:56,936] INFO [count] Inspecting BUS file /data1/luofc/202206embryo/HUAIXIAOMA/kb/tmp/output.s.bus [2022-07-07 21:48:58,053] INFO [count] Generating count matrix /data1/luofc/202206embryo/HUAIXIAOMA/kb/counts_unfiltered/cells_x_genes from BUS file /data1/luofc/202206embryo/HUAIXIAOMA/kb/tmp/output.s.bus [2022-07-07 21:49:00,910] INFO [count] Reading matrix /data1/luofc/202206embryo/HUAIXIAOMA/kb/counts_unfiltered/cells_x_genes.mtx [2022-07-07 21:49:02,270] WARNING [count] 5908 gene IDs do not have corresponding gene names. These genes will use their gene IDs instead. [2022-07-07 21:49:02,276] INFO [count] Writing matrix to loom /data1/luofc/202206embryo/HUAIXIAOMA/kb/counts_unfiltered/adata.loom
my question is that i've aligned the fastqs with HISAT2 and got very low alignment rate(about 20%), and i wanna known the alignment rate of kb.
i checked the files in the
kb/ ├── counts_unfiltered │ ├── adata.loom │ ├── cells_x_genes.barcodes.txt │ ├── cells_x_genes.genes.txt │ └── cells_x_genes.mtx ├── flens.txt ├── index.saved ├── inspect.json ├── kb_info.json ├── matrix.barcodes ├── matrix.cells ├── matrix.ec ├── output.bus ├── run_info.json └── transcripts.txt
it seems that no files can i get the information of alignment rate, would you please give some advice for me to get the alignment rate?
thanks!!!
^ look into the JSON files -- run_info.json gives you the alignment rate
Describe the issue Hi! I met an error when I run
kb count
to generate a gene count matrix, the outputError: Number of files (2) does not match number of input files required by technology BULK (4)
. I was confused that except the input pair-end FASTQs provited for SMART-Seq2, what other input files should I need?kb-python info
What is the exact command that was run?
Command output (with
--verbose
flag)Thank you!