PISA parse2 - GitHub issues

L1angyan commented 2 years ago

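Hello, this is my first time working with single-cell transcriptome data from the DNBelab C4 platform. I converted the FASTQ files to FASTQ+ with: PISA parse2 -x C4 -t 8 FASTQ_R1 FASTQ_R2 -1 FASTQ_OUT. However, the number of reads that pass QC or have an exactly matched barcode is consistently about an order of magnitude lower than the total read count. Is this normal, or does my library structure differ from the C4 library structure preset in PISA? I look forward to your reply.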

L1angyan commented 2 years ago

One more question: for the DNBelab C4 platform, should I use -ignore-strand with PISA anno?

shiquan commented 2 years ago

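On the first question: check whether your read 1 is 30 bp long. The C4 structure is a 20 bp cell barcode followed by a 10 bp UMI. If the problem persists, I suggest asking your collaborators. On the second question: no, you don't need it.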
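A quick way to run the check described above is to tally read 1 lengths before parsing. The following is a minimal Python sketch, not part of PISA: the helper names (open_fastq, read_length_counts, split_c4_read1) and the file name in the usage comment are hypothetical, and the 20 + 10 split simply follows the layout stated in this comment.

```python
import gzip
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def read_length_counts(lines, max_reads=100_000):
    """Count sequence lengths from an iterable of FASTQ lines (4 lines per record)."""
    # Every 4th line, starting at the 2nd, holds the sequence.
    seqs = islice(lines, 1, None, 4)
    return Counter(len(s.strip()) for s in islice(seqs, max_reads))


def split_c4_read1(seq):
    """Split a 30 bp read 1 into (cell barcode, UMI), assuming the 20 + 10 layout."""
    if len(seq) != 30:
        raise ValueError(f"expected a 30 bp read 1, got {len(seq)} bp")
    return seq[:20], seq[20:30]


# Usage (file name is a placeholder):
# with open_fastq("read_1.fq.gz") as fh:
#     print(read_length_counts(fh))
```

If most reads are not 30 bp, the preset layout probably does not match the library.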

L1angyan commented 2 years ago

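Thank you very much for the answer. On the second question, why isn't it needed? The manual says -ignore-strand can be omitted only for strand-specific libraries. I aligned R2 as if it were bulk RNA-seq data, and in IGV I see reads in both orientations at the same position, which suggests this is not a strand-specific library.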

shiquan commented 2 years ago

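Because library preparations such as C4 and 10X are strand-specific: the sequenced reads have the same orientation as the RNA. In addition, any RNA with a polyA tail is captured, including lncRNAs (some of which are antisense RNAs), so seeing antisense RNA is normal. Is your library from nuclei? For whole-cell libraries the proportion of antisense RNA is low (generally speaking, <10%), whereas for nuclei it is higher, because antisense RNA is specifically enriched in the nucleus [1,2]. The proportion of intronic reads is also higher in nuclei libraries, because incompletely spliced, immature RNA is likewise enriched in the nucleus.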

References:

[1] Derrien, Thomas, et al. "The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression." Genome Research 22.9 (2012): 1775-1789.
[2] Halpern, Keren Bahar, et al. "Nuclear retention of mRNA in mammalian tissues." Cell Reports 13.12 (2015): 2653-2662.
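To compare a dataset against the rough <10% figure above, one could tally how many aligned reads fall on the opposite strand of the gene they overlap. This is only an illustrative Python sketch: the function names are hypothetical, and it assumes you have already extracted (read strand, gene strand) pairs from your alignments and annotation, with the read strand taken to reflect the RNA orientation as described in the comment.

```python
VALID_STRANDS = {"+", "-"}


def orientation(read_strand, gene_strand):
    """Return 'sense' if the read lies on the gene's strand, else 'antisense'."""
    if read_strand not in VALID_STRANDS or gene_strand not in VALID_STRANDS:
        raise ValueError("strands must be '+' or '-'")
    return "sense" if read_strand == gene_strand else "antisense"


def antisense_fraction(pairs):
    """Fraction of (read_strand, gene_strand) pairs that are antisense."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no reads given")
    n_anti = sum(orientation(r, g) == "antisense" for r, g in pairs)
    return n_anti / len(pairs)
```

A fraction well above the whole-cell range would be consistent with a nuclei library, but it does not by itself show that the library is unstranded.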

L1angyan commented 2 years ago

Oh, I see now. I downloaded the data from "Rolling back human pluripotent stem cells to an eight-cell embryo-like stage". Also, regarding the earlier problem: read 1 in this dataset is 41 bp, so it was probably generated with an older version of the library structure. From https://www.biorxiv.org/content/10.1101/818450v3.full : "For C4 scRNA-seq data, the cell barcodes (base 1 to base 10 and base 17 to base 26) and UMIs (base 32 to 41) are in read 1 and the cDNA reads are in read 2". Thank you for the detailed answers ^_^
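The coordinates quoted from the preprint are 1-based and inclusive, which is easy to get wrong when slicing. A small Python sketch of the older 41 bp layout, assuming exactly those positions (the function names are hypothetical and this is not how PISA itself parses reads):

```python
def slice_1based(seq, start, end):
    """Return bases start..end of seq, using 1-based inclusive coordinates."""
    return seq[start - 1:end]


def split_old_c4_read1(seq):
    """Split a 41 bp read 1 into (barcode part 1, barcode part 2, UMI).

    Positions follow the preprint text: cell barcode at bases 1-10 and 17-26,
    UMI at bases 32-41.
    """
    if len(seq) < 41:
        raise ValueError(f"expected at least 41 bp, got {len(seq)} bp")
    cb1 = slice_1based(seq, 1, 10)
    cb2 = slice_1based(seq, 17, 26)
    umi = slice_1based(seq, 32, 41)
    return cb1, cb2, umi
```

Each of the three segments is 10 bp long; the bases between them (11-16 and 27-31) are not described in the quoted text and are ignored here.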

shiquan commented 2 years ago

Yes, that is an earlier test version of the library. You can try this command: PISA parse2 -rule 'CR,R1:1-10,config/DNBelabC4_barcodes.txt,CB,1;CR,R1:17-26,config/DNBelabC4_barcodes.txt,CB,1;UR,R1:27-36;R1,R2' -1 parsed.fq read_1.fq read_2.fq

I haven't tested it, so treat it as a starting point and adjust as needed. The file config/DNBelabC4_barcodes.txt is in the PISA directory.
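Since the command is offered as an untested starting point, it can help to generate the -rule string from the coordinates rather than editing it by hand. Note that the command places the UMI at 27-36, while the preprint text quoted earlier in this thread gives bases 32-41, so that range is one thing to check. The Python sketch below only mirrors the string format shown in the command; build_rule is a hypothetical helper, the meaning of the trailing "1" field is not stated in this thread, and the output has not been verified against PISA.

```python
def build_rule(cb_segments, umi, whitelist="config/DNBelabC4_barcodes.txt",
               trailing_field="1"):
    """Assemble a -rule string in the same format as the command in this thread.

    cb_segments: list of (start, end) barcode ranges on read 1, 1-based inclusive.
    umi: (start, end) UMI range on read 1, 1-based inclusive.
    trailing_field: copied verbatim from the thread; its meaning is not given there.
    """
    parts = [
        f"CR,R1:{start}-{end},{whitelist},CB,{trailing_field}"
        for start, end in cb_segments
    ]
    parts.append(f"UR,R1:{umi[0]}-{umi[1]}")
    parts.append("R1,R2")
    return ";".join(parts)


# Rule as suggested in the thread (UMI at 27-36):
suggested = build_rule([(1, 10), (17, 26)], (27, 36))

# Variant using the UMI position quoted from the preprint (32-41):
variant = build_rule([(1, 10), (17, 26)], (32, 41))
```

Either string can then be passed, single-quoted, as the -rule argument in place of the one in the command.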

L1angyan commented 2 years ago

OK, thank you very much!!

shiquan / PISA

PISA parse2 #8