timbitz / Whippet.jl

Lightweight and Fast; RNA-seq quantification at the event-level
MIT License

whippet-quant.jl does not output integer gene level read counts #128

Closed MiqG closed 2 years ago

MiqG commented 2 years ago

Hi!

First, thanks for creating such an efficient and useful package!

I have just started trying out Whippet.jl to extract information from RNA-seq data. So far, following your documentation, I was able to build the index using GENCODE annotations and run whippet-quant, which produced 5 files (I used quant as the prefix here):

$ ls
quant.gene.tpm.gz  quant.isoform.tpm.gz  quant.jnc.gz  quant.map.gz  quant.psi.gz

While exploring the file that quantifies mRNA levels at the gene level, I saw that you report them as TPM and read count. However, the read count is not an integer in all cases. Is there a reason for that? I would also like to run differential gene expression analysis with packages like DESeq, which only accept integer count data. Should I round the counts to integers?

$ zcat quant.gene.tpm.gz | head -20
Gene                TpM    Read_Counts
ENSG00000225605.3   0.0    0.0
ENSG00000273004.1   0.1    4.0
ENSG00000142945.13  37.7   6997.0456
ENSG00000143341.12  0.0    13.0
ENSG00000174307.7   19.8   1976.9979
ENSG00000116299.17  0.2    8.0
ENSG00000222552.1   0.0    0.0
ENSG00000042781.14  0.0    0.0
ENSG00000241347.3   0.0    0.0
ENSG00000214204.4   0.0    1.0
ENSG00000287525.1   0.0    0.0
ENSG00000225387.1   0.0    0.0
ENSG00000170989.10  0.0    1.0
ENSG00000158793.14  13.1   956.009
ENSG00000180409.3   0.0    0.0
ENSG00000134717.18  37.9   4078.0465000000004
ENSG00000275213.1   0.0    0.0
ENSG00000229989.5   0.0    0.0
ENSG00000201925.1   0.0    0.0
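
For tools that require integer input, one possible option is to round the Read_Counts column before passing it on. The Python sketch below is only an illustration, not something Whippet provides. It assumes the table is tab-separated with the header names Gene, TpM, and Read_Counts as shown above. The helper name read_rounded_counts is hypothetical.

```python
import csv
import gzip
import io


def read_rounded_counts(handle):
    """Return {gene_id: integer_count} from a gene-level table.

    Assumes tab-separated text with 'Gene' and 'Read_Counts' columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = {}
    for row in reader:
        # round() sends exact halves to the nearest even integer;
        # other values go to the nearest integer.
        counts[row["Gene"]] = int(round(float(row["Read_Counts"])))
    return counts


# In-memory example using two rows from the output above.
sample = (
    "Gene\tTpM\tRead_Counts\n"
    "ENSG00000142945.13\t37.7\t6997.0456\n"
    "ENSG00000174307.7\t19.8\t1976.9979\n"
)
counts = read_rounded_counts(io.StringIO(sample))
print(counts)

# For the real file (path taken from the listing above):
# with gzip.open("quant.gene.tpm.gz", "rt") as fh:
#     counts = read_rounded_counts(fh)
```

Whether rounding is appropriate depends on the downstream tool and analysis. The sketch only shows the mechanics.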

I am running Whippet v1.6.1.

Thank you very much in advance!

Miquel

timbitz commented 2 years ago

Hi @MiqG, floating-point read counts for genes and isoforms result from multi-mapping or ambiguous reads, which the expectation-maximization algorithm can assign partially to each compatible gene or isoform. This is expected (details are in the paper, and in many other papers and software that do essentially the same thing). How you handle this downstream with other tools is up to you.
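
To illustrate why partial assignment produces non-integer counts, here is a toy expectation-maximization sketch in Python. It is not Whippet's implementation: it ignores transcript length, alignment scores, and everything else a real quantifier models. The gene names, the read set, and the function name em_counts are made up for the example.

```python
def em_counts(compat, n_iter=200):
    """Toy EM: split each read across the genes it is compatible with.

    compat is a list of sets; each set holds the genes one read maps to.
    Returns the expected (possibly fractional) read count per gene.
    """
    genes = sorted(set().union(*compat))
    # Start from uniform relative abundances.
    theta = {g: 1.0 / len(genes) for g in genes}
    counts = {g: 0.0 for g in genes}
    for _ in range(n_iter):
        # E-step: share each read in proportion to current abundances.
        counts = {g: 0.0 for g in genes}
        for hits in compat:
            total = sum(theta[g] for g in hits)
            for g in hits:
                counts[g] += theta[g] / total
        # M-step: re-estimate abundances from the expected counts.
        n = sum(counts.values())
        theta = {g: counts[g] / n for g in genes}
    return counts


# 5 reads unique to gene A, 2 unique to gene B, 3 mapping to both.
reads = [{"A"}] * 5 + [{"B"}] * 2 + [{"A", "B"}] * 3
result = em_counts(reads)
print(result)
```

In this toy case the three ambiguous reads are shared between A and B in proportion to their estimated abundances, so both genes end up with fractional counts (about 7.14 and 2.86) even though every input read is a whole read. The total stays at 10.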

MiqG commented 2 years ago

Perfect, thank you!