nsalomonis / altanalyze

AltAnalyze is a multi-functional and easy-to-use software package for automated single-cell and bulk gene and splicing analyses. Easy-to-use precompiled graphical user-interface versions available from our website.
http://www.altanalyze.org
Apache License 2.0

TPM or RPKM? #44

Closed nsalomonis closed 4 years ago

nsalomonis commented 4 years ago

[Question from user by email]

Did AltAnalyze change its exp. file in the ExpressionInput folder? There are now ENS numbers with the : exon numbers… There is also the -steady state file, but if I rerun AltAnalyze using the steady-state file I get a warning not to run AltAnalyze using that file… Is there a way to get back to just having a gene matrix table with FPKM (nonlog) and the Gene Symbol?

nsalomonis commented 4 years ago

Did AltAnalyze change its exp. file in the ExpressionInput folder?

If you analyze FASTQ files directly with the Kallisto option, two expression files are now created: A) a file of non-log TPM values produced directly by Kallisto, with Ensembl IDs (the file name contains "Kallisto"), and B) the original non-log RPKM file, produced without Kallisto (the file name contains "Steady-State"). If you re-run AltAnalyze from the Kallisto exp. file, the results will be Kallisto-specific (TPM); if you re-run from the non-Kallisto expression file, the steady-state RPKM file will be used. Why is it now more complicated? When using Kallisto for gene quantification and splicing, we had to choose whether to default to the Kallisto TPM values or the original RPKM values. We ended up using the Kallisto TPM values for consistency, but wanted to keep RPKM available as a secondary option. The RPKM option has a larger gene-definition database by default but only uses exon-exon junctions for quantification, whereas Kallisto uses reads from any gene region.

There are now ENS numbers with the : exon numbers… There is also the -steady state file, but if I rerun AltAnalyze using the steady-state file I get a warning not to run AltAnalyze using that file…

The program starts the analysis from the non-steady-state file so that splicing analyses (exon and exon-junction level) can be run as well. This warning has always been in the software, to force the user to select the non-steady-state file for further processing. You can analyze the steady-state file directly by moving it to a new folder and removing the steady-state part of its name (splicing will then be ignored). There was no primary file with gene symbols before, but all results in the ExpressionOutput folder (the file with the DATASET prefix) have gene symbols added, for either the Kallisto TPM or the RPKM file. Kallisto DATASET files will have "Kallisto" in the name.
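The move-and-rename step for the steady-state file can be sketched in Python as follows. This is only an illustration: the function name copy_steady_state, the "-steady-state" suffix, and the example paths are assumptions, so check the actual file name in your ExpressionInput folder first. The sketch copies rather than moves, so the original file stays in place.

```python
import shutil
from pathlib import Path


def copy_steady_state(src, dest_dir, suffix="-steady-state"):
    """Copy `src` into `dest_dir`, dropping `suffix` from the file name.

    `suffix` is an assumed naming convention; pass the exact text that
    appears in your own steady-state file name.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Remove only the first occurrence of the suffix from the file name.
    dest = dest_dir / src.name.replace(suffix, "", 1)
    shutil.copyfile(src, dest)
    return dest


# Hypothetical usage (paths are placeholders):
# copy_steady_state("ExpressionInput/exp.demo-steady-state.txt",
#                   "gene_only/ExpressionInput")
```

The copied file can then be selected as the input expression file for a gene-level run.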

Is there a way to get back to just having a gene matrix table with FPKM (nonlog) and the Gene Symbol?

If you want all expression values with gene symbols, there are a few ways:

1) On the command line, add this flag when running: --inclraw yes

2) In the file Config/default-exp.txt, set include_raw_data = yes for all rows

3) In the GUI, under Expression Analysis Parameters, set the option "Include replicate experiment values in export" to yes.

Any of the above will produce a DATASET file with expression values for all samples, not just summary statistics (the option was set to "no" mainly because single-cell datasets can have thousands of columns).

To just add gene symbols to the exp. files:

1) On the command line: python AltAnalyze.py --accessoryAnalysis IDTranslation --inputIDType Ensembl --outputIDType Symbol --input "/jsmith/inputs/gene_data.txt" --species Hs

2) Through the GUI: https://altanalyze.readthedocs.io/en/latest/IdentifierTranslation/
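If you would rather do the annotation yourself outside AltAnalyze, a minimal sketch is below. It is not AltAnalyze's own implementation. It assumes a tab-delimited gene-level expression file with the Ensembl gene ID in the first column and a header row, plus a hypothetical two-column mapping file (Ensembl ID, then symbol) that you supply. The function names load_symbol_map and add_symbols are illustrative only.

```python
import csv


def load_symbol_map(path):
    """Read a two-column, tab-delimited table: Ensembl ID, gene symbol."""
    mapping = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2 and row[0]:
                mapping[row[0]] = row[1]
    return mapping


def add_symbols(exp_path, out_path, symbol_map):
    """Insert a Symbol column after the ID column; return rows written.

    Expression values are copied unchanged, so non-log input stays non-log.
    IDs missing from `symbol_map` get an empty symbol.
    """
    written = 0
    with open(exp_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        header = next(reader, None)
        if header is None:
            return 0  # empty input file, nothing to annotate
        writer.writerow([header[0], "Symbol"] + header[1:])
        for row in reader:
            if not row:
                continue  # skip blank lines
            writer.writerow([row[0], symbol_map.get(row[0], "")] + row[1:])
            written += 1
    return written


# Hypothetical usage (file names are placeholders):
# symbols = load_symbol_map("ensembl_to_symbol.txt")
# add_symbols("exp.demo.txt", "exp.demo-symbols.txt", symbols)
```

This sketch looks up each ID exactly as written, so it suits the gene-level (steady-state) file; rows whose IDs carry a ":" exon or junction suffix would not match the mapping and would get an empty symbol.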