hadasvolk / CompLabNGS

Computational Lab in Next Generation Sequencing and Genomics Data Analysis - TAU 0411358701
MIT License

threshold transcript level #16

Closed: tehilayehudai closed this issue 4 months ago

tehilayehudai commented 4 months ago

Hi Hadas, I have a question about setting a threshold on the TPM-normalized data to speed up the analysis and reduce noise. I'm not sure how to proceed. Should we apply a specific numeric threshold to the TPM values, even though they are already normalized and a fixed cutoff might therefore not give the desired result? Or would it be better to work with the unnormalized data, even though you wrote to take the TPM column?

Thanks, Tehila

hadasvolk commented 4 months ago

Hi, you should decide on a TPM threshold value based on the data you produced. You can start with commonly used "default" values, or you can do something more data-driven, such as plotting the TPM distribution and identifying outliers, or setting a z-score threshold.
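
The z-score option mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `zscore_threshold`, the log2 transform with a pseudocount of 1, the cutoff of one standard deviation below the mean, and the TPM values are all made up, not taken from the course material.

```python
import math
import statistics


def zscore_threshold(tpm_values, z_cutoff=-1.0, pseudocount=1.0):
    """Translate a z-score cutoff on log2(TPM + pseudocount) back to a TPM value.

    Hypothetical helper: the transform and the cutoff are illustrative choices.
    """
    logs = [math.log2(v + pseudocount) for v in tpm_values]
    mean = statistics.mean(logs)
    sd = statistics.pstdev(logs)  # population standard deviation
    log_cut = mean + z_cutoff * sd
    # Undo the log transform; never return a negative threshold.
    return max(2 ** log_cut - pseudocount, 0.0)


# Made-up TPM values for a single sample.
tpm = [0.0, 0.0, 0.5, 1.0, 3.0, 7.0, 15.0, 31.0, 63.0, 127.0]

threshold = zscore_threshold(tpm, z_cutoff=-1.0)
kept = [v for v in tpm if v >= threshold]
print(f"threshold = {threshold:.3f}, kept {len(kept)} of {len(tpm)} values")
```

Because the cutoff is derived from each sample's own distribution, the resulting TPM threshold will generally differ from sample to sample.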

tehilayehudai commented 4 months ago

Thanks! But any number I pick after normalization won't suit all the samples; it would have to be chosen for each sample separately.

hadasvolk commented 4 months ago

Sorry for the confusion, I misunderstood you. We set a threshold on the raw counts (above 10 or so), not on the values after normalization.

tehilayehudai commented 4 months ago

I know. But in your Python script you wrote to take the 'tpm' column.

hadasvolk commented 4 months ago

The TPM is calculated from the raw counts after the low-count entries have been removed by the threshold. We use the TPM afterwards to compute statistics.
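
To make the order of operations concrete, here is a minimal sketch that filters on raw counts first and only then computes TPM from the surviving entries. It uses the standard TPM definition (count divided by effective length, rescaled to sum to one million). The transcript names, counts, effective lengths, and the `min_count` of 10 are hypothetical.

```python
def tpm_from_counts(counts, eff_lengths):
    """Standard TPM: per-transcript rate (count / effective length), scaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, eff_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [r / total * 1e6 for r in rates]


# Hypothetical transcripts: name -> (raw estimated count, effective length).
transcripts = {
    "tx1": (100.0, 1000.0),
    "tx2": (5.0, 500.0),
    "tx3": (300.0, 1500.0),
}

min_count = 10  # illustrative threshold on raw counts ("above 10 or so")

# Step 1: filter on the raw counts.
kept = {name: vals for name, vals in transcripts.items() if vals[0] > min_count}

# Step 2: compute TPM only from the transcripts that passed the filter.
tpm_values = tpm_from_counts(
    [count for count, _ in kept.values()],
    [length for _, length in kept.values()],
)
tpm = dict(zip(kept, tpm_values))
print(tpm)
```

Note that TPM values always sum to one million within a sample, so removing low-count entries before the calculation slightly raises the TPM of everything that remains.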

Ataliai commented 4 months ago

So in our case, is the kallisto output already filtered for low-abundance transcripts (since the tpm score appears there)? At what point were we supposed to remove them? Immediately after running kallisto, we need to sum the transcript-level counts to gene-level counts before proceeding to PyDESeq2.

In the file "all_samples_gene_abundance.csv" there are many rows, some with a very small tpm score or 0. Should genes be removed according to the tpm score at this stage?
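
The transcript-to-gene summation described above can be sketched as follows. This is an illustration only: the `tx2gene` mapping, the identifiers, and the counts are hypothetical, and in practice the mapping would come from the annotation used to build the kallisto index.

```python
from collections import defaultdict


def sum_to_gene_level(transcript_counts, tx2gene):
    """Sum transcript-level counts into gene-level counts.

    Transcripts without an entry in tx2gene are skipped.
    """
    gene_counts = defaultdict(float)
    for transcript, count in transcript_counts.items():
        gene = tx2gene.get(transcript)
        if gene is None:
            continue  # no gene annotation for this transcript
        gene_counts[gene] += count
    return dict(gene_counts)


# Hypothetical raw (estimated) counts per transcript for one sample.
transcript_counts = {"tx1": 120.0, "tx2": 30.5, "tx3": 4.0, "tx9": 2.0}

# Hypothetical transcript -> gene mapping; "tx9" is deliberately unmapped.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = sum_to_gene_level(transcript_counts, tx2gene)

# Optional filter on the summed raw counts (illustrative threshold).
min_count = 10
filtered = {gene: c for gene, c in gene_counts.items() if c > min_count}
print(gene_counts, filtered)
```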

hadasvolk commented 4 months ago

For each sample you should get an abundance.tsv file. This file has an est_counts column. You can take this column instead of the tpm column, decide on a threshold, and filter. This would give you input to pyDESeq2 that is more similar to the HW we had in class.

Generally speaking, template.ipynb is a guideline and starting point for the final assignment. You should decide how to handle the data and the procedure based on your own understanding. We should filter raw counts, not normalized counts.
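
A minimal sketch of the est_counts filtering suggested above, using only the standard library. The column names follow the kallisto abundance.tsv layout (target_id, length, eff_length, est_counts, tpm). The rows, the function name `read_filtered_counts`, and the threshold of 10 are assumptions for illustration, and an in-memory table stands in for a real file.

```python
import csv
import io


def read_filtered_counts(handle, min_count=10.0):
    """Read a kallisto-style abundance table and keep rows with est_counts above min_count."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = {}
    for row in reader:
        count = float(row["est_counts"])
        if count > min_count:
            kept[row["target_id"]] = count
    return kept


# Made-up rows in the abundance.tsv column layout.
rows = [
    ["target_id", "length", "eff_length", "est_counts", "tpm"],
    ["tx1", "1200", "1050.5", "153.2", "40.1"],
    ["tx2", "800", "650.5", "3.0", "1.3"],
    ["tx3", "2000", "1850.5", "0", "0"],
    ["tx4", "1500", "1350.5", "27.8", "5.7"],
]
sample_tsv = "\n".join("\t".join(row) for row in rows) + "\n"

counts = read_filtered_counts(io.StringIO(sample_tsv), min_count=10.0)
print(counts)

# With a real file, the same function would take an open file handle, e.g.
# open(path_to_abundance_tsv, newline=""), where the path is yours to supply.
```

Since est_counts values are fractional, they may need rounding to integers before being passed to a count-based tool; whether and how to round is a choice to check against the tool's documentation.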