Applying on new datasets

rashindrie commented 2 years ago

Hi,

Many thanks for your work. I was able to successfully reproduce your original work, and now I am interested in applying it to my own dataset. I have extracted 109 count files (XXX.FPKM-UQ.txt), and each file has the following format:

ENSG00000000003.13  222222.22222222
ENSG00000000005.5   222222.22222222
ENSG00000000419.11  222222.22222222
ENSG00000000457.12  222222.22222222
ENSG00000000460.15  222222.22222222
ENSG00000000938.11  222222.22222222
ENSG00000000971.14  222222.22222222
ENSG00000001036.12  222222.22222222
ENSG00000001084.9   222222.22222222
ENSG00000001167.13  222222.22222222
ENSG00000001460.16  222222.22222222
ENSG00000001461.15  222222.22222222
ENSG00000001497.15  222222.22222222
ENSG00000001561.6   222222.22222222

Could you please advise on how I can generate the expression matrix from the count files?
Also, is the annotation table prepared manually, or is it produced while extracting the data?

Thanks, Rashindrie

rashindrie commented 2 years ago

I have another question: does “TCGA.RNA.Rda” contain FPKM values, FPKM-UQ values, or something else?

Thanks, Rashindrie

mabraao commented 2 years ago

Hi @rashindrie, nice to hear about your progress!

For the input files, you will need two things: a normalized expression matrix, where the columns are sample IDs and the rows are gene names, and a meta table where each row gives a sample ID and its corresponding cancer type. The sample IDs in both files should match.
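
As a rough illustration (not code from the CaCTS repository), the per-sample files shown earlier in this thread could be combined into such a matrix and meta table roughly as follows. This is a minimal sketch in plain Python. The function names, the tab-separated output layout, and the column headers `gene`, `sample_id` and `cancer_type` are all assumptions made for the example, and it does not perform any normalization or batch correction. Gene IDs are kept exactly as they appear in the files. Whether they need to be converted to gene names first depends on what the downstream code expects.

```python
import csv


def read_sample(path):
    """Read one two-column, whitespace-separated file into {gene_id: value}."""
    values = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            gene_id, value = fields
            values[gene_id] = float(value)
    return values


def build_matrix(sample_files):
    """Combine per-sample files into (genes, sample_ids, rows).

    sample_files maps a sample ID to the path of its file (hypothetical
    layout). Only genes present in every sample are kept, so the matrix
    has no missing cells. rows[i][j] is the value of genes[i] in
    sample_ids[j].
    """
    if not sample_files:
        raise ValueError("no sample files given")
    sample_ids = sorted(sample_files)
    per_sample = {sid: read_sample(sample_files[sid]) for sid in sample_ids}
    shared = set.intersection(*(set(v) for v in per_sample.values()))
    genes = sorted(shared)
    rows = [[per_sample[sid][gene] for sid in sample_ids] for gene in genes]
    return genes, sample_ids, rows


def write_matrix(path, genes, sample_ids, rows):
    """Write the matrix as TSV: genes in rows, sample IDs in columns."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["gene"] + sample_ids)
        for gene, row in zip(genes, rows):
            writer.writerow([gene] + row)


def write_meta(path, cancer_types, sample_ids):
    """Write one row per sample: sample ID and its cancer type.

    cancer_types maps sample ID -> cancer type label. A missing sample
    raises KeyError, which keeps the two files consistent.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["sample_id", "cancer_type"])
        for sid in sample_ids:
            writer.writerow([sid, cancer_types[sid]])
```

Passing the same `sample_ids` list to both writers is what keeps the matrix columns and the meta table rows in agreement.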

Regarding the normalization, different procedures were applied to the TCGA data, but it was similar to ComBat adjustment for batch effects. I would recommend applying similar procedures to your data and then doing a sanity check to see whether the expression profiles of your genes behave the same way as they do in TCGA. For example, does the gene expression for the cancer type that you are adding correlate with the TCGA samples of the same cancer type?
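
One possible way to run that sanity check is sketched below, under stated assumptions: both datasets are already loaded as dictionaries mapping a gene to its list of values across samples of one cancer type, values are non-negative, and agreement is measured as the Pearson correlation between per-gene mean log2(value + 1) profiles over the shared genes. The log transform, the choice of Pearson correlation and the function names are illustrative choices, not the procedure used for the TCGA data.

```python
import math


def mean_profile(matrix):
    """Map {gene: [values across samples]} to {gene: mean log2(value + 1)}."""
    return {
        gene: sum(math.log2(v + 1.0) for v in values) / len(values)
        for gene, values in matrix.items()
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def profile_correlation(new_matrix, reference_matrix):
    """Correlate mean expression profiles over genes present in both inputs."""
    new_profile = mean_profile(new_matrix)
    ref_profile = mean_profile(reference_matrix)
    shared = sorted(set(new_profile) & set(ref_profile))
    if len(shared) < 2:
        raise ValueError("need at least two shared genes")
    return pearson(
        [new_profile[g] for g in shared],
        [ref_profile[g] for g in shared],
    )
```

A value close to 1 would suggest that the new samples follow the reference profile for that cancer type, while a clearly lower value than the one seen between TCGA samples of the same type would point to a normalization or batch problem worth investigating.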

lawrenson-lab / CaCTS

Applying on new datasets #6