marcelTBI / GenomeScreen

Scripts and data needed to run GenomeScreen

can you tell me how to normalize from read counts? #6

Open ChenDepp opened 1 month ago

ChenDepp commented 1 month ago

Hi @marcelTBI, I currently have only the raw read counts for each bin. Which hg19 genome are you using? Can you give me a download URL? I want to use the trained means file (npy) you built, so I should use the same human genome as you. Normalization includes GC bias correction and principal component analysis (PCA) normalization. Can you tell me how to normalize from read counts? I read the README (https://github.com/marcelTBI/CNV_data). If I use the trained means file you built, then I just need to do the GC bias correction myself and then apply the PCA normalization using your trained npy file to get the normalized bin counts. Is that right? Waiting for your reply! Have a good day.

marcelTBI commented 1 month ago

Hi, I am not entirely sure, but I think we used this one: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

I think you are right. To be sure, you need to do this:

  1. bin reads (you already have this)
  2. GC correction of binned reads (no script is provided in the CNV_data/GenomeScreen repositories, as it is a quite standard procedure)
  3. train PCA normalization, either on your samples or on the samples in CNV_data (python create_pca.py)
  4. apply PCA normalization to your samples (python add_pca.py)
  5. train means (optional), either on your samples or on the samples in CNV_data (python train_means.py), or use the pretrained means provided in the CNV_data/GenomeScreen repos
  6. run GenomeScreen
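
Since no script is provided for step 2, here is a minimal sketch of one common way to do GC correction: scale each bin so that every GC stratum ends up with the same median count. This is only an illustration and not necessarily the procedure used to build the pretrained files. The function name `gc_correct`, the `gc_step` parameter, and the example values are all hypothetical, and the GC fraction of each bin is assumed to have been computed beforehand from the same hg19 reference.

```python
from collections import defaultdict
from statistics import median


def gc_correct(counts, gc_fractions, gc_step=0.01):
    """Scale each bin so that every GC stratum has the same median count.

    counts        raw read count per bin
    gc_fractions  GC fraction (0..1) of the reference sequence in each bin
    gc_step       width of one GC stratum (hypothetical default)
    """
    if len(counts) != len(gc_fractions):
        raise ValueError("counts and gc_fractions must have the same length")

    # Group the non-empty bins by their rounded GC content.
    strata = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        if count > 0:
            strata[round(gc / gc_step)].append(count)

    if not strata:
        return [0.0] * len(counts)

    # Median over all non-empty bins, and median within each GC stratum.
    global_median = median(c for values in strata.values() for c in values)
    stratum_median = {key: median(values) for key, values in strata.items()}

    corrected = []
    for count, gc in zip(counts, gc_fractions):
        if count > 0:
            key = round(gc / gc_step)
            corrected.append(count * global_median / stratum_median[key])
        else:
            # Empty bins carry no signal and are left at zero.
            corrected.append(0.0)
    return corrected


# Made-up example: bins with 50% GC are sequenced twice as deeply.
example_counts = [100, 110, 200, 220, 0]
example_gc = [0.40, 0.40, 0.50, 0.50, 0.45]
example_corrected = gc_correct(example_counts, example_gc)
```

A LOESS fit of count against GC content is a frequently used alternative to per-stratum medians. Whichever variant is chosen, it should match the correction applied to the samples that the PCA and means were trained on.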

Steps 3-5 are described in more detail in the https://github.com/marcelTBI/CNV_data repository. I hope this helps. This tool is no longer actively developed, so unfortunately I will not be able to help you further.
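
To illustrate the general idea behind steps 3 and 4, here is a generic sketch of PCA normalization: learn the leading principal components from a set of reference samples, then subtract each sample's projection onto them. It is not the code from `create_pca.py` or `add_pca.py`, whose exact behavior and file formats should be checked in the CNV_data repository. The function names, the `n_components` parameter, and the assumption that inputs are GC-corrected and scaled to a common depth are all hypothetical.

```python
import numpy as np


def train_pca(reference, n_components):
    """Learn the bin-wise mean and leading principal axes.

    reference     array of shape (n_samples, n_bins), assumed to be
                  GC-corrected and scaled to a common depth
    n_components  number of leading components to remove (hypothetical)
    """
    reference = np.asarray(reference, dtype=float)
    mean = reference.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(sample, mean, components):
    """Subtract the sample's projection onto the learned components."""
    sample = np.asarray(sample, dtype=float)
    centered = sample - mean
    # Part of the sample explained by the systematic components.
    systematic = centered @ components.T @ components
    return sample - systematic


# Made-up example: 20 reference samples sharing one bias pattern over 50 bins.
rng = np.random.default_rng(0)
pattern = rng.normal(size=50)
weights = rng.normal(size=20)
reference = 100.0 + np.outer(weights, pattern)

mean, components = train_pca(reference, n_components=1)
normalized = apply_pca(100.0 + 3.0 * pattern, mean, components)
```

In this toy example the only variation between samples is the shared pattern, so removing one component maps the new sample back onto the reference mean. With real data, the number of components is a tuning choice, and pretrained files can only be reused if the bins, reference genome, and GC correction match those used in training.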