marcelTBI / GenomeScreen

Scripts and data needed to run GenomeScreen

Creating the npz files #1

Closed liza-alpinia closed 2 years ago

liza-alpinia commented 2 years ago

Good day!

As far as I understand from the article, the GenomeScreen algorithm consists of 4 stages, but the sequence mapping stage is not included in the Python code. It is not entirely clear to me after which stage the numpy array should be created.

marcelTBI commented 2 years ago

Hi, yes: first you have to map the reads, and then it is highly recommended to apply LOESS correction and PCA normalization. None of these steps is included in the Python code, but all of them are described in detail in the article ( https://doi.org/10.3390/diagnostics11040708 ). Both normalizations operate on the binned read counts, so it is convenient and space-efficient to store only those, although this is not common practice.
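For illustration, here is a minimal sketch of turning mapped read positions into binned read counts, the input that both normalizations work on. The function name `bin_reads`, the bin size, and the toy positions are hypothetical and are not taken from the GenomeScreen code or the article; the actual bin size and the mapping tool should be taken from the article.

```python
import numpy as np

def bin_reads(positions, chrom_length, bin_size):
    """Count mapped reads per fixed-size bin on one chromosome.

    positions: 0-based start positions of mapped reads (assumed input format).
    """
    positions = np.asarray(positions, dtype=np.int64)
    # Ceiling division so a trailing partial bin is still counted.
    n_bins = -(-chrom_length // bin_size)
    return np.bincount(positions // bin_size, minlength=n_bins)

# Toy example: 5 reads on a 100 bp "chromosome" with 20 bp bins.
counts = bin_reads([5, 15, 25, 25, 99], chrom_length=100, bin_size=20)
print(counts)  # [2 2 0 0 1]
```

In practice the positions would come from an alignment file, and the per-chromosome counts would be concatenated in whatever order the tool expects.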

marcelTBI commented 2 years ago

However, if you do not want to do the normalizations, just fill both columns (bins_loess and bins_PCA) with the raw read count per bin. The results will be slightly less accurate, but that should suffice to test the program.
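A minimal sketch of that shortcut, assuming the npz file holds two arrays stored under the keys `bins_loess` and `bins_PCA` (the names come from the comment above; the exact array shape, dtype, and bin order GenomeScreen expects are assumptions and should be checked against the example data in the repository). The helper name `save_unnormalized_npz` and the random counts are hypothetical.

```python
import io
import numpy as np

def save_unnormalized_npz(read_counts, target):
    """Store the same raw per-bin counts under both keys.

    target may be a file name or a file-like object.
    """
    counts = np.asarray(read_counts, dtype=np.float64)
    # Assumption: both "columns" are arrays of equal length, one value per bin.
    np.savez_compressed(target, bins_loess=counts, bins_PCA=counts)

# Placeholder counts standing in for real binned read counts.
rng = np.random.default_rng(0)
raw_counts = rng.poisson(lam=50, size=1000)

# Write to an in-memory buffer here; pass a path such as "sample.npz" instead.
buffer = io.BytesIO()
save_unnormalized_npz(raw_counts, buffer)
buffer.seek(0)

loaded = np.load(buffer)
print(sorted(loaded.files))  # ['bins_PCA', 'bins_loess']
```

Because no LOESS or PCA step is applied, both arrays are identical, which matches the "test only" use described above.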

liza-alpinia commented 2 years ago

Do I understand correctly that the 789 samples described in the article are used as a reference and have no chromosomal aberrations? If not, how is the reference group constructed?

marcelTBI commented 2 years ago

Yes, exactly. Those 789 samples were used to create the "means_c15_genomic.npy" file, which stores the mean read count per bin across the 789 genetically healthy samples.
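As a rough sketch of how such a per-bin mean could be computed from a set of reference samples: the function name `reference_bin_means` and the toy data are hypothetical, and this takes a plain arithmetic mean. Whether the original file was built from normalized or depth-scaled counts is not stated here, so treat that as an open assumption.

```python
import numpy as np

def reference_bin_means(sample_counts):
    """Average per-bin read counts over reference (healthy) samples.

    sample_counts: iterable of 1-D arrays, all with the same number of bins.
    """
    stacked = np.stack([np.asarray(c, dtype=np.float64) for c in sample_counts])
    # Mean over the sample axis leaves one value per bin.
    return stacked.mean(axis=0)

# Three toy samples with three bins each.
samples = [
    np.array([10, 20, 30]),
    np.array([14, 22, 26]),
    np.array([12, 18, 34]),
]
means = reference_bin_means(samples)
print(means)  # [12. 20. 30.]
```

The result could then be written with `np.save` to a `.npy` file of your own naming for use as a custom reference.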

marcelTBI commented 2 years ago

To see how the training is done, please see https://github.com/marcelTBI/CNV_data , which describes a predecessor of GenomeScreen (a tool for NIPT CNV detection). Note that the training set there was different because of different laboratory preparation. The tool is sensitive to laboratory processing and should therefore be retrained whenever the laboratory processing differs.