Fastq and Bam Preprocessing Guideline

XubCherif commented 2 years ago

Hi, I didn't find any recommendation on how the BAM file should be preprocessed:

Is there a recommended reference genome (hg19 or hg38)?
Should only uniquely mapped reads be kept?
Should mapping allow zero or n mismatches?
Should duplicates be removed?

I'm asking because these steps can influence the results and are mandatory with other tools/packages (is that the case here?). Any link to NIPTeR best practices would be appreciated.

Many Thanks.

ljohansson commented 2 years ago

Hi XubCherif, during our project we worked with the 1000G phase 1 reference genome for build 37. In principle, however, since all builds should have the same size, any of them should work. We have noticed that some users ran into errors in the GC correction step because of differences in the number of 50,000 bp bins; I'm not sure which reference they used. I would use a reference without any ALT sequences. Your other suggestions are good: you will already remove some noise in these preprocessing steps, and the chi-squared variation reduction algorithm can then correct for any remaining variation. Note that with the expected input, ultra-low-coverage WGS, you would not expect many duplicate reads.
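To make the point about bin counts concrete, here is a small Python sketch (not NIPTeR code) that counts how many 50,000 bp bins are needed to cover chromosomes 1 and 21 in GRCh37 and GRCh38. Counting a trailing partial bin as a full bin is an assumption of this sketch and may not match how NIPTeR bins internally; the function names are hypothetical. It only illustrates that small length differences between references can change the number of bins, which is one way a mismatch like the one described above could arise.

```python
BIN_SIZE = 50_000  # bin width mentioned in the thread

# Primary-assembly chromosome lengths in bp for two builds
# (only chromosomes 1 and 21 are listed, for illustration).
CHROM_LENGTHS = {
    "GRCh37": {"1": 249_250_621, "21": 48_129_895},
    "GRCh38": {"1": 248_956_422, "21": 46_709_983},
}


def bin_count(length_bp, bin_size=BIN_SIZE):
    """Bins needed to cover a chromosome, counting a trailing partial bin.

    Assumption of this sketch: a partial last bin counts as one bin.
    """
    return (length_bp + bin_size - 1) // bin_size


def bin_counts(build):
    """Map each listed chromosome of a build to its bin count."""
    return {chrom: bin_count(n) for chrom, n in CHROM_LENGTHS[build].items()}


if __name__ == "__main__":
    for build in CHROM_LENGTHS:
        print(build, bin_counts(build))
```

If the counts for your own reference differ from those of the samples or control group you compare against, that would be worth checking before the GC correction step.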

More information can be found in the NIPTeR papers, especially in Additional file 1 of the application note: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2557-8
Algorithm information: https://www.nature.com/articles/s41598-017-02031-5
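As a rough illustration of the read-level filters discussed in this thread (unique mapping, a mismatch cap, duplicate removal), the Python sketch below decides whether a single SAM-format alignment line should be kept. It is not part of NIPTeR. The function name `keep_read` and the thresholds (`min_mapq=20`, `max_edit_distance=1`) are hypothetical, using MAPQ as a proxy for unique mapping is aligner-dependent, and the `NM` tag is an edit distance, so it counts indels as well as mismatches.

```python
# SAM flag bits (from the SAM specification)
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def keep_read(sam_line, min_mapq=20, max_edit_distance=1):
    """Return True if a SAM alignment line passes the illustrative filters.

    Thresholds are hypothetical defaults, not NIPTeR recommendations.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality

    # Drop unmapped, secondary, supplementary, and duplicate-marked reads.
    if flag & (UNMAPPED | SECONDARY | DUPLICATE | SUPPLEMENTARY):
        return False

    # Low MAPQ usually indicates an ambiguous (non-unique) placement.
    if mapq < min_mapq:
        return False

    # Optional tags start at column 12; NM:i:<n> is the edit distance.
    for tag in fields[11:]:
        if tag.startswith("NM:i:") and int(tag[5:]) > max_edit_distance:
            return False

    return True
```

In practice the flag and MAPQ filters are usually applied with an existing tool (for example the `-F` and `-q` options of `samtools view`) rather than by parsing SAM text by hand; the duplicate check here only works if duplicates were already marked upstream.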

XubCherif commented 2 years ago

Hi @ljohansson, thanks, noted. PS: very nice papers and package.

molgenis / NIPTeR

Fastq and Bam Preprocessing Guideline #26