ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License

Dilute mock data with reference. #9

Closed ndreey closed 1 year ago

ndreey commented 1 year ago

It does not have to be perfect; a simple script will suffice. Host contamination = number of reference (host) reads / total number of reads.
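A minimal sketch of that calculation, assuming the mock data contains no host reads to begin with. The function names `host_reads_needed` and `host_contamination` are made up for illustration and are not part of the repository. Solving host / (host + mock) = p for the number of host reads gives host = mock * p / (1 - p):

```python
def host_reads_needed(n_mock_reads: int, target_percent: int) -> int:
    """Host (reference) reads to add so that host / (host + mock) hits the target.

    target_percent is an integer percentage in [0, 100); integer arithmetic
    avoids floating point surprises such as 1 - 0.95 != 0.05.
    """
    if not 0 <= target_percent < 100:
        raise ValueError("target_percent must be in [0, 100)")
    numerator = n_mock_reads * target_percent
    denominator = 100 - target_percent
    # Ceiling division, so the target is reached rather than narrowly missed.
    return -(-numerator // denominator)


def host_contamination(n_host_reads: int, n_mock_reads: int) -> float:
    """Number of reference reads / total number of reads."""
    return n_host_reads / (n_host_reads + n_mock_reads)


if __name__ == "__main__":
    n_mock = 1_000_000  # hypothetical read count of the mock sample
    for percent in (0, 50, 80, 95):
        n_host = host_reads_needed(n_mock, percent)
        print(percent, n_host, round(host_contamination(n_host, n_mock), 4))
```

Note how quickly the host read count grows: 95% contamination needs 19 host reads for every mock read.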

I believe benchmarking at 0, 50, 80 and 95% host contamination will be enough for statistically meaningful results (the 80% level may be skipped depending on time, but it would be useful for the correlation).

ndreey commented 1 year ago

Diluting the existing mock data seems more efficient than having NGSNGS generate 10 samples using the P_zinji ref + mock data reference genomes.

ndreey commented 1 year ago

I mentioned this in a different issue: "If I were to dilute a 4.5 GB mock data file to reach 90% P. zijinensis, the file would become ~45 GB??"

This is not good.
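The ~45 GB figure can be checked with a quick back-of-the-envelope sketch. It assumes file size scales linearly with read count (same read length and compression for host and mock reads), which is only an approximation. The helper name `diluted_size_gb` is hypothetical:

```python
def diluted_size_gb(mock_size_gb: float, host_percent: int) -> float:
    """Estimated size of the diluted file when the mock reads make up
    (100 - host_percent)% of the total.

    Assumption: file size is proportional to the number of reads.
    """
    if not 0 <= host_percent < 100:
        raise ValueError("host_percent must be in [0, 100)")
    return mock_size_gb * 100 / (100 - host_percent)


if __name__ == "__main__":
    for percent in (0, 50, 80, 90, 95):
        print(f"{percent}% host: ~{diluted_size_gb(4.5, percent):.1f} GB")
```

Under these assumptions a 4.5 GB mock file grows tenfold at 90% host content, and twentyfold at 95%.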

From the paper:

This leaves me with these options..

ndreey commented 1 year ago

The most logical choice is to run CAMISIM, since it was designed to create benchmark datasets. I will therefore explore option A.