ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License
1 stars 0 forks source link

Abundance calculation FIX #48

Closed ndreey closed 1 year ago

ndreey commented 1 year ago

After analyzing the mock reads generated, everyone but 0% HC has a host-contamination of ~99% proving that set_abundance() is incorrect.

Based on the issues on CAMISIMs GitHub, there seem to be two common questions/misunderstandings.

  1. The values in the abundance.tsv files used in metagenomic studies. Some users mistakenly believe that these values represent absolute abundance when in fact, they represent relative abundance.
  2. Different genomes sizes can affect the interpretation of abundance data. If two genomes have different sizes and the same abundance value in abundance.tsv, the resulting dataset may have different proportions of reads mapped to each genome. due to differences in genome size.

Point 1 we accounted for i believe but not Point 2. P. zijinensis is ca 4Gb, while the other ranks are between 30Mb and 0.05 Mb which has clearly caused issue. In the current function, we are stating that 50% of P. zijinensis should be covered (2Gb) and in the local run i set the number of basepairs to be 2.5Gb so it is not odd that 99% is host data.

When doing my research i read about it and decided to add a genome size column to mock_df if Point 1 was not enough. Therefore, set_abundance() needs to be updated or totally revamped to account for genome size.

ndreey commented 1 year ago

generate_abundance.R is a script that combines set_abundance() and write_abundance_tsv but with the addition of genome size adjusting.

A 1Gb CAMISIM has started 06:40 to see if the abundances are in agreement

it finished after 1.5h. Abundance was better but did not agree. It was around 39-56%. Small errors where found in the R script. Re running again after fix but only generating 0.5Gb (started 09:30)

ndreey commented 1 year ago

Several attempts were made to generate 4GB of data per sample, but the laptop encountered errors when attempting to disable the pooled_gsa config option. After multiple runs using various sample sizes, a subsample was created, which I will continue to work with until CAMISIM is available on the HPC system.

Although I tried to use generate_abundance.R to generate exact proportions, it failed. Therefore, I generated abundances with increments of 0.1 ranging from 0 to 0.9, resulting in the abundances presented in the table below.

HC Size(Gb) Tot.Reads host.reads HC%
0 0,92 6164850 0 0,0
0,1 0,96 6401630 1625806 25,4
0,2 0,98 6532582 2803088 42,9
0,3 0,98 6522206 3694752 56,6
0,4 0,99 6617132 4393804 66,4
0,5 0,98 6527248 4956636 75,9
0,6 0,99 6625532 5419054 81,8
0,7 0,99 6609304 5805802 87,8
0,8 1,00 6643978 6134536 92,3
0,9 1,00 6655008 6416996 96,4
0,95 1,00 6657668 6543924 98,3