Abundance calculation FIX

ndreey commented 1 year ago

After analyzing the mock reads generated, everyone but 0% HC has a host-contamination of ~99% proving that set_abundance() is incorrect.

Based on the issues on CAMISIMs GitHub, there seem to be two common questions/misunderstandings.

The values in the abundance.tsv files used in metagenomic studies. Some users mistakenly believe that these values represent absolute abundance when in fact, they represent relative abundance.
Different genomes sizes can affect the interpretation of abundance data. If two genomes have different sizes and the same abundance value in abundance.tsv, the resulting dataset may have different proportions of reads mapped to each genome. due to differences in genome size.

Point 1 we accounted for i believe but not Point 2. P. zijinensis is ca 4Gb, while the other ranks are between 30Mb and 0.05 Mb which has clearly caused issue. In the current function, we are stating that 50% of P. zijinensis should be covered (2Gb) and in the local run i set the number of basepairs to be 2.5Gb so it is not odd that 99% is host data.

When doing my research i read about it and decided to add a genome size column to mock_df if Point 1 was not enough. Therefore, set_abundance() needs to be updated or totally revamped to account for genome size.

ndreey commented 1 year ago

generate_abundance.R is a script that combines set_abundance() and write_abundance_tsv but with the addition of genome size adjusting.

A 1Gb CAMISIM has started 06:40 to see if the abundances are in agreement

it finished after 1.5h. Abundance was better but did not agree. It was around 39-56%. Small errors where found in the R script. Re running again after fix but only generating 0.5Gb (started 09:30)

ndreey commented 1 year ago

Several attempts were made to generate 4GB of data per sample, but the laptop encountered errors when attempting to disable the pooled_gsa config option. After multiple runs using various sample sizes, a subsample was created, which I will continue to work with until CAMISIM is available on the HPC system.

Although I tried to use generate_abundance.R to generate exact proportions, it failed. Therefore, I generated abundances with increments of 0.1 ranging from 0 to 0.9, resulting in the abundances presented in the table below.

HC: Abundance i wanted to generate.
Size(Gb): Amount of bases generated
Tot.Reads: Number of + and - reads generated summed
host.reads: Number of + and - host reads generated summed
HC%: The percentage of host-contamination in generated sample

HC	Size(Gb)	Tot.Reads	host.reads	HC%
0	0,92	6164850	0	0,0
0,1	0,96	6401630	1625806	25,4
0,2	0,98	6532582	2803088	42,9
0,3	0,98	6522206	3694752	56,6
0,4	0,99	6617132	4393804	66,4
0,5	0,98	6527248	4956636	75,9
0,6	0,99	6625532	5419054	81,8
0,7	0,99	6609304	5805802	87,8
0,8	1,00	6643978	6134536	92,3
0,9	1,00	6655008	6416996	96,4
0,95	1,00	6657668	6543924	98,3

ndreey / ghost-magnet

Abundance calculation FIX #48