Closed ndreey closed 1 year ago
generate_abundance.R
is a script that combines set_abundance()
and write_abundance_tsv
but with the addition of genome size adjusting.
A 1Gb CAMISIM has started 06:40 to see if the abundances are in agreement
it finished after 1.5h. Abundance was better but did not agree. It was around 39-56%. Small errors where found in the R script. Re running again after fix but only generating 0.5Gb (started 09:30)
Several attempts were made to generate 4GB of data per sample, but the laptop encountered errors when attempting to disable the pooled_gsa
config option. After multiple runs using various sample sizes, a subsample was created, which I will continue to work with until CAMISIM
is available on the HPC system.
Although I tried to use generate_abundance.R
to generate exact proportions, it failed. Therefore, I generated abundances with increments of 0.1 ranging from 0 to 0.9, resulting in the abundances presented in the table below.
HC | Size(Gb) | Tot.Reads | host.reads | HC% |
---|---|---|---|---|
0 | 0,92 | 6164850 | 0 | 0,0 |
0,1 | 0,96 | 6401630 | 1625806 | 25,4 |
0,2 | 0,98 | 6532582 | 2803088 | 42,9 |
0,3 | 0,98 | 6522206 | 3694752 | 56,6 |
0,4 | 0,99 | 6617132 | 4393804 | 66,4 |
0,5 | 0,98 | 6527248 | 4956636 | 75,9 |
0,6 | 0,99 | 6625532 | 5419054 | 81,8 |
0,7 | 0,99 | 6609304 | 5805802 | 87,8 |
0,8 | 1,00 | 6643978 | 6134536 | 92,3 |
0,9 | 1,00 | 6655008 | 6416996 | 96,4 |
0,95 | 1,00 | 6657668 | 6543924 | 98,3 |
After analyzing the mock reads generated, everyone but 0% HC has a host-contamination of ~99% proving that
set_abundance()
is incorrect.Based on the issues on CAMISIMs GitHub, there seem to be two common questions/misunderstandings.
abundance.tsv
, the resulting dataset may have different proportions of reads mapped to each genome. due to differences in genome size.Point 1 we accounted for i believe but not Point 2. P. zijinensis is ca 4Gb, while the other ranks are between 30Mb and 0.05 Mb which has clearly caused issue. In the current function, we are stating that 50% of P. zijinensis should be covered (2Gb) and in the local run i set the number of basepairs to be 2.5Gb so it is not odd that 99% is host data.
When doing my research i read about it and decided to add a genome size column to mock_df if Point 1 was not enough. Therefore,
set_abundance()
needs to be updated or totally revamped to account for genome size.