ndreey commented 1 year ago

With the chosen genomes, generate the abundance profiles. The smartest approach seems to be a bash script that generates them for us. Here is the info from my comment on [#13]

HOW I UNDERSTAND THE ABUNDANCE CALCULATION

The abundance is calculated based on the total sum of genome sizes.

Say I have G1, G2, G3, ..., G10 and an Orchid genome, and I want Orchid to have an abundance of 50%.

G1 to G5 are 1000bp each, G6 to G10 are 1500bp each and Orchid is 12000bp.

Genome    Size
G1        1000
G2        1000
...   
G6        1500
G7        1500
...    
Orchid    12000

Calculate the total genome size
- tot = 1000 x 5 + 1500 x 5 + 12000 = 24500bp
Calculate the abundance value for each genome.
- abu = 1 / (number of genomes - 1)
- With Orchid there are 11 genomes, so 11 - 1 = 10.
- abu for each of G1 to G10 is 1/10 = 0.1
Set the abundance value of Orchid to 0.5
Calculate the total abundance value for all genomes.
- abu_tot = abu_orchid + sum(abu_G1:abu_G10)
- abu_tot = 0.5 + 10 x 0.1 = 1.5
Normalize the abundance values so they sum up to 1.
- nrm_abu_orchid = abu_orchid / abu_tot = 0.5 / 1.5 = 0.3333
- nrm_abu_Gn = 0.1 / 1.5 = 0.0667
BOOOM there is your relative abundance. But NOTE there should not be a heading row in the abundance.tsv file.
However, it seems that the abundances don't have to sum up to 1, as can be seen in the example above. But doing it this way I am able to sum all abundances to 1 "ish": 0.0667 x 10 + 0.3333 ~ 1
```
Genome    Abundance
G1        0.0667
G2        0.0667
... 
G10       0.0667
Orchid    0.3333
```
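Here is a minimal Python sketch of the steps above, just to check the arithmetic. It is not the bash script mentioned earlier. The function names `abundance_profile` and `write_abundance` are made up for illustration, and the tab-separated, header-less output follows the note about abundance.tsv.

```python
import sys


def abundance_profile(genome_ids, host_id, host_abundance=0.5):
    """Return {genome_id: normalized abundance} following the steps above."""
    others = [g for g in genome_ids if g != host_id]
    # Every non-host genome gets 1 / (number of genomes - 1).
    raw = {g: 1 / len(others) for g in others}
    # The host gets the chosen raw abundance value.
    raw[host_id] = host_abundance
    # Normalize so that all values sum to 1.
    total = sum(raw.values())
    return {g: value / total for g, value in raw.items()}


def write_abundance(profile, handle):
    # No header row: one "genome_id<TAB>abundance" line per genome.
    for genome_id, value in profile.items():
        handle.write(f"{genome_id}\t{value:.4f}\n")


genomes = [f"G{i}" for i in range(1, 11)] + ["Orchid"]
profile = abundance_profile(genomes, host_id="Orchid")
write_abundance(profile, sys.stdout)
```

Passing an open file handle instead of `sys.stdout` would write the same lines to abundance.tsv.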

Good info on these issues

ndreey commented 1 year ago

CREATING DATAFRAME

I want a df with these columns: genome_id, size, taxid, tax_group, group

pseudo code

list_genome_id = []
list_size = []
list_NCBI = []
list_tax_group = []
list_group = []

for *.fasta in source_genomes/:

    # Get genome_id
    match genome_id with *.fasta filename using genome_to_id.txt
    add genome_id to list_genome_id

    # Get size  
    match size with *.fasta filename using report_genome.txt
    add size to list_size

    # Get taxid
    match NCBI_ID with genome_id using metadata.tsv
    add NCBI_ID to list_NCBI

    # Get taxonomy group id
    match tax_group with NCBI_ID using taxonomic_profile.tsv
        in $TAXPATH get first number ^[0-9]+      # 2|1239|186801   --> 2
    add tax_group to list_tax_group

    # Get humanized group name
    match tax_group with group using if
    if tax_group != 2759:
        group = "not_euk"
    else:
        group = "euk"
    add group to list_group

# Create dataframe (after the loop)
df <- data.frame(genome_id = list_genome_id, size = list_size, taxid = list_NCBI, tax_group = list_tax_group, group = list_group)
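The tax_group and group steps of the pseudo code could look roughly like this in Python. This is only a sketch: it assumes the lookups against genome_to_id.txt, report_genome.txt, metadata.tsv and taxonomic_profile.tsv are already done, since their layouts are not shown here. The helper names and the example records are made up for illustration.

```python
import re

EUKARYOTA_TAXID = 2759  # the taxid the pseudo code compares against


def first_taxid(taxpath):
    """Return the first taxid of a TAXPATH such as '2|1239|186801'."""
    match = re.match(r"\d+", taxpath.strip())
    if match is None:
        raise ValueError(f"no leading taxid in TAXPATH: {taxpath!r}")
    return int(match.group())


def humanize_group(tax_group):
    """Map the top-level taxid to the 'euk' / 'not_euk' label."""
    return "euk" if tax_group == EUKARYOTA_TAXID else "not_euk"


def build_rows(records):
    """Turn (genome_id, size, taxid, taxpath) tuples into one dict per genome."""
    rows = []
    for genome_id, size, taxid, taxpath in records:
        tax_group = first_taxid(taxpath)
        rows.append({
            "genome_id": genome_id,
            "size": size,
            "taxid": taxid,
            "tax_group": tax_group,
            "group": humanize_group(tax_group),
        })
    return rows


# Made-up example records, not real metadata from the repository.
example_records = [
    ("genome_A", 1000, 186801, "2|1239|186801"),
    ("genome_B", 12000, 33090, "2759|33090"),
]
rows = build_rows(example_records)
for row in rows:
    print(row)
```

Each dict has the five wanted columns, so the list can be handed to a dataframe constructor as is.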

ndreey commented 1 year ago

Thaliana genome (GCF_000001735.4_TAIR10.1_genomic) has been replaced with Platanthera_zijinensis_chr
Added these to source_genomes/
- Ceratobasidium_sp_CerAGI
- Rhizoctonia_solani_Rhisola1
- Tulasnella_calospora_Tulcal1

ndreey commented 1 year ago

26

ndreey commented 1 year ago

Going with 70-80% Fungi, 20-30% Bacteria/Archaea and 0.5-2% Plasmids/Circular DNA/Virus #17

ndreey commented 1 year ago

Example

Let's generate 10 GB of data from the total genome size of all the 100 genomes:
Host abundance: 50% --> 5GB of data belongs to the host
Endophyte abundance: 1 - Host abundance --> 5GB belongs to the endophytes.

The relative abundances among the endophytes are:

Fungi: 67.8% | 3.39GB
- OMF: 45.2% | 2.26GB (2 x rfungi)
  - Three OMFs --> 45.2%/3 = 15.06667% | 0.7533GB each
- rfungi: 22.6% | 1.13GB
  - Decided randomly but gotta sum up to 22.6%
Bark: 29.7%
- Decided randomly but gotta sum up to 29.7%
Plasm: 2.5%
- Decided randomly but gotta sum up to 2.5%
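A quick Python sketch of the split above, to check the numbers. It assumes every group percentage is a fraction of the endophyte portion (the 5GB), and that "decided randomly" can be done by normalizing random weights. `split_budget` and `random_split` are hypothetical helper names.

```python
import random

TOTAL_GB = 10
HOST_ABUNDANCE = 0.5
# Shares of the endophyte portion, taken from the example above.
ENDOPHYTE_SHARES = {"OMF": 0.452, "rfungi": 0.226, "Bark": 0.297, "Plasm": 0.025}


def split_budget(total_gb, host_abundance, shares):
    """Return the GB assigned to the host and to each endophyte group."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("endophyte shares must sum to 1")
    endophyte_gb = total_gb * (1 - host_abundance)
    budget = {"Host": total_gb * host_abundance}
    for group, share in shares.items():
        budget[group] = endophyte_gb * share
    return budget


def random_split(group_share, n_genomes, rng):
    """Split a group share into n random parts that sum to group_share."""
    weights = [rng.random() for _ in range(n_genomes)]
    scale = group_share / sum(weights)
    return [w * scale for w in weights]


budget = split_budget(TOTAL_GB, HOST_ABUNDANCE, ENDOPHYTE_SHARES)
for group, gb in budget.items():
    print(f"{group}\t{gb:.3f} GB")

# Example: spread the rfungi share over 4 genomes (the count is made up).
rfungi_parts = random_split(ENDOPHYTE_SHARES["rfungi"], 4, random.Random(0))
```

The three OMFs are split evenly rather than randomly, so their share is just `ENDOPHYTE_SHARES["OMF"] / 3` each.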

ndreey / ghost-magnet

CAMISIM: Generate the abundance profiles #19
