ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License

CAMISIM: Generate the abundance profiles #19

Open ndreey opened 1 year ago

ndreey commented 1 year ago

With the chosen genomes, generate the abundance profiles. The smartest approach seems to be a bash script that generates them for us. Here is the info from my comment on [#13]:

HOW I UNDERSTAND THE ABUNDANCE CALCULATION

The abundance is calculated based on the total sum of genome sizes.
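
One possible reading of that statement, as a minimal sketch: each genome's share is its size divided by the summed size of all genomes. Whether CAMISIM uses exactly this normalization is an assumption here, and the genome names and sizes are hypothetical.

```python
def size_fractions(genome_sizes):
    """Return each genome's share of the summed genome size.

    genome_sizes maps a genome id to its size in base pairs.
    """
    total = sum(genome_sizes.values())
    if total <= 0:
        raise ValueError("total genome size must be positive")
    return {genome_id: size / total for genome_id, size in genome_sizes.items()}


# Hypothetical sizes in base pairs, for illustration only.
example_sizes = {"genome_A": 4_000_000, "genome_B": 1_000_000}
fractions = size_fractions(example_sizes)
for genome_id, fraction in fractions.items():
    print(genome_id, fraction)
```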

Good info on these issues

ndreey commented 1 year ago

CREATING DATAFRAME

pseudo code

list_genome_id = []
list_size = []
list_NCBI = []
list_tax_group = []
list_group = []

for *.fasta in source_genomes/:

    # Get genome_id
    match genome_id with *.fasta filename using genome_to_id.txt
    add genome_id to list_genome_id

    # Get size  
    match size with *.fasta filename using report_genome.txt
    add size to list_size

    # Get taxid
    match NCBI_ID with genome_id using metadata.tsv
    add NCBI_ID to list_NCBI

    # Get taxonomy group id
    match tax_group with NCBI_ID using taxonomic_profile.tsv
        in $TAXPATH get the first number ^[0-9]+  # 2|1239|186801   --> 2
    add tax_group to list_tax_group

    # Get human-readable group name
    map tax_group to group using if
    if tax_group != 2759:
        group = "not_euk"
    else:
        group = "euk"
    add group to list_group

# Create dataframe (after the loop, once all lists are filled)
df <- data.frame(genome_id = list_genome_id, size = list_size, taxid = list_NCBI, tax_group = list_tax_group, group = list_group)
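
The pseudocode above could be sketched in Python roughly as follows. This is a sketch under stated assumptions, not the project's actual script: it assumes the four files (genome_to_id.txt, report_genome.txt, metadata.tsv, taxonomic_profile.tsv) have already been parsed into plain dictionaries, since their exact layouts are not shown here. The function names and all example values are hypothetical.

```python
EUKARYOTA_TAXID = "2759"  # top-level taxid the pseudocode treats as "euk"


def top_level_taxid(taxpath):
    """Return the first taxid of a '|'-separated TAXPATH, e.g. '2|1239|186801' -> '2'."""
    return taxpath.split("|", 1)[0]


def build_rows(fasta_to_id, fasta_to_size, id_to_ncbi, ncbi_to_taxpath):
    """Build one record per FASTA file with the same columns as the pseudocode dataframe."""
    rows = []
    for fasta in sorted(fasta_to_id):
        genome_id = fasta_to_id[fasta]
        ncbi_id = id_to_ncbi[genome_id]
        tax_group = top_level_taxid(ncbi_to_taxpath[ncbi_id])
        rows.append(
            {
                "genome_id": genome_id,
                "size": fasta_to_size[fasta],
                "taxid": ncbi_id,
                "tax_group": tax_group,
                "group": "euk" if tax_group == EUKARYOTA_TAXID else "not_euk",
            }
        )
    return rows


# Hypothetical inputs, for illustration only.
fasta_to_id = {"bacterium.fasta": "genome_1", "fungus.fasta": "genome_2"}
fasta_to_size = {"bacterium.fasta": 4_000_000, "fungus.fasta": 40_000_000}
id_to_ncbi = {"genome_1": "186801", "genome_2": "4751"}
ncbi_to_taxpath = {"186801": "2|1239|186801", "4751": "2759|4751"}

rows = build_rows(fasta_to_id, fasta_to_size, id_to_ncbi, ncbi_to_taxpath)
for row in rows:
    print(row)
```

If pandas is available, the list of records can be turned into a table with `pandas.DataFrame(rows)`.
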
ndreey commented 1 year ago

26

ndreey commented 1 year ago

Going with 70-80% Fungi, 20-30% Bacteria/Archaea and 0.5-2% Plasmids/Circular DNA/Virus #17
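
A minimal sketch of how group-level proportions like these could be turned into per-genome abundances. The concrete shares (0.75, 0.24, 0.01) are just illustrative picks inside the stated ranges, the even split within each group is an assumption, and the genome ids and group labels are hypothetical.

```python
def split_group_shares(genome_groups, group_shares):
    """Split each group's share evenly across the genomes assigned to that group.

    genome_groups maps genome id -> group label.
    group_shares maps group label -> fraction of the community (must sum to 1).
    """
    if abs(sum(group_shares.values()) - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")

    members = {}
    for genome_id, group in genome_groups.items():
        members.setdefault(group, []).append(genome_id)

    if set(members) != set(group_shares):
        raise ValueError("every group needs a share and at least one genome")

    abundances = {}
    for group, genome_ids in members.items():
        per_genome = group_shares[group] / len(genome_ids)
        for genome_id in genome_ids:
            abundances[genome_id] = per_genome
    return abundances


# Illustrative values only; shares are picked from inside the ranges above.
group_shares = {"fungi": 0.75, "bacteria_archaea": 0.24, "plasmid_virus": 0.01}
genome_groups = {
    "genome_1": "fungi",
    "genome_2": "fungi",
    "genome_3": "bacteria_archaea",
    "genome_4": "plasmid_virus",
}

abundances = split_group_shares(genome_groups, group_shares)
for genome_id, abundance in abundances.items():
    print(f"{genome_id}\t{abundance}")
```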

ndreey commented 1 year ago

Example

The relative abundances among the endophytes are: