ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License
1 stars 0 forks source link

CAMISIM: Generated GSA.fasta containing only "N" for host genome. #64

Closed ndreey closed 1 year ago

ndreey commented 1 year ago

In the last step of AMBER, I noticed that the bin with the majority of host reads was compromised only of N's.

>Chr01_288559838_from_4490413_to_4491469_total_1057
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...

I have backtracked the error back to the initial step of CAMISIM.

Which is also confirmed here:

zcat CAMISIM/platanthera_mock/09_output/contigs/gsa.fasta.gz | awk '/^>/{name=$0; getline seq; if (seq !~ /^N+$/) print name;}' | grep "Chr" | wc -l
0

Not a single contig from the host (only one with "Chr") contains a normal sequence.

ndreey commented 1 year ago

Re-ran CAMISIM and... all P.zijinensis reads are "NNNNNNNNNNN"

~ andbo   master ﲵ bash                                                        7.499s  Tuesday at 7:55 PM
~ zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa.fasta.gz | awk '/^>/{name=$0; getline seq; if (seq !~ /^N+$/) print name;}' | wc -l
61460

~ andbo   master ﲵ bash                                                        4.777s  Tuesday at 7:55 PM
~ zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa.fasta.gz | grep ">" | wc -l
3046581

~ andbo   master ﲵ bash                                                         3.41s  Tuesday at 7:56 PM
 zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa.fasta.gz | grep "Chr" | wc -l
2985121

~ andbo   master ﲵ bash                                                        3.443s  Tuesday at 7:56 PM
~  zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa.fasta.gz | grep -A 4 "Chr" | head
>Chr01_288559838_from_188_to_447_total_260
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Chr01_288559838_from_1586_to_1805_total_220
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Chr01_288559838_from_2446_to_2725_total_280
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Chr01_288559838_from_3951_to_4334_total_384
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Chr01_288559838_from_7122_to_7384_total_263
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

3046581-61460 = 2985121

ndreey commented 1 year ago

It seems to be that it is only the P. zijinensis contigs that are affected. The other two genomes that was added to the reference list have a correct sequence.

~ andbo   master ﲵ bash
~ zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa_mapping.tsv | grep "scaffold_1" | head
scaffold_1_from_7870_to_8166_total_297  Ceratobasidium_sp_CerAGI        305860  297
scaffold_1_from_8545_to_8819_total_275  Ceratobasidium_sp_CerAGI        305860  275
...

~ andbo   master ﲵ bash                                                         0.02s  Tuesday at 8:12 PM
~  zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa.fasta.gz | grep -A 4 "scaffold_1" | head
>scaffold_1_from_7870_to_8166_total_297
AAATTACCTACCCAAAGCACAATGCATGAGTAGAGGTAATTGGGTGAGAACACGCTTCATTTTCCAAGTTTCCAGACTATCCCATCTCTGCTGTCTCTGACGGGCTTGTTCGTATCTCGTGCCAGAACGACCAGACATGAACGTGTGACACCAGCCTTTCGAGAGCATGGCGATAAGTGACACAGCGATGCTAAGACTCAATGACGTGTACCAAAGCGCGTTTATTACCACACTTAGACGCGAGGCTGAGAAATTAGATTCAAGAGTGGGCAAATTCGGAGTGAGGGTGGGCTCGTT
>scaffold_1_from_8545_to_8819_total_275
TTACGACTAACCTATGCCACCCTTCGATCATCTCCCTGTCTGCACGGTCAGTCTGCTGGACGTAACTTCCCCAAATCTTAGCATCTTGTGCTAGTTCAGCGCCAATTTCTTCAAAGCTATCCTAAGACCAAAGGTGTTGGGTCAGCCTGTGGAGCGCCAGTGATTCTGAATGTAGTTTACTTACTGCATCCTTAGTTGCCTCAGGTGTTGAGCCAGGGTCCACTGTAGTCCCACTCTTCGGAAGGAACATCTTGGTGTCGCAGGGCTTTCGCAAA
...
ndreey commented 1 year ago

Besides the host genome being 4GB while the others barely reach 1GB, the fasta.fai files differ. All the genomes from CAMI have set the number of bases per line to 60. Tulasnella and Ceratobasidium have 70 and P. zijinensis have 100.

I will, therefore, re-standardize the P. zijinensis sequence length to 60 and re-reun the metagenome simulation and see if the error persist. If the error persists, I will half the P. zijinensis genome, and if an error still persists i will change out the genome to A. thaliana to act as a host genome.

ndreey commented 1 year ago

EUREKA!!!

The issue was the length of the sequence per line.

This command was used to generate a sequence length fo 60 per line. awk 'BEGIN{first=1} /^>/ {if(first){printf("%s",$0);first=0}else{printf("\n%s\n",$0);}next; } { printf("%s",$0);}END{printf("\n");}' test2.fasta | fold -w 60 > test2_reformatted.fasta

After running a quick test of CAMISIM i now get this.

~ zcat platanthera_mock/095_output/2023.04.25_21.55.02_sample_0/contigs/gsa.fasta.gz | grep -A 4 "Chr" | head
>Chr01_288559838_from_1586_to_1805_total_220
TCATTATAGAACTCTAAGACATCTAATCACCCCAGGTATAGTCTTCATCAGTGGAAAGCCTGCCCACACGTTGTTCGATACTAGAGCTACCCACTCCTTTGTGTCGGAGATGTTCGTAGAAGGACTGGAAGGCGTGGAGCAATTCACGGGCTCAAAGATGAGGGTGCGACTACCAGATGGATCTGAGCTCCTTACCACGGACCACTACAATATGGAGGCA
>Chr01_288559838_from_2446_to_2725_total_280
CACACTTCTTTTCAAAATTGGATTTGAGGTCGGGCTACTACCAGCTTCATATTCGAGAGGAGGATATTAAGAAGACTGCTTTTAGTAGCAGATATGGATTGTTTGAGTTTACGGTGATGCCCTTTGGGCTCACGAATGCACCTTCGGCGTTCATGACCCTCATGAACCAGGTTTTCAAGGAGTATCTAGATATGTTTGTCATTGTGTTTATCGATGACATCCTCATCTACTCGAGAATGGAAGAGGAGCACGCTCGCCATCTAGCACTGACCCTACAGAG

~ zcat platanthera_mock/095_output/2023.04.25_21.55.02_sample_0/contigs/gsa.fasta.gz | awk '/^>/{name=$0; getline seq; if (seq !~ /^N+$/) print name;}' | grep "Chr" | wc -l
1657440

Now, i just need to re-run every step...