Closed ndreey closed 1 year ago
Re-ran CAMISIM and... all P.zijinensis reads are "NNNNNNNNNNN"
~ andbo master ﲵ bash 7.499s Tuesday at 7:55 PM
~ zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa.fasta.gz | awk '/^>/{name=$0; getline seq; if (seq !~ /^N+$/) print name;}' | wc -l
61460
~ andbo master ﲵ bash 4.777s Tuesday at 7:55 PM
~ zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa.fasta.gz | grep ">" | wc -l
3046581
~ andbo master ﲵ bash 3.41s Tuesday at 7:56 PM
zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa.fasta.gz | grep "Chr" | wc -l
2985121
~ andbo master ﲵ bash 3.443s Tuesday at 7:56 PM
~ zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa.fasta.gz | grep -A 4 "Chr" | head
>Chr01_288559838_from_188_to_447_total_260
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Chr01_288559838_from_1586_to_1805_total_220
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Chr01_288559838_from_2446_to_2725_total_280
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Chr01_288559838_from_3951_to_4334_total_384
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Chr01_288559838_from_7122_to_7384_total_263
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
It seems to be that it is only the P. zijinensis contigs that are affected. The other two genomes that was added to the reference list have a correct sequence.
~ andbo master ﲵ bash
~ zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa_mapping.tsv | grep "scaffold_1" | head
scaffold_1_from_7870_to_8166_total_297 Ceratobasidium_sp_CerAGI 305860 297
scaffold_1_from_8545_to_8819_total_275 Ceratobasidium_sp_CerAGI 305860 275
...
~ andbo master ﲵ bash 0.02s Tuesday at 8:12 PM
~ zcat platanthera_mock/095_output/2023.04.25_18.48.59_sample_0/contigs/gsa.fasta.gz | grep -A 4 "scaffold_1" | head
>scaffold_1_from_7870_to_8166_total_297
AAATTACCTACCCAAAGCACAATGCATGAGTAGAGGTAATTGGGTGAGAACACGCTTCATTTTCCAAGTTTCCAGACTATCCCATCTCTGCTGTCTCTGACGGGCTTGTTCGTATCTCGTGCCAGAACGACCAGACATGAACGTGTGACACCAGCCTTTCGAGAGCATGGCGATAAGTGACACAGCGATGCTAAGACTCAATGACGTGTACCAAAGCGCGTTTATTACCACACTTAGACGCGAGGCTGAGAAATTAGATTCAAGAGTGGGCAAATTCGGAGTGAGGGTGGGCTCGTT
>scaffold_1_from_8545_to_8819_total_275
TTACGACTAACCTATGCCACCCTTCGATCATCTCCCTGTCTGCACGGTCAGTCTGCTGGACGTAACTTCCCCAAATCTTAGCATCTTGTGCTAGTTCAGCGCCAATTTCTTCAAAGCTATCCTAAGACCAAAGGTGTTGGGTCAGCCTGTGGAGCGCCAGTGATTCTGAATGTAGTTTACTTACTGCATCCTTAGTTGCCTCAGGTGTTGAGCCAGGGTCCACTGTAGTCCCACTCTTCGGAAGGAACATCTTGGTGTCGCAGGGCTTTCGCAAA
...
Besides the host genome being 4GB while the others barely reach 1GB, the fasta.fai
files differ.
All the genomes from CAMI have set the number of bases per line to 60.
Tulasnella and Ceratobasidium have 70 and P. zijinensis have 100.
I will, therefore, re-standardize the P. zijinensis sequence length to 60 and re-reun the metagenome simulation and see if the error persist. If the error persists, I will half the P. zijinensis genome, and if an error still persists i will change out the genome to A. thaliana to act as a host genome.
The issue was the length of the sequence per line.
This command was used to generate a sequence length fo 60 per line.
awk 'BEGIN{first=1} /^>/ {if(first){printf("%s",$0);first=0}else{printf("\n%s\n",$0);}next; } { printf("%s",$0);}END{printf("\n");}' test2.fasta | fold -w 60 > test2_reformatted.fasta
After running a quick test of CAMISIM
i now get this.
~ zcat platanthera_mock/095_output/2023.04.25_21.55.02_sample_0/contigs/gsa.fasta.gz | grep -A 4 "Chr" | head
>Chr01_288559838_from_1586_to_1805_total_220
TCATTATAGAACTCTAAGACATCTAATCACCCCAGGTATAGTCTTCATCAGTGGAAAGCCTGCCCACACGTTGTTCGATACTAGAGCTACCCACTCCTTTGTGTCGGAGATGTTCGTAGAAGGACTGGAAGGCGTGGAGCAATTCACGGGCTCAAAGATGAGGGTGCGACTACCAGATGGATCTGAGCTCCTTACCACGGACCACTACAATATGGAGGCA
>Chr01_288559838_from_2446_to_2725_total_280
CACACTTCTTTTCAAAATTGGATTTGAGGTCGGGCTACTACCAGCTTCATATTCGAGAGGAGGATATTAAGAAGACTGCTTTTAGTAGCAGATATGGATTGTTTGAGTTTACGGTGATGCCCTTTGGGCTCACGAATGCACCTTCGGCGTTCATGACCCTCATGAACCAGGTTTTCAAGGAGTATCTAGATATGTTTGTCATTGTGTTTATCGATGACATCCTCATCTACTCGAGAATGGAAGAGGAGCACGCTCGCCATCTAGCACTGACCCTACAGAG
~ zcat platanthera_mock/095_output/2023.04.25_21.55.02_sample_0/contigs/gsa.fasta.gz | awk '/^>/{name=$0; getline seq; if (seq !~ /^N+$/) print name;}' | grep "Chr" | wc -l
1657440
Now, i just need to re-run every step...
In the last step of AMBER, I noticed that the bin with the majority of host reads was compromised only of N's.
I have backtracked the error back to the initial step of CAMISIM.
CAMISIM generated, however a gold standard assembly,
gsa.fasta.gz
where the contigs belonging to host only contained N's.Which is also confirmed here:
Not a single contig from the host (only one with "Chr") contains a normal sequence.