Open apcamargo opened 1 year ago
Because this process depends on reliable estimates of the underlying compositions, it works best for samples with sufficient size (e.g. ≥ 1,000 sequences). geNomad will automatically disable calibration if the input file contains less than 1,000 sequences.
I apply geNomad to GTDB assemblies which usually contain < 200 contigs, does the flag --enable-score-calibration
work?
I aim to mask both prophages and plasmids, which would affect the bacteria/viruses detection accuracy. So I used the --relaxed
flag.
On an assembly (GCA_000018565.1) with plasmids and prophages, the scores of viruses increased, but that of plasmids fell, and one plasmid was missed. It might be because the plasmid_score is smaller than the threshold (0.5?)
$ seqkit seq -n GCA_000018565.1.fna.gz
CP000875.1 Herpetosiphon aurantiacus DSM 785, complete genome
CP000876.1 Herpetosiphon aurantiacus DSM 785 plasmid pHAU01, complete sequence
CP000877.1 Herpetosiphon aurantiacus DSM 785 plasmid pHAU02, complete sequence
Without score calibration:
$ genomad end-to-end --relaxed --cleanup --threads 40 GCA_000018565.1.fna.gz genomad1 ~/ws/db/genomad/genomad_db
$ csvtk pretty -t GCA_000018565.1_plasmid_summary.tsv
seq_name length topology n_genes genetic_code plasmid_score fdr n_hallmarks marker_enrichment conjugation_genes amr_genes
---------- ------ ------------------- ------- ------------ ------------- --- ----------- ----------------- ----------------- ---------
CP000877.1 99204 No terminal repeats 90 11 0.7170 NA 0 0.9680 NA NA
CP000876.1 339639 No terminal repeats 283 11 0.5734 NA 0 -23.4690 NA NA
$ csvtk pretty -t GCA_000018565.1_virus_summary.tsv
seq_name length topology coordinates n_genes genetic_code virus_score fdr n_hallmarks marker_enrichment taxonomy
----------------------------------- ------ -------- --------------- ------- ------------ ----------- --- ----------- ----------------- ---------------------------------------------------------------
CP000875.1|provirus_608567_657053 48487 Provirus 608567-657053 56 11 0.8892 NA 7 21.2187 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
CP000875.1|provirus_1536575_1622920 86346 Provirus 1536575-1622920 112 11 0.8555 NA 6 23.4685 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
CP000875.1|provirus_743593_778067 34475 Provirus 743593-778067 37 11 0.8112 NA 5 17.9873 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
With score calibration:
$ genomad end-to-end --enable-score-calibration --relaxed --cleanup --threads 25 GCA_000018565.1.fna.gz genomad.cali ~/ws/db/genomad/genomad_db
$ csvtk pretty -t GCA_000018565.1_plasmid_summary.tsv
seq_name length topology n_genes genetic_code plasmid_score fdr n_hallmarks marker_enrichment conjugation_genes amr_genes
---------- ------ ------------------- ------- ------------ ------------- ------ ----------- ----------------- ----------------- ---------
CP000877.1 99204 No terminal repeats 90 11 0.5709 0.4291 0 0.9680 NA NA
$ csvtk pretty -t GCA_000018565.1_virus_summary.tsv
seq_name length topology coordinates n_genes genetic_code virus_score fdr n_hallmarks marker_enrichment taxonomy
----------------------------------- ------ -------- --------------- ------- ------------ ----------- ------ ----------- ----------------- ---------------------------------------------------------------
CP000875.1|provirus_608567_657053 48487 Provirus 608567-657053 56 11 0.9726 0.0274 7 21.2187 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
CP000875.1|provirus_1536575_1622920 86346 Provirus 1536575-1622920 112 11 0.9479 0.0398 6 23.4685 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
CP000875.1|provirus_743593_778067 344
I misread your code. I though you were running geNomad on a concatenated FASTA (which wouldn't make any sense). You're right that calibration probably won't work well on single genomes because there are few scaffolds. I recommend using it when you have ~1,000 sequences (although ~500 or less would also work well, as you can see in Suppl. Figure 2B).
As for your example, it is difficult to tell what happened from the summary files. You can see the calibrated scores of all sequences in the *_score_calibration/*_calibrated_aggregated_classification.tsv
file. The scores can either increase or decrease, it depends on the sample composition. But since you have a lot of genomes with very few scaffolds, it's probably not a good idea to calibrate.
Also, the cutoff when using the --relaxed
flag is not 0.5. geNomad will assign sequences to the highest scoring class and the score can be as low as 0.34.
The only option that I can think of is concatenating all of the genomes, running geNomad on the concatenated FASTA, and then "deconcatenating". But that's a lot of work and might not work with your parallelization mechanism.
Just a suggestion for the database build step. Since the sample size is pretty big, it's worth using the
--enable-score-calibration
in geNomad. This parameter substantially improves the classification accuracy and should improve the reliability of the detection (see Supplementary Figure 5D here).