shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License
182 stars 13 forks source link

[Suggestion] Use score calibration when identifying proviruses and plasmids #35

Open apcamargo opened 1 year ago

apcamargo commented 1 year ago

Just a suggestion for the database build step. Since the sample size is pretty big, it's worth using the --enable-score-calibration in geNomad. This parameter substantially improves the classification accuracy and should improve the reliability of the detection (see Supplementary Figure 5D here).

shenwei356 commented 1 year ago

Because this process depends on reliable estimates of the underlying compositions, it works best for samples with sufficient size (e.g. ≥ 1,000 sequences). geNomad will automatically disable calibration if the input file contains less than 1,000 sequences.

I apply geNomad to GTDB assemblies which usually contain < 200 contigs, does the flag --enable-score-calibration work?

I aim to mask both prophages and plasmids, which would affect the bacteria/viruses detection accuracy. So I used the --relaxed flag.


On an assembly (GCA_000018565.1) with plasmids and prophages, the scores of viruses increased, but that of plasmids fell, and one plasmid was missed. It might be because the plasmid_score is smaller than the threshold (0.5?)

$ seqkit seq -n GCA_000018565.1.fna.gz 
CP000875.1 Herpetosiphon aurantiacus DSM 785, complete genome
CP000876.1 Herpetosiphon aurantiacus DSM 785 plasmid pHAU01, complete sequence
CP000877.1 Herpetosiphon aurantiacus DSM 785 plasmid pHAU02, complete sequence

Without score calibration:

$ genomad end-to-end  --relaxed --cleanup --threads 40 GCA_000018565.1.fna.gz genomad1 ~/ws/db/genomad/genomad_db

$ csvtk pretty -t GCA_000018565.1_plasmid_summary.tsv 
seq_name     length   topology              n_genes   genetic_code   plasmid_score   fdr   n_hallmarks   marker_enrichment   conjugation_genes   amr_genes
----------   ------   -------------------   -------   ------------   -------------   ---   -----------   -----------------   -----------------   ---------
CP000877.1   99204    No terminal repeats   90        11             0.7170          NA    0             0.9680              NA                  NA       
CP000876.1   339639   No terminal repeats   283       11             0.5734          NA    0             -23.4690            NA                  NA 

$ csvtk pretty -t GCA_000018565.1_virus_summary.tsv 
seq_name                              length   topology   coordinates       n_genes   genetic_code   virus_score   fdr   n_hallmarks   marker_enrichment   taxonomy                                                       
-----------------------------------   ------   --------   ---------------   -------   ------------   -----------   ---   -----------   -----------------   ---------------------------------------------------------------
CP000875.1|provirus_608567_657053     48487    Provirus   608567-657053     56        11             0.8892        NA    7             21.2187             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
CP000875.1|provirus_1536575_1622920   86346    Provirus   1536575-1622920   112       11             0.8555        NA    6             23.4685             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
CP000875.1|provirus_743593_778067     34475    Provirus   743593-778067     37        11             0.8112        NA    5             17.9873             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes

With score calibration:


$ genomad end-to-end --enable-score-calibration --relaxed --cleanup --threads 25 GCA_000018565.1.fna.gz genomad.cali ~/ws/db/genomad/genomad_db

$ csvtk pretty -t GCA_000018565.1_plasmid_summary.tsv
seq_name     length   topology              n_genes   genetic_code   plasmid_score   fdr      n_hallmarks   marker_enrichment   conjugation_genes   amr_genes
----------   ------   -------------------   -------   ------------   -------------   ------   -----------   -----------------   -----------------   ---------
CP000877.1   99204    No terminal repeats   90        11             0.5709          0.4291   0             0.9680              NA                  NA

$ csvtk pretty -t GCA_000018565.1_virus_summary.tsv 
seq_name                              length   topology   coordinates       n_genes   genetic_code   virus_score   fdr      n_hallmarks   marker_enrichment   taxonomy                                                       
-----------------------------------   ------   --------   ---------------   -------   ------------   -----------   ------   -----------   -----------------   ---------------------------------------------------------------
CP000875.1|provirus_608567_657053     48487    Provirus   608567-657053     56        11             0.9726        0.0274   7             21.2187             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
CP000875.1|provirus_1536575_1622920   86346    Provirus   1536575-1622920   112       11             0.9479        0.0398   6             23.4685             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
CP000875.1|provirus_743593_778067     344
apcamargo commented 1 year ago

I misread your code. I though you were running geNomad on a concatenated FASTA (which wouldn't make any sense). You're right that calibration probably won't work well on single genomes because there are few scaffolds. I recommend using it when you have ~1,000 sequences (although ~500 or less would also work well, as you can see in Suppl. Figure 2B).

As for your example, it is difficult to tell what happened from the summary files. You can see the calibrated scores of all sequences in the *_score_calibration/*_calibrated_aggregated_classification.tsv file. The scores can either increase or decrease, it depends on the sample composition. But since you have a lot of genomes with very few scaffolds, it's probably not a good idea to calibrate.

Also, the cutoff when using the --relaxed flag is not 0.5. geNomad will assign sequences to the highest scoring class and the score can be as low as 0.34.

The only option that I can think of is concatenating all of the genomes, running geNomad on the concatenated FASTA, and then "deconcatenating". But that's a lot of work and might not work with your parallelization mechanism.