How to assess contamination calls and interpret associated xls files

Hi again!

I have ContScout crunching an initial 50 public genomes from different animals and single-cell eukaryotes - using the NCBI nr database - and have gotten back results on 7 of them. Here are the contamination counts per phylogenetic level:

Metazoa_Cnidaria_Anthozoa_Scleractinia_Acroporidae_Acropora_muricata
superkingdom 2, kingdom 3, phylum 3, class 3, order 3, family 4

Metazoa_Ctenophora_Tentaculata_Lobata_Bolinopsidae_Bolinopsis_microptera
superkingdom 9, kingdom 15, phylum 127, class 127, order 127, family 127

Metazoa_Mollusca_Gastropoda_Aplysiida_Aplysiidae_Aplysia_californica
superkingdom 11, kingdom 12, phylum 33, class 56, order 4457, family 4457

Metazoa_Nematomorpha_Gordioida_Chordodea_Parachordodidae_Gordionus_sp_m_RMFG_2023
superkingdom 0, kingdom 1, phylum 3199, class 3202, order 3202, family 3202

Metazoa_Porifera_Demospongiae_Dictyoceratida_Dysideidae_Dysidea_avara
superkingdom 5, kingdom 7, phylum 17, class 20, order 35, family 35

Metazoa_Porifera_Demospongiae_Suberitida_Halichondriidae_Halichondria_panicea
superkingdom 1, kingdom 1, phylum 1, class 1, order 4, family 4

Metazoa_Porifera_Homoscleromorpha_Homosclerophorida_Plakinidae_Corticium_candelabrum
superkingdom 4, kingdom 4, phylum 45, class 65, order 65, family 5669

You can see 3 of 7 have really high amounts of contamination - and in all three cases it goes deep to the family level. In contrast, the ContScout paper highlights discovery of contamination in public genomes, but at high taxonomic levels (from bacteria, plants, and fungi) and at not too high a percentage relative to the genome.

I'm trying to work out how to assess if these are likely correct contamination calls or might be false positives.

I understand contamination detection is going to be dependent on phylogenetic resolution within the database used as reference. I used NCBI nr as I'm guessing it is the most comprehensive across eukaryotes in general - and for species outside model organisms and vertebrates - which is where a lot of the species reside in the full 920+ genomes of interest. And I'm guessing the issue is that phylogenetic gaps in the database between the target species and the closest database species will reduce the phylogenetic resolution at which contamination can be detected. So if the ContScout phylogenetic resolution is at Family level for a given target species assembly, the database is sufficient in providing taxonomic diversity relative to the target species for high-resolution detection - at least in a general sense. Is this a fair assessment? And then whatever the phylogenetic resolution / level of contamination detection - the accuracy is independent - so true positive vs false positive calls by ContScout are not sensitive to database selection and its underlying phylogenetic representation of species diversity.

I was also wondering:

For a given species run through ContScout - what would make it sensitive to false-positive declarations of contamination?

What is a scalable (when possible) method you would recommend for evaluating things as true vs false positive - or is this really possible outside structured / annotated datasets?

I am also wondering how to interpret the filtered Excel files - the structure / information within cells is unclear. Are the Excel files something I might leverage in assessing ContScout declared contamination as true or false positive?

Any guidance here would be greatly appreciated!

Thank you very much :) Eric

Dear Eric,

Long read follows, sorry for that. For each run, you need to check the *.RunDiag.xlsx file, which holds several run quality metrics that help you decide on the finest taxonomic rank at which the ContScout output can be considered meaningful. As demonstrated in the MS, under ideal conditions (i.e. with many closely related genomes available in the database both for the contaminants and for the genome of interest), the tool can achieve perfect separation even at family level. In other cases (with fewer relatives in the DB), even reaching class rank can prove challenging.

In the manual, I tried to highlight a few things to look for when evaluating / interpreting results, but I understand that all the Excel data might look odd at first glance.

Here, I am giving you a few detailed examples of what to look for.

You are right, in the 833 genome contamination section we did not go beyond superkingdom resolution. There are historical reasons for this. In the first ContScout implementation, we intentionally aimed to perform the separation at a coarse level. Then, during the MS review, a consensus request came in from the reviewers for a finer resolution. This was a very reasonable request, so we modified the tool and the whole decision-making logic accordingly. The first version, with the coarse taxonomic approach, could be fully automated. However, as we go towards finer resolution, there is a big risk of running into false positives simply because the taxon labels for the genuine "host" proteins start to scatter. CS gives output for each taxon level, but it is up to the operator to evaluate them and pick the right rank. The good news is that distortion effects (loss of precision) can be clearly seen in the diagnostic data.

Main considerations, column names to look for

medRLE_no_null: When assigning individual taxon tags to proteins, CS keeps a log of the number of consecutive second, third ... best hits that also support the tag call. For instance, if the first 100 best hits all say that your protein is from E. coli, that is far better than a case where only 1-2 consecutive best hits support the protein-level taxon call. This value tells you the median number of consecutive best hits that support the protein calls. The larger the number, the better. You will often see this number fall sharply as you move to finer ranks. If it drops below about 10, you need to start being cautious. If it goes down to about 5 or below, that is a good indication that the database does not contain enough closely related data for your genome at the given taxon rank.
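To make the idea concrete, here is a minimal Python sketch of a median run length, assuming each protein comes with a ranked list of taxon tags from its database hits. The function names and the input layout are hypothetical, and the way ContScout actually computes medRLE_no_null (for instance, how proteins without hits are handled) may differ.

```python
from statistics import median

def leading_run_length(hit_tags):
    """Count how many consecutive top hits carry the same tag as the best hit."""
    if not hit_tags:
        return 0
    best = hit_tags[0]
    run = 0
    for tag in hit_tags:
        if tag != best:
            break
        run += 1
    return run

def median_run_length(hits_by_protein):
    """Median leading run length over proteins that have at least one hit."""
    runs = [leading_run_length(tags) for tags in hits_by_protein.values() if tags]
    return median(runs) if runs else None

# Toy input (invented for illustration): ranked hit tags per protein.
hits = {
    "prot1": ["Mollusca"] * 12 + ["Annelida"],
    "prot2": ["Mollusca", "Mollusca", "Chordata"],
    "prot3": ["Proteobacteria"] * 40,
    "prot4": [],  # no hits: skipped here (an assumed reading of "no_null")
}
print(median_run_length(hits))
```

A low median in such a calculation means that most tag calls rest on only a handful of agreeing hits, which is the situation described above for sparsely sampled lineages.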

The second most important metric is the Jaccard index. The larger the number, the better. Remember, CS marks potentially foreign proteins in two rounds. First, proteins are tagged individually based on their best hits. Then, all proteins are grouped according to the contigs they belong to, and the final decision is made on a majority vote basis. Ideally, the proteins marked as foreign in the first step should be exactly the same as those marked for removal after the consensus vote. That would indicate a perfect separation of "contamination" and "genuine" proteins over the contigs, leading to a Jaccard value of 1. In real life, things tend to be more complicated. For instance, HGT by definition lowers the Jaccard index. Anyhow, if you see this value drop sharply as you go towards finer taxonomic resolution, that is a strong signal to stop. Together with this statistic, you have to check the IndivProtDrop (marked in the first step) and CtgProtDrop (marked by the consensus vote) counts. As long as they look similar, you are good to go. When IndivProtDrop starts increasing far above CtgProtDrop, that is a strong signal to stop. In general, you hope to see a contamination signal that is stable across multiple consecutive taxon ranks (especially the number of suspected / marked proteins and the Jaccard index).
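The two-round logic and the resulting Jaccard value can be illustrated with a small Python sketch. It assumes a simple "more than half of the proteins on a contig" vote; the function names are hypothetical, the toy data are invented, and the real ContScout voting rules (ties, untagged proteins, thresholds) may differ.

```python
from collections import defaultdict

def contig_consensus(protein_to_contig, foreign_individual):
    """Mark all proteins of a contig as foreign when more than half of them
    were individually marked as foreign (assumed simple majority rule)."""
    by_contig = defaultdict(list)
    for prot, ctg in protein_to_contig.items():
        by_contig[ctg].append(prot)
    dropped = set()
    for prots in by_contig.values():
        n_foreign = sum(p in foreign_individual for p in prots)
        if n_foreign * 2 > len(prots):
            dropped.update(prots)
    return dropped

def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy data: contig c1 is fully foreign, contig c2 has one stray foreign call.
protein_to_contig = {
    "p1": "c1", "p2": "c1", "p3": "c1",
    "p4": "c2", "p5": "c2", "p6": "c2", "p7": "c2",
}
indiv = {"p1", "p2", "p3", "p4"}                      # first round (like IndivProtDrop)
consensus = contig_consensus(protein_to_contig, indiv)  # second round (like CtgProtDrop)

print(len(indiv), len(consensus), jaccard(indiv, consensus))
```

In this toy case the single stray call on c2 (which could be HGT or a mislabelled host protein) is outvoted, so the two sets differ by one protein and the Jaccard index falls below 1.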

Also, please have a look at NumTagsKeptByCtg. This value tells you how many different taxon tags at a given taxon rank were observed in the contigs that were finally kept as genuine by CS. Ideally, this value should be 1 or close to it. In real life, it can be slightly higher, but it should be fairly stable as you move from coarse to fine ranks. Often, there is a taxon rank where this value starts increasing sharply. When it does, that is a strong indication to stop and fall back to a coarser taxon rank.

Similarly, NumTagsDropByCtg, which tells you the number of tags among contaminant proteins, should be fairly stable across taxon ranks and should remain generally low. This value is less of an indicator, though.

NumMixedTags is also an important metric. It tells you the number of taxon tags that appear both among the kept and the dropped proteins in the CS results. In the ideal case, this number should be zero or close to zero. As with NumTagsKeptByCtg, we expect this value to be stable across several ranks, while a rapid increase is likely to indicate loss of precision.
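As a rough illustration of what the three tag counts describe, the sketch below derives them from a per-protein tag table and a set of dropped proteins. The function name and input layout are hypothetical and the toy data are invented; this only mirrors the definitions given above and is not ContScout's own code.

```python
def tag_diagnostics(protein_tags, dropped_proteins):
    """Count distinct taxon tags among kept proteins, among dropped proteins,
    and tags that occur on both sides."""
    kept_tags = {t for p, t in protein_tags.items() if p not in dropped_proteins}
    dropped_tags = {t for p, t in protein_tags.items() if p in dropped_proteins}
    return {
        "NumTagsKeptByCtg": len(kept_tags),
        "NumTagsDropByCtg": len(dropped_tags),
        "NumMixedTags": len(kept_tags & dropped_tags),
    }

# Toy data: one taxon tag per protein at a single rank (phylum here).
protein_tags = {
    "p1": "Mollusca", "p2": "Mollusca", "p3": "Chordata",
    "p4": "Proteobacteria", "p5": "Proteobacteria", "p6": "Mollusca",
}
dropped = {"p4", "p5", "p6"}  # proteins on contigs removed by the consensus vote

print(tag_diagnostics(protein_tags, dropped))
```

Here "Mollusca" appears on both the kept and the dropped side, so it counts as one mixed tag. A growing number of such overlaps at finer ranks is the loss of precision described above.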

Unfortunately, there is no better method for now than to manually judge all these metrics jointly. If you are in doubt, you might wish to stay at a coarse rank (superkingdom or kingdom), as these are generally more resistant to sparse taxon sampling.
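For a first-pass screen across many genomes, the checks above could be sketched as follows. This is only a rough helper under stated assumptions, not a replacement for manual judgement: the "Jaccard" key name, the input layout (one dict per rank, ordered from coarse to fine), and all thresholds except the medRLE_no_null value of about 5 are assumptions, and the example numbers are invented.

```python
def finest_plausible_rank(rows, min_rle=5, max_jaccard_drop=0.2,
                          max_indiv_ratio=2.0, max_tag_growth=2.0):
    """Walk ranks from coarse to fine; return (last accepted rank,
    rank where a warning appeared, list of warning reasons)."""
    accepted, prev = None, None
    for row in rows:
        reasons = []
        if row["medRLE_no_null"] <= min_rle:
            reasons.append("medRLE_no_null at or below threshold")
        if row["IndivProtDrop"] > max_indiv_ratio * max(row["CtgProtDrop"], 1):
            reasons.append("IndivProtDrop far above CtgProtDrop")
        if prev is not None:
            if prev["Jaccard"] - row["Jaccard"] > max_jaccard_drop:
                reasons.append("sharp Jaccard drop")
            if row["NumTagsKeptByCtg"] > max_tag_growth * max(prev["NumTagsKeptByCtg"], 1):
                reasons.append("sharp rise in NumTagsKeptByCtg")
        if reasons:
            return accepted, row["rank"], reasons
        accepted, prev = row["rank"], row
    return accepted, None, []

# Invented example values, ordered from coarse to fine rank.
rows = [
    {"rank": "superkingdom", "medRLE_no_null": 60, "Jaccard": 0.95,
     "IndivProtDrop": 12, "CtgProtDrop": 11, "NumTagsKeptByCtg": 1},
    {"rank": "kingdom", "medRLE_no_null": 45, "Jaccard": 0.93,
     "IndivProtDrop": 14, "CtgProtDrop": 12, "NumTagsKeptByCtg": 1},
    {"rank": "phylum", "medRLE_no_null": 4, "Jaccard": 0.30,
     "IndivProtDrop": 900, "CtgProtDrop": 40, "NumTagsKeptByCtg": 6},
]
accepted, stopped_at, reasons = finest_plausible_rank(rows)
print(accepted, stopped_at, reasons)
```

Genomes flagged by such a screen would still need the joint manual inspection described above; the helper only narrows down where to look.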

Hope this information helps. Also, if you wish, we could agree on an online meeting where I could have a look at your actual real-life outputs and help you gain confidence regarding where to draw the line for ContScout.

Yours

Balazs

On Thu, 24 Oct 2024 at 21:21, Eric Edsinger @.***> wrote:


h836472 / ContScout

How to assess contamination calls and interpret associated xls files #12