ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

translate score to bit-score for FCS-gx output tax-id matches? #83

Closed blemmond closed 1 week ago

blemmond commented 1 month ago

Hi,

I have been using FCS-gx to screen a large number of fungal genomic assemblies for contaminant organisms. My target organisms are all Pezizomycetes (a class of ascomycete fungi), but in addition to bacteria and viruses that are readily removed by FCS-gx, many of the remaining contaminant organisms are other ascomycete fungi (Eurotiomycetes, Dothideomycetes, etc.) that are still within the asserted "primary-div" of my target organism. I would therefore like to use the FCS-gx taxonomy report file to further screen my genomes, integrating FCS-gx with some other tool such as blobtools, which uses several pieces of evidence (taxonomic affinity, coverage, GC content) to bin contigs into organismal groups. It would save me a lot of time to be able to use the FCS-gx output taxonomy report file as input into blobtools, so that I do not have to BLAST these genomes (most are >70MB) against the NCBI nt database all over again, which is very time-consuming (for ca. 300 genomes).

However, blobtools requires the output of a taxonomic search to have a query seq id, a subject seq id, and a bit score in order to apply taxonomic affinity to contigs (https://blobtools.readme.io/docs/taxify). The FCS-gx taxonomy report output has almost all of that, except that the 'score' used in FCS-gx is not a bit score, but a positive integer. Is there any way to transform this score into a bit score, or some other way to integrate the FCS-gx output into a tool such as blobtools? I couldn't find any information on this subject online.

Any help is very much appreciated!

etvedte commented 1 month ago

Hello,

First I have some comments:

There is no way to calculate a BLAST bitscore from an FCS-GX alignment score. The FCS-GX score value is the square root of the sum of squares of alignment lengths with 100% nucleotide identity. BLAST bitscores are calculated as a weighted sum of matches/substitutions minus gap penalties and do not account for the spatial distribution of mismatches.
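To illustrate why the two scores cannot be converted into each other, here is a minimal Python sketch. It is only a reading of the one-sentence description above, not the actual FCS-GX or BLAST implementation: the function names, the run lengths, and the match/mismatch weights (+1/-2) are all assumptions chosen for illustration, and gap penalties are left out.

```python
import math

def gx_style_score(run_lengths):
    """Square root of the sum of squared lengths of exact-match runs.

    Hypothetical helper based on the description in this thread,
    not the real FCS-GX scoring code.
    """
    return math.sqrt(sum(n * n for n in run_lengths))

def match_mismatch_score(n_matches, n_mismatches, reward=1, penalty=-2):
    """Weighted sum of matches and substitutions (gaps omitted).

    The reward/penalty values are illustrative assumptions.
    """
    return n_matches * reward + n_mismatches * penalty

# Two made-up alignments, each with 100 matches and 3 mismatches.
# Three mismatches split the matches into four exact-match runs.
clustered = [97, 1, 1, 1]    # mismatches bunched together near one end
spread = [25, 25, 25, 25]    # mismatches evenly spaced

for name, runs in (("clustered", clustered), ("spread", spread)):
    linear = match_mismatch_score(sum(runs), len(runs) - 1)
    print(name, round(gx_style_score(runs), 2), linear)
```

Under these assumptions both alignments get the same weighted-sum score, while the run-based score differs by almost a factor of two. The run-based score depends on where the mismatches fall, which the weighted sum discards, so no function of one score alone can recover the other.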

Additionally, the GX taxonomy report provides organism/taxid info but not sseqids. See https://github.com/ncbi/fcs/wiki/FCS-GX-output#fcs-gx-taxonomy-report.

We acknowledge the challenge of re-screening using multiple methods, and a major reason we developed FCS-GX was to improve the processing speed of contamination detection relative to BLAST-based methods. But I also want to point out that the BLAST nt database is often assumed to be comprehensive, which it is not. In particular, it does not include WGS sequences, meaning most genomes are excluded. Depending on the taxonomic group of interest, this can greatly affect sensitivity. FCS-GX uses a custom database with a diverse set of genomes, and the Fungi are well represented.

Now an idea:

Can you run blobtools taxify on the GX taxonomy report with the parameters -a 1 -c 11 -t 7, i.e. is sseqid needed to run this? If it is needed, does it need to be unique in the input file? You could set it to column 6 (top hit organism) or column 7 again. This might achieve the ability to do taxonomic mapping at multiple ranks for sequences. The caveat here is that blobtools will parse the assignments with a specified --taxrule which is acting on GX scores not bitscores and may or may not play well with sequences with multiple rows in the GX taxonomy report. There are also some nuances where the top organism score in column 11 is not what the sequence is ultimately assigned as, but I don't think that should matter much for this specific application.

So I would definitely take care as direct integration into blobtools is not officially supported but I am interested in seeing what you get out of this.
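If blobtools turns out not to read the report directly, one fallback is to pull the three columns mentioned above into a plain three-column table first. The Python sketch below is an untested idea, not a supported workflow. It assumes the report is tab-separated, that header and metadata lines start with `#`, and that the sequence ID, top-hit tax-id, and top-hit score sit in columns 1, 7, and 11 (counting from 1), as in the suggestion above. The function name `gx_report_to_hits` and the sample rows are made up for illustration.

```python
def gx_report_to_hits(lines, qseqid_col=1, taxid_col=7, score_col=11):
    """Yield (seq-id, tax-id, score) tuples from GX taxonomy report lines.

    Column numbers are 1-based and follow the numbering used in this
    thread; check them against the header line of your own report.
    """
    needed = max(qseqid_col, taxid_col, score_col)
    for line in lines:
        # Skip blank lines and '#'-prefixed header/metadata lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < needed:
            continue
        taxid = fields[taxid_col - 1]
        score = fields[score_col - 1]
        # Rows without a top hit have nothing to map, so drop them.
        if not taxid or not score:
            continue
        yield fields[qseqid_col - 1], taxid, score

# Made-up rows in the assumed layout (values are not real results).
sample = [
    "#seq-id\tseq-type\tseq-len\tcvg-by-all\tsep1\ttax-name-1\ttax-id-1\tdiv-1\tcvg-by-div-1\tcvg-by-tax-1\tscore-1\n",
    "contig_1\t\t50000\t80\t|\tExample organism\t12345\tfung:example\t75\t70\t1234\n",
    "contig_2\t\t30000\t\t|\t\t\t\t\t\t\n",
]

for qseqid, taxid, score in gx_report_to_hits(sample):
    print(qseqid, taxid, score, sep="\t")
```

The score column is copied through unchanged, so the caveat above still applies: any downstream `--taxrule` would be acting on GX scores, not bitscores. Sequences with more than one row in the report would also produce more than one output row.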

Eric

blemmond commented 1 month ago

Eric,

Thank you so much for your comments and suggestions. I will see if I can get blobtools to interpret the FCS-gx taxonomy report, despite the differences in the FCS-gx score and bitscore and lack of explicit sseqid... if I get it to work, I will leave a comment with any notes I have from the process. I really appreciate your reply!

Ben

Ben Lemmond PhD Candidate, University of Florida 2523 Fifield Hall 2550 Hull Rd. Gainesville, FL 32611


etvedte commented 1 week ago

Please re-open this issue if you have any additional feedback to report.