ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

Running FCS-GX on genomes of unidentified prokaryotic organisms #56

Closed d-golzato closed 7 months ago

d-golzato commented 8 months ago

Dear FCS-GX developers, thank you for this invaluable tool!

I need to submit a prokaryotic genome to NCBI for which no Taxonomy ID is available. However, I want to proactively check for potential contaminating sequences in advance of submission. Currently, I am using the generic Taxonomy ID 2725 (unidentified prokaryotic organism) to run FCS-GX, as I expect it to identify and remove any sequences originating from non-prokaryotic organisms.

I have a few questions on this matter:

I would greatly appreciate your assistance with any of these queries. Thank you.

etvedte commented 8 months ago

Hello,

Interesting use case. Is this a metagenome or a genome?

Tax-id 2725 corresponds to "other unclassified bacteria" and is not a superset of all bacteria. So running with that tax-id in this situation is not the preferred option.

I would recommend running the genome with the parameter --species unknown and without the --tax-id parameter, which allows GX to make its best guess at the source organism. Next, look at the log for the GX run. There should be one or more prokaryote divisions listed under "Inferred primary-divs". The first division in that list is the prokaryote group with the largest aggregate coverage of the query genome. You can then re-run the genome with the --species or --tax-id value corresponding to that prokaryote group. For example, if a-proteobacteria is the first division, you could run --species alphaproteobacteria or --tax-id 28211.
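The log-inspection step described above could be scripted roughly as follows. This is only a sketch: the exact layout of the "Inferred primary-divs" line is an assumption (a label, a colon, then a list of division names, possibly quoted and prefixed such as `prok:`), the helper names are hypothetical, and the sample log line is made up for illustration. The only division-to-tax-id mapping included is the one given in this thread.

```python
import re
from typing import List, Optional

# Only mapping stated in this thread; add other divisions yourself.
DIV_TO_TAX_ID = {"a-proteobacteria": 28211}


def inferred_primary_divs(log_text: str) -> List[str]:
    """Return the divisions on the first 'Inferred primary-divs' line.

    Assumes the line looks like: Inferred primary-divs : ['div1', 'div2']
    (quoted or unquoted, comma- or space-separated).
    """
    for line in log_text.splitlines():
        _, label, rest = line.partition("Inferred primary-divs")
        if not label:
            continue
        # Drop the separator colon that follows the label, if present.
        value = rest.split(":", 1)[1] if ":" in rest else rest
        quoted = re.findall(r"['\"]([^'\"]+)['\"]", value)
        if quoted:
            return quoted
        return [tok for tok in re.split(r"[,\s\[\]]+", value) if tok]
    return []


def first_division(log_text: str) -> Optional[str]:
    """Return the first listed division without any 'prefix:' part."""
    divs = inferred_primary_divs(log_text)
    if not divs:
        return None
    return divs[0].split(":", 1)[-1]


# Hypothetical log excerpt, not copied from a real GX run.
sample_log = "Inferred primary-divs : ['prok:a-proteobacteria', 'prok:g-proteobacteria']"
division = first_division(sample_log)
print(division, DIV_TO_TAX_ID.get(division))
```

If the real log uses a different layout, only `inferred_primary_divs` should need adjusting; the first entry it returns is the division to use for the second run.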

Is FCS-GX with its default parameters equivalent to the contamination screening performed by NCBI after genome submission? Should I change any specific parameter to obtain the same results?

Use the default FCS-GX parameters. There are some small screens done in internal testing that aren't yet pushed out to external...we will be unifying the two in the future. For now, please also run FCS-adaptor to match NCBI results.

d-golzato commented 8 months ago

Hello etvedte,

Thanks for your quick and exhaustive answer! I'm working with single-amplified genomes of unknown species. I've tried to run fcs-gx with the --species unknown parameter but I'm getting an error due to --tax-id being required:

python ./fcs.py --image ./fcs-gx.sif screen genome --fasta fcsgx_test.fa.gz --out-dir gx-out --species unknown --gx-db "$GXDB/gxdb/"

INFO: Converting SIF file to temporary sandbox...
usage: screen genome [-h] --fasta FASTA_FILE --tax-id TAX_ID [--species SPECIES] [--split-fasta BOOL] [--div DIV] [--gx-db GX_DB] [--mask-transposons BOOL] [--bin-dir BIN_DIR] [--allow-same-species BOOL] [--out-basename NAME] [--out-dir OUT_DIR] [--action-report BOOL] [--save-hits BOOL] [--generate-logfile BOOL] [--debug]
screen genome: error: the following arguments are required: --tax-id
INFO: Cleaning up image...

I'm curious whether NCBI's contamination filtering process for unknown species works with the method you suggested, i.e., identifying the group with the largest aggregate coverage and then re-running the filtering with that specific tax ID. Or, in the case of unknown species, does NCBI provide a general tax ID for the entire bacterial kingdom (Taxonomy ID: 2) to FCS-GX?

One concern I have about choosing a specific --tax-id is that FCS-GX might flag contigs or sequences as contaminants when they are actually a legitimate part of the species genome, just because they exhibit conservation with an already recognized species. Can this be a problem?

etvedte commented 8 months ago

Can you try with --tax-id 32644?

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=32644
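For reference, the retry suggested here can be assembled programmatically. This is a minimal sketch: the function name is hypothetical, the paths are the ones from the failing command earlier in the thread, and the flags are limited to those that appear in that command and its usage message. Whether --species unknown should still be passed alongside --tax-id 32644 is not stated in the thread, so it is left optional.

```python
import shlex
from typing import List, Optional


def build_screen_command(
    fasta: str,
    out_dir: str,
    gx_db: str,
    tax_id: int,
    image: str = "./fcs-gx.sif",
    species: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `fcs.py screen genome` (not executed here)."""
    cmd = [
        "python", "./fcs.py",
        "--image", image,
        "screen", "genome",
        "--fasta", fasta,
        "--out-dir", out_dir,
        "--gx-db", gx_db,
        "--tax-id", str(tax_id),  # required according to the usage message
    ]
    if species is not None:
        cmd += ["--species", species]
    return cmd


# Same inputs as the failing command, with the suggested tax-id added.
# "$GXDB/gxdb/" is only expanded if the printed line is run through a shell.
cmd = build_screen_command("fcsgx_test.fa.gz", "gx-out", "$GXDB/gxdb/", 32644)
print(shlex.join(cmd))
```

Building the arguments as a list keeps each flag next to its value, which makes it easy to swap in a different tax-id for a second run.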

Or, in the case of unknown species, does NCBI provide a general tax ID for the entire bacterial kingdom (Taxonomy ID: 2) to FCS-GX?

GX uses a discrete set of taxonomic divisions that are not hierarchical in nature.

Can this be a problem?

We have some safeguards in prokaryotes to prevent false positive contamination calls from other prokaryote groups. I'm interested in the results you get. I can provide feedback when you get a successful run.

etvedte commented 7 months ago

Closing this issue.