ncbi / fcs

Foreign Contamination Screening caller scripts and documentation
Other
101 stars 13 forks source link

[BUG]: Contam file is empty but output shows taxonomy of contaminates #82

Closed brysonab closed 4 months ago

brysonab commented 4 months ago

Hello FCS-GX Team!

The first step of the pipeline appears to be working as intended, but the second cleaning step outputs an empty contamination file. I am running this program on a large 3Gb plant genome.

First I run: python3 ./fcs.py screen genome --fasta genome.fasta --out-dir ./gx_out/ --gx-db /fcs-gx/db --tax-id 53176

I get the output "genome.fasta.53176.taxonomy.rpt" where the first few lines look like this: contig_17436_pilon 557103 381290,4387,540,0,0 68217 | Salvia splendens 180675 plnt:plants 61003 30180 524 | 4155 plnt:plants 61003 18903 473 | 3218 plnt:mosses 2552 2552 96 | 139649 anml:insects 5055 1172 44 | n/a repeat plnt:plants 11 contig_19147_pilon 361216 296806,2089,300,0,0 27583 | Sesamum indicum 4182 plnt:plants 26584 16219 353 | 4155 plnt:plants 26584 13755 334 | 56867 plnt:mosses 307 307 39 | 189291 anml:nematodes 791 748 35 | n/a repeat plnt:plants 7

I also get the command line summary output: Processed 3162 queries, 2932.32Mbp in 24916s. (0.117688Mbp/s); num-jobs:29293 Species : None Asserted div : plnt:plants Inferred primary-divs : ['plnt:plants', 'plnt:mosses'] Corrected primary-divs : ['plnt:plants', 'plnt:mosses'] Putative contaminant divs : [] Aggregate coverage : 11% Minimum contam. coverage : 53%

However, the rest of the command line summary does not report any numbers.

Next I run: zcat genome.fasta | python3 ./fcs.py clean genome --action-report ./gx_out/genome.53176.fcs_gx_report.txt --output genome_nocontam.fasta --contam-fasta-out contam.fasta

This is where my issue arises. The contam.fasta file is empty and the genome_nocontam.fasta has the entire genome. This does not make sense biologically, as I would expect endo and ephiphytic microbes to be contaminates at the very least. Can you please help me determine where I am going wrong?

Thanks!

To Reproduce python3 ./fcs.py screen genome --fasta genome.fasta --out-dir ./gx_out/ --gx-db /fcs-gx/db --tax-id 53176 zcat genome.fasta | python3 ./fcs.py clean genome --action-report ./gx_out/genome.53176.fcs_gx_report.txt --output genome_nocontam.fasta --contam-fasta-out contam.fasta

Software versions:

etvedte commented 4 months ago

Hello,

So my understanding here is that you expect to see some contamination in your assembly and genome.53176.fcs_gx_report.txt is empty aside from the header lines?

The aggregate coverage of this genome is low with respect to the database sequences is low (11%) but this isn't too surprising given the limited representation of large plant genomes in the database. This has the effect of requiring higher stringency for calling contaminants to minimize false positives (i.e. requires higher coverage to call contaminant sequences). So it is certainly possible that if there are contaminating microbes with modest coverage then GX would miss these.

Are there sequences you believe are real contaminants but GX has missed? Without knowing anything about your curation steps post-assembly, I would say you could look at the taxonomy.rpt file columns 32+33, particularly in smaller assembly sequences, to get a better assessment. If there are lots of inconclusive or low-coverage prok sequences, these could represent GX sensitivity issues and you could try some additional validation (e.g. BLAST, a second contamination detection tool) to gather more information.

It is also possible that your assembly is contaminant-free. I see you polished with pilon...did you any curation/filtering of sequences post-assembly?

Eric

etvedte commented 4 months ago

As a side note, please take a look at https://github.com/ncbi/fcs/wiki/FCS-GX-output to see the difference between the taxonomy report *.taxonomy.rpt and the action report *.fcs_gx_report.txt. The former has data for all sequences in the assembly whether assessed to be contaminant or not. The latter has only information for assigned contaminants.

brysonab commented 4 months ago

Hi Eric!

Thank you so much for the quick response.

I did do some things to clean up my genome, including running pilon to polish as well as repeat masker for removing sequences related to TE's. I would guess that there would be contaminate sequences from microbes on and in the leaf tissue I used for sequencing. However perhaps it is the low coverage as you suggested, not allowing GX to pick up on these contaminates.

Thank you for your suggestions, I will look into the smaller contigs and maybe try another software.

Thanks again! Abby

etvedte commented 4 months ago

Hi Abby,

Sometimes what is likely to be a biological contaminant doesn't end up getting sequenced or assembled, but nevertheless it's good that you are erring on the side of precaution.

I'm going to close this for now, but please re-open if you believe FCS-GX is missing things that you believe should definitely be contaminating sequences.

Eric