ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

[FEATURE REQUEST]: Using FCS-GX to filter long-read contaminations? #58

Closed BitaoQiu closed 7 months ago

BitaoQiu commented 8 months ago

Is this a feature request for FCS-adaptor or FCS-GX?

For both

Describe the problem you'd like to be solved

I wonder whether it is possible to use FCS-GX to filter potential contamination directly from the raw reads. FCS-adaptor could likewise be used to trim adaptors from the raw reads.

Describe the solution you'd like

As HiFi and nanopore reads are becoming common these days, it would be a good feature to filter potential contamination and trim adaptors before genome assembly.

Describe alternatives you've considered

I could also convert the FASTQ files to FASTA files, but I would lose the quality information ...

etvedte commented 7 months ago

Hello,

While it is possible to run FCS tools on reads in FASTA format, this would be considered non-standard/off-label use. We do not currently have plans to support FASTQ input, and we haven't tested FCS on reads.
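For anyone who wants to try this off-label route, the FASTQ-to-FASTA conversion could be sketched as below. This is a minimal illustration, not part of FCS: the function name `fastq_to_fasta` is hypothetical, and the code assumes simple four-line FASTQ records (sequence and quality not wrapped across lines).

```python
import io


def fastq_to_fasta(fastq, fasta):
    """Write the ID and sequence of each 4-line FASTQ record as FASTA.

    `fastq` and `fasta` are open text handles. Returns the record count.
    Assumption: each record is exactly four lines (no wrapped sequences).
    """
    count = 0
    for header in iter(fastq.readline, ""):
        if not header.strip():
            continue  # tolerate blank lines between records or at the end
        if not header.startswith("@"):
            raise ValueError(f"expected a FASTQ header, got: {header!r}")
        sequence = fastq.readline().rstrip("\n")
        fastq.readline()  # '+' separator line, not needed for FASTA
        fastq.readline()  # quality line, dropped in FASTA
        fasta.write(">" + header[1:].rstrip("\n") + "\n")
        fasta.write(sequence + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory demonstration with one made-up read.
    demo_in = io.StringIO("@read1 example\nACGT\n+\nIIII\n")
    demo_out = io.StringIO()
    fastq_to_fasta(demo_in, demo_out)
    print(demo_out.getvalue(), end="")
```

The quality values are discarded here, so the original FASTQ file should be kept if the reads will later be filtered and assembled.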

For adaptor trimming, tools are already available for working with raw reads. The best ones use known error profiles of the sequencing technologies to make trim calls. FCS-adaptor is intended to screen assembled genomes for missed adaptors.

GX is sensitive to indels, so it may not work well with raw nanopore reads. HiFi reads are probably good enough. By default, GX runs on eukaryote genomes include a masking step based on repeat content in the underlying sequence, so contaminants present at a high enough frequency in the reads would be masked and missed. We would therefore recommend turning the masking OFF, although this would also be more likely to introduce false-positive contamination calls. We are currently working on a new GX release that allows masking to be turned OFF, so stay tuned.

We recommend screening the final assembly with GX, so contaminants that were missed at the read stage but identified at the assembly stage would not be a big issue. Conversely, imagine a scenario where reads representing a lateral gene transfer (LGT) are erroneously labeled as "contaminant" and removed because there isn't enough context in the surrounding sequence to indicate an integration event.

For file format conversion, you could use seqkit or something similar to subset a FASTQ file for the sequences you want to keep based on the contamination results.
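If seqkit is not at hand, the same subsetting idea can be sketched in plain Python. This is only an illustration under stated assumptions: `filter_fastq` and `read_id` are hypothetical helper names, `drop_ids` is a set of read IDs you have extracted yourself from the contamination results (the report format is not parsed here), and records are assumed to be four lines each.

```python
import io


def read_id(header):
    """Return the read ID: the text after '@' up to the first whitespace."""
    parts = header[1:].split()
    return parts[0] if parts else ""


def filter_fastq(fastq, out, drop_ids):
    """Copy FASTQ records whose ID is not in `drop_ids`, qualities included.

    `fastq` and `out` are open text handles; `drop_ids` is a set of read IDs
    flagged as contaminants (an assumption: you build this set yourself).
    Returns a (kept, dropped) tuple of record counts.
    """
    kept = dropped = 0
    for header in iter(fastq.readline, ""):
        if not header.strip():
            continue  # tolerate blank lines
        if not header.startswith("@"):
            raise ValueError(f"expected a FASTQ header, got: {header!r}")
        # Read the remaining three lines: sequence, '+' separator, quality.
        record = [header, fastq.readline(), fastq.readline(), fastq.readline()]
        if read_id(header) in drop_ids:
            dropped += 1
        else:
            out.writelines(record)
            kept += 1
    return kept, dropped


if __name__ == "__main__":
    # In-memory demonstration with two made-up reads; 'read2' is flagged.
    reads = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTTT\n+\nIIII\n")
    clean = io.StringIO()
    print(filter_fastq(reads, clean, {"read2"}))
```

Because the records are copied unchanged, the quality information survives the filtering step even though the screen itself ran on FASTA.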

etvedte commented 7 months ago

Closing this issue.