ncbi / fcs

Foreign Contamination Screening caller scripts and documentation
Other
88 stars 12 forks source link

Is there a way to use FCS-GX before the assembly step, especially if the input data contains a high proportion of contaminant sequences? #39

Closed ChathumadaviE closed 1 year ago

ChathumadaviE commented 1 year ago

Hi

I'm using a diploid organism for genome assembly and when used the Genomescope tool on Galaxy to evaluate the hifi reads, it created an odd plot than the expected two peak plot. I used FCS-GX and FCS adaptor to clean the sequences after the initial round of assembly using hifiasm assembler. It helped removing many non-target sequences, although I'm wondering if I can use this tool suite to remove contaminant sequences from the reads before assembling in to contigs.

Thanks!

murphyte commented 1 year ago

although I'm wondering if I can use this tool suite to remove contaminant sequences from the reads before assembling in to contigs.

For HiFi, I expect so but we haven't tried it.

GX's ability to align a query sequence to its database is dependent on length, sequence quality, and how closely related it is to what's in the database (which correlates with sequence identity). A high indel frequency combined with shorter reads would be particularly problematic. GX's performance is dependent on bases to screen + also a cost per sequence.

What that means is:

  1. screening Illumina reads would likely only detect high identity contaminants, and also be slow (lots of bases all in very short sequences). It could still be effective
  2. screening 454 reads would be worse because of the short length + indels
  3. older PacBio would also have reduced sensitivity because of the indels
  4. HiFi isn't much different than finished contigs from a lot of Illumina assemblies, except for higher numbers. It should work.

One thing that I'm not sure how to predict is the behavior of the initial repeat identification step. That identifies high-frequency k-mers found within the submission looking at longer sequences (>100 kb) or those higher than the N80 length. It may wind up considering the entire batch of reads in scope for repeat identification. So if you have say a lot of virus contamination, it may all get masked and not identified as contamination. You can control that with the fcs.py screen genome --mask-transposons=F option. That said, I would probably try with the default to start.

One issue is FCS-GX will need FASTA as input, so you either need an assembler that doesn't require FASTQ, or you'd need to filter out the contaminant reads through some other means.

And note filtering at the read level could cause problems for rare organisms with real lateral gene transfer (LGT). i.e. if your organism has a real insertion of say 50 kb of Wolbachia in a chromosome, and you filter out contaminants first, then the Wolbachia-only reads will get dropped and you'll get a gap in your chromosome. Screening after assembly you'd see that as INFO (if it's Wolbachia or Rickettsia, which are someone more common LGT cases) or as FIX or TRIM indicating a chimeric sequence, where you could then evaluate if it's an assembly error in more detail. Real LGT is pretty rare, so it's unlikely to affect you, but it's something to keep in mind.

Do let us know if it works out for you. I've wondered if it would work for a while, but just haven't gotten around to trying it.

-Terence

ChathumadaviE commented 1 year ago

Hi Terence,

Thanks for your response.

HiFi isn't much different than finished contigs from a lot of Illumina assemblies, except for higher numbers. It should work. Do you mean Illumina assemblies are similar to hifi read level? Only thing is that the input has to be in FASTA format, not FASTQ.

One issue is FCS-GX will need FASTA as input, so you either need an assembler that doesn't require FASTQ, or you'd need to filter out the contaminant reads through some other means.

I already followed the above method, where I used initial assembly in FASTA format as the input, and it identified a lot of contaminant sequences present in the initial assembly.

so you either need an assembler that doesn't require FASTQ

I didn't get this part. What do you mean by an assembler that doesn't require FASTQ?

murphyte commented 1 year ago

I didn't get this part. What do you mean by an assembler that doesn't require FASTQ?

fcs.py screen genome will give you the report of which reads are contaminant. You could then use fcs.py clean genome to remove them from your input, but that output would be FASTA. So you'll likely need to use some other means to filter the bad reads out of the data you're feeding to the assembler, if it needs FASTQ input. That's all I meant.

etvedte commented 1 year ago

Hello,

Another option you could consider (either standalone or to support the assignments from a GX run directly on reads) is mapping reads to the contaminated genome, identifying reads that map to contaminant spans, filter those out, then re-assemble with "clean" reads. You would then want to run GX on the newly assembled contigs. This process would be more straightforward if most or all of the GX contamination calls are non-chimeric (e.g. EXCLUDE).

As mentioned above, we would be interested to hear about performance on HiFi reads.

ChathumadaviE commented 1 year ago

That is a good idea, Thanks! Would it be easier to use the clean genome in place of the contaminated genome to map the reads, since the mapping reads will be "clean" reads? And then extract only the mapped reads. Is there going to be a significant difference between the two approaches?

murphyte commented 1 year ago

Would it be easier to use the clean genome in place of the contaminated genome to map the reads, since the mapping reads will be "clean" reads? And then extract only the mapped reads.

With that approach, if any correct sequence was missing from your initial assembly but still found in the reads, it would get discarded. I would keep that in mind for any approach, and favor only discarding reads that have been positively identified as contaminant.

For the contaminants identified by FCS-GX in your initial assembly, what were they (prok, fungi?) and what were the typical coverage values (column 7 in the report)? Some contaminants can be novel organisms which are harder to detect, which will show up as lower coverage values for those sequences that are reported, and that may mean some contamination is still being missed. Also, what was the aggregate alignment coverage reported (see run-info agg-cvg in the report header)? That's mostly a reflection of the organism you're sequencing and how similar it is to what's in the GX database, and has an effect on sensitivity.

ChathumadaviE commented 1 year ago

Thank you for your feedback! After mapping the reads to the contaminated genome, how can I use the taxonomy report information to identify the reads mapped to contaminant spans? Is there a tool like samtools I can use to identify those reads and extract or do I have to manually remove reads?

Most of the contaminants were prok with few being fungi. Typical coverage values ranged from 80 - 100 for "Exclude" action and for "Review" it was between 13-30. "agg-cvg" value was 0.360642. Asserted-div was a crustacean and inferred-primary-divs were prok:b-proteobacteria, prok:g-proteobacteria, prok:a-proteobacteria, prok:high GC Gram, prok:d-proteobacteria, prok:proteobacteria.

Thanks!

ChathumadaviE commented 1 year ago

PS: Does the contam.fasta file only includes contamination sequences and regions OR does it contain the whole sequence which houses the contaminated region in a small region? If the contam.fasta only contains the contaminated regions /sequences, I can map the reads to that file and filter out the bad reads.

etvedte commented 1 year ago

See the description of contam.fasta here.

If the contam.fasta only contains the contaminated regions /sequences, I can map the reads to that file and filter out the bad reads.

If mapping > filtering is your preferred strategy, I would suggest a modified protocol. The contam.fasta file does not have any sequences with GX assignments TRIM,FIX,REVIEW. Do you have any TRIM or FIX? If the answer is yes, then there are coordinates in your assembly, potentially with mapped reads, that aren't in contam.fasta

You mentioned above there are REVIEW cases. I would do some BLAST searches or other evaluations to see if you think these are true contaminants or false positives. Since the GX aggregate coverage of this crustacean is low, we require a higher coverage threshold to call something as confidently contaminant.

I would also map to the entire original assembly (not just the contaminants) and then use samtools with the -L parameter to output alignments to clean regions and keep those reads or output alignments to contaminated regions and filter those out. Note the GX action report uses 1-based coordinates so you need to subtract 1 from column 2 in that file to get BED format.

ChathumadaviE commented 1 year ago

Thanks for your suggestions, really appreciate it!

Do you have any TRIM or FIX? contam.fasta only contains EXCLUDE and REVIEW. I will look into the REVIEW sequences as you suggested.

Do you prefer a better strategy to accomplish this task than mapping > filtering?

I'm currently mapping the contam.fasta against the reads to filter out the bad reads. Will give a try on mapping to the entire assembly as well.