[DOC SUGGESTION]: Not clear what `cleaned_sequences` is in FCS tool

conchoecia commented 11 months ago

Thanks for putting this tool online! It's nice to be able to fix our genomes before uploading them to NCBI and catching the problems then.

I ran the FCS adapter tool (run_fcsadaptor.sh) on a genome and got plenty of adapters to trim out. I then noticed the cleaned_sequences directory and saw that there is a copy of the genome there. This genome, however, doesn't have the adapter sequences replaced with Ns (I think this should be the default behavior), and when I faidx out the regions that are listed in the contamination report I find that there are still PacBio Blunt Adapter sequences still there:

TATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTAGAGAGAGATACAGTC

So I'm wondering what is being done to the fasta file in cleaned_sequences? For now I think I will just trim the genome with a custom script.

etvedte commented 11 months ago

Hello,

This genome, however, doesn't have the adapter sequences replaced with Ns...when I faidx out the regions that are listed in the contamination report I find that there are still PacBio Blunt Adapter sequences still there

From this description I am guessing that the PacBio adaptor hits are internal contig hits and not near contig ends, else they should have been removed automatically. The contamination report should have sequences labeled ACTION_TRIM (not ACTION_EXCLUDE). Is this correct? Can you post a few rows of the contamination report to verify? See: https://github.com/ncbi/fcs/wiki/FCS-adaptor#rules-for-action-assignment

I think this should be the default behavior

We have received feedback about masking and hope to incorporate it into a future release (See: https://github.com/ncbi/fcs/wiki/FCS-adaptor#known-issues-and-future-improvements). One potential issue with setting hardmask as default is that contigs joined by an adaptor could be a false contig join and it would be better to split into two separate contigs. We do appreciate this feedback and can follow up when this feature is added.

Are there a lot of hits in the contamination report? If so, may want to consider a read adaptor trimming tool then re-assemble. If there are few, you can manually inspect the regions to decide on an appropriate action.

conchoecia commented 11 months ago

I am guessing that the PacBio adaptor hits are internal contig hits and not near contig ends,

Yep, that's right! Here are the first few lines:

#accession      length  action  range   name
Scaffold10      158694913       ACTION_TRIM     15988406..15988522,38401399..38401514,50213067..50213174,64581896..64582005,77408842..77408950,81014300..81014414,82834540..82834647,112891900..112891942,131379424..131379534  CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00972.1:Pacific Biosciences Blunt Adapter
Scaffold11      148686071       ACTION_TRIM     13529338..13529379,63654492..63654602,91760996..91761100,115538074..115538181,131923719..131923837      CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00972.1:Pacific Biosciences Blunt Adapter

I counted something on the order of 200 adapters nested in the scaffolds: cat fcs_adaptor_report.txt | cut -f4 | sed 's/./\0\n/g' | grep ',' | wc -l

One potential issue with setting hardmask as default is that contigs joined by an adaptor could be a false contig join and it would be better to split into two separate contigs

Yes, I totally get it! In this case in general I trust the assembly due to our DNAse Hi-C data. But, I understand that making the blanket behavior to hard-mask is not a good thing in most cases since it could be abused easily.

I agree that it would be best to trim the reads first then reassemble, but in this case someone gave me a completed assembly. I will just break the scaffolds at the problematic points then rescaffold. Thanks!

EDIT: (I just checked and it seems like making a PR to a wiki is not possible, I was just going to add something about the cleaned fasta to the FCS-adaptor output page)

etvedte commented 11 months ago

If you attach your suggested edits (paste in comment box or attach text file) we will review and make needed changes. Thanks!

conchoecia commented 11 months ago

@etvedte I've written a script to parse the report file and split the input assembly at the internal trim sites. Would you mind taking a look? https://github.com/conchoecia/genome_assembly_pipelines/blob/master/scripts/NCBI_fcs_adapter_break.py

One of my concerns is that I didn't write the indexing properly converting from the coordinate system that the report file uses to slicing the python strings: here and here, for example.

etvedte commented 11 months ago

A brief review of the code from our team confirmed this could work to split scaffolds at the designated points, following the convention that FCS-adaptor uses in the 1-based coordinate system in contamination reports. You should double check the output against the known PacBio SMRTbell sequence. The sequence in our db:

ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT

Note that we currently don't support barcoded PacBio adaptors so if you know you used a barcoding kit with additions to the default loop-stem we won't pick them up, but you would be able to modify your coordinate system appropriately to deal with this.

Also we will update our wiki to let future users know that the reports are 1-based coordinates.

etvedte commented 7 months ago

There are now instructions to mask vs. split on adaptor matches in the FCS-adaptor wiki.

ncbi / fcs

[DOC SUGGESTION]: Not clear what `cleaned_sequences` is in FCS tool #46