Closed conchoecia closed 7 months ago
Hello,
This genome, however, doesn't have the adapter sequences replaced with Ns...when I faidx out the regions that are listed in the contamination report I find that there are still PacBio Blunt Adapter sequences still there
From this description I am guessing that the PacBio adaptor hits are internal contig hits and not near contig ends, else they should have been removed automatically. The contamination report should have sequences labeled ACTION_TRIM (not ACTION_EXCLUDE). Is this correct? Can you post a few rows of the contamination report to verify? See: https://github.com/ncbi/fcs/wiki/FCS-adaptor#rules-for-action-assignment
I think this should be the default behavior
We have received feedback about masking and hope to incorporate it into a future release (See: https://github.com/ncbi/fcs/wiki/FCS-adaptor#known-issues-and-future-improvements). One potential issue with setting hardmask as default is that contigs joined by an adaptor could be a false contig join and it would be better to split into two separate contigs. We do appreciate this feedback and can follow up when this feature is added.
Are there a lot of hits in the contamination report? If so, may want to consider a read adaptor trimming tool then re-assemble. If there are few, you can manually inspect the regions to decide on an appropriate action.
I am guessing that the PacBio adaptor hits are internal contig hits and not near contig ends,
Yep, that's right! Here are the first few lines:
#accession length action range name
Scaffold10 158694913 ACTION_TRIM 15988406..15988522,38401399..38401514,50213067..50213174,64581896..64582005,77408842..77408950,81014300..81014414,82834540..82834647,112891900..112891942,131379424..131379534 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00972.1:Pacific Biosciences Blunt Adapter
Scaffold11 148686071 ACTION_TRIM 13529338..13529379,63654492..63654602,91760996..91761100,115538074..115538181,131923719..131923837 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00972.1:Pacific Biosciences Blunt Adapter
I counted something on the order of 200 adapters nested in the scaffolds: cat fcs_adaptor_report.txt | cut -f4 | sed 's/./\0\n/g' | grep ',' | wc -l
One potential issue with setting hardmask as default is that contigs joined by an adaptor could be a false contig join and it would be better to split into two separate contigs
Yes, I totally get it! In this case in general I trust the assembly due to our DNAse Hi-C data. But, I understand that making the blanket behavior to hard-mask is not a good thing in most cases since it could be abused easily.
I agree that it would be best to trim the reads first then reassemble, but in this case someone gave me a completed assembly. I will just break the scaffolds at the problematic points then rescaffold. Thanks!
EDIT: (I just checked and it seems like making a PR to a wiki is not possible, I was just going to add something about the cleaned fasta to the FCS-adaptor output
page)
If you attach your suggested edits (paste in comment box or attach text file) we will review and make needed changes. Thanks!
@etvedte I've written a script to parse the report file and split the input assembly at the internal trim sites. Would you mind taking a look? https://github.com/conchoecia/genome_assembly_pipelines/blob/master/scripts/NCBI_fcs_adapter_break.py
One of my concerns is that I didn't write the indexing properly converting from the coordinate system that the report file uses to slicing the python strings: here and here, for example.
A brief review of the code from our team confirmed this could work to split scaffolds at the designated points, following the convention that FCS-adaptor uses in the 1-based coordinate system in contamination reports. You should double check the output against the known PacBio SMRTbell sequence. The sequence in our db:
ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT
Note that we currently don't support barcoded PacBio adaptors so if you know you used a barcoding kit with additions to the default loop-stem we won't pick them up, but you would be able to modify your coordinate system appropriately to deal with this.
Also we will update our wiki to let future users know that the reports are 1-based coordinates.
There are now instructions to mask vs. split on adaptor matches in the FCS-adaptor wiki.
Thanks for putting this tool online! It's nice to be able to fix our genomes before uploading them to NCBI and catching the problems then.
I ran the FCS adapter tool (
run_fcsadaptor.sh
) on a genome and got plenty of adapters to trim out. I then noticed thecleaned_sequences
directory and saw that there is a copy of the genome there. This genome, however, doesn't have the adapter sequences replaced with Ns (I think this should be the default behavior), and when I faidx out the regions that are listed in the contamination report I find that there are still PacBio Blunt Adapter sequences still there:TATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTAGAGAGAGATACAGTC
So I'm wondering what is being done to the fasta file in
cleaned_sequences
? For now I think I will just trim the genome with a custom script.