ncbi / fcs

Foreign Contamination Screening caller scripts and documentation
Other
101 stars 13 forks source link

[BUG]: FCS-adaptor doesn't find all vec contam in one run #47

Closed conchoecia closed 1 year ago

conchoecia commented 1 year ago

Describe the bug This is a separate issue that I found after I raised questions about documentation here https://github.com/ncbi/fcs/issues/46 .

The bug is that FCS-adaptor doesn't find all of the contaminating vector sequences in the first round of searches. It seems to me like it fails in the case where there are two vector sequences close to one another, and especially if there are partial vector sequences.

To Reproduce Step 1. I ran FCS-adaptor on a genome someone sent me. I ran this script on the genome + the vector contamination report to break the scaffolds at, and exclude, the vector contamination sequences.

Step 2. To verify that I had removed everything I ran FCS-adaptor again on the output of step 1. The result was that the tool found adapter sequences in four scaffolds.

I wanted to know why it failed, so I compared the two results files. In file 2 I found this entry, then I saw that in file 1 there was a nearby entry:

# entry in file 2
Scaffold26::piece1      3619672 ACTION_TRIM     3619635..3619672        CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00972.1:Pacific Biosciences Blunt Adapter
# entry in file 1 
Scaffold26      110670011       ACTION_TRIM     3619672..3619716

These two regions are close to one another, so I extracted this and the flanking sequence:

>Scaffold26:3619500-3620000
ACTGCTCTGATAAGCTTCATTCCAAGGTGCCCCAGCTGTTCATCTAGAGGTACTAAACAA
AATCAGAAGAAGGATTTGTAATATAATCAGTCCTGAATTGGCCTCTCGACTTCAGATACT
TTCTCACTGATGTGATGTTGTTGTTGAGAGAGATAACGCAGCGCTCGACTGTATCTCTCT
CTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGATGAATTTAGTGTGCTACTAGATTG
GCATCTAGGTCCCACCCTTTTACAGTTGAAATAGTTAAGTGTAACCAGCACTTCTACAAC
AGAAGTTTCTTTGCTCACTGTTGTGGAAATCTCTCCCATTGCCTTGCTTCCCTGATAACT
ACCATCTTCAAAAATTTAAATGTAACATCAATCGTCATCTTATGTTGCCTTGATCACTGT
TCTTTTCTCCCTTTCCACATATTCCTCTATGCCTGCAATTTTCTCTTACCTACAATCTCC
ATAACCCCCTTCCTGCGCACT

I didn't find two complete PacBio adapters there, so I just extracted the sequences that FCS-adaptor identified, plus the Pacbio adapter you pasted here, and aligned them.

ROUND 2 Scaffold26:3619635-3619672      --------------------------TGTTGTTGTTGAGAGAGATAACGCAGCGCTCGAC    34
ROUND 1 Scaffold26:3619672-3619716      ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT---------------    45
        PacBioBluntAdapter              ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT---------------    45
                                                          *******************               

ROUND 2 Scaffold26:3619635-3619672      TGTA    38
ROUND 1 Scaffold26:3619672-3619716      ----    45
        PacBioBluntAdapter              ----    45

So, what I found is that the tool finds the best local match in the first round, then in the second round the tool found a 19-mer match of the Pacbio sequence adapter. I'm not sure if this is by design, but it seemed worth reporting in that many people may not try to run the tool more than once.

The other sequences I found were these:

CASE 2:

Case 2 was a Pacbio adapter that had some As in the middle.

# entry in fcs-adaptor results
Scaffold35::piece2      11198429        ACTION_TRIM     11198360..11198429      CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00972.1:Pacific Biosciences Blunt Adapter

# sequence
>Scaffold35::piece2:11198360-11198429
ATCTCTCTCAGACAACAACAACGGAGGAAGGGAGGAAAAAGAGAGAGATACAGTCGAGCG
CTGGCGTCCT

# alignment with the pacbio adapter
Scaffold35::piece2:11198360-11198429      --ATCTCTCTCAGACAACAACAACGGAGGAAGGGAGGAAAAAGAGAGAGATACAGTCGAG  58
PacBioBluntAdapter                        ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTG------AGAGAGAT---------  45
                                             *******    *  *  *  **  *     *  *      ********         

Scaffold35::piece2:11198360-11198429      CGCTGGCGTCCT  70
PacBioBluntAdapter                        ------------  45

CASE 3:

Not sure what happened here.

# entry in fcs-adaptor results
Scaffold40::piece3      7321387 ACTION_TRIM     7321349..7321387        CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00972.1:Pacific Biosciences Blunt Adapter

# sequence
>Scaffold40::piece3:7321349-7321387
GGAGGAGGAGGAAAAGAGAGAACGCAGCGCTCGACTGTA

# alignment with the pacbio adapter
Scaffold40::piece3:7321349-7321387      -----------------------GGAGGAGGAGGAAAAGAGAGAACGCAGCGCTCGACTG    37
PacBioBluntAdapter                      ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT---------------    45
                                                                *  *  *  *   *******                

Scaffold40::piece3:7321349-7321387      TA  39
PacBioBluntAdapter                      --  45

CASE 4:

Partial matches?

# entry in fcs-adaptor results
Scaffold8::piece4       2378747 ACTION_TRIM     2378680..2378747        CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00972.1:Pacific Biosciences Blunt Adapter

# sequence
>Scaffold8::piece4:2378680-2378747
TCTCTCAACAACAACAACGGAGGGGAGGGGAGGAAAAGAGAGAGATACAGTCGAGCGCTG
CGACTGTA

# alignment with the pacbio adapter

Scaffold8::piece4:2378680-2378747      ---TCTCTCAACAACAACAACGGAGGGGAGGGGAGGAAAAGAGAGAGATACAGTCGAGCG 57
PacBioBluntAdapter                     ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTT----GTTGAGAGAGAT----------- 45
                                          ******     *  *  *   *  *  *         *********           

Scaffold8::piece4:2378680-2378747      CTGCGACTGTA  68
PacBioBluntAdapter                     -----------  45

Software versions (please complete the following information):

etvedte commented 1 year ago

You're right, these seem suspicious. Did you try reverse-complementing the query sequence?

I can look into these next week.

conchoecia commented 1 year ago

I didn't try to revcomp anything, no, I just ran the tool twice to see if it picked up anything else on the second run that it missed on the first. Thanks!

etvedte commented 1 year ago

Hello,

I looked through these hits and believe I've found the likely source of adaptor reporting on your split sequences:

Cases 2, 3 and 4 produce stronger matches when the query is reverse-complemented and aligned against the PacBio adaptor. Cases 2 and 4 have longer, stronger matches vs Case 3. I'd recommend using the NCBI web VecScreen tool https://www.ncbi.nlm.nih.gov/tools/vecscreen/ to visualize matches. It's similar to FCS-adaptor, but uses slightly different databases and doesn't do any cleaning. They both have the same PacBio adaptor for now.

Case 1 might be related to the other cases but also benefits from having the larger sequence context and the before/after reports. In round 1: Scaffold26:3619672-3619716 produces a strong match whereas the 19-mer upstream is too weak to be reported. But in round 2, since the 19-mer is seen by FCS-adaptor as being near the end of a contig (since the sequences were just split) whereas before it was in an internal span, the scoring threshold for reporting becomes less stringent allowing the 19-mer to be a newly called contaminant span.

While this explains the new calls in Cases 1 and 3, what I don't quite understand is why the relatively strong matches in Cases 2 and 4 were not found in round 1 of screening. Can you post the larger sequence spans corresponding to these regions like you did with Case 1, including the spans called contaminant in rounds 1 and 2 with a small sequence buffer on both ends?

If these hits from round 2 are noise it might go to explain how the adaptor sequences were assembled here if they share some sequence similarity with nearby regions. An additional step you could try would be to map reads (preferably non- PacBio) to the regions and see what sort of read support you get.

tsa4a12 commented 1 year ago

According to trinity_community_codebase (https://github.com/trinityrnaseq/trinity_community_codebase/wiki/Removing-adapters-from-Trinity-Transcriptome-assemblies) and their custom script, they also tried to develop a screening tool similar to FCS-adaptor and it says

"

   -------------------------->  transcript
   -A> -B>                      two copies of the adapter

The blastn command only returns a single alignment. If that is the A alignment the B adapter will still be present. If this happens run another cycle of this script. Note, that usually happens, expect to run two cycles. (The example above was from an earlier version of the script, current versions remove all of the type (3) contaminants in the first pass, and type (2) contaminants are rare.) "

So i assume the same goes for this tool, for my recent experience (561Mb), the FCS-adaptor tool removed 13486 adapters on the first run and 245 on the second run. So according to the decay trend it seems that 2 runs should be enough in most cases, but unlucky runs may need three iterations of the workflow (and painful submission to the TSA for checking). There might also be need to manually remove internal matches..

etvedte commented 1 year ago

Hello,

Thank you for this additional comment. I'm not familiar with this custom Trinity pipeline, but indeed the takeaway here is that there are some situations where FCS-adaptor will identify additional spans to TRIM/EXCLUDE in a second run of the pipeline. This could be due to concatenated copies of the same adaptor or multiple adaptors. We do have logic in place to handle multiple adaptor spans in close proximity, but it may not identify all cases.

I also want to reiterate the point that you should look at the set of adaptors that are being reported in a second run of the tool. If the hits are from the same adaptor or same adaptor family as what was identified in the first run, you can be more confident of the results.

There might also be need to manually remove internal matches..

We are currently working on a solution to better handle internal matches.