ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

[BUG]: Differences between FCS-adapter contaminant descriptions and UniVec? #102

Open cjfields opened 1 week ago

cjfields commented 1 week ago

Describe the bug FCS description for contaminant is not present in UniVec

To Reproduce

Not applicable, FCS run completed. Question is with output and missing data from UniVec database

Software versions (please complete the following information):

Log Files Not applicable, FCS run completed. Question is with output and missing data from UniVec database

Additional context Thanks for a really useful screening tool! I had one simple question that has more to do with how the FCS adapter database is generated from the UniVec database. We are primarily interested because we'd like to ensure we're prescreening libraries before they go into assemblies, but we are likely not capturing the sequences we need from UniVec.

As an example, we have the following output from fcs-adapter for a few recent test assemblies (marked up for this):

accession length action range name
ptg001017l 243162 ACTION_TRIM 195461..195500 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00972.1:Pacific Biosciences Blunt Adapter
s300.ctg000354l 5041 ACTION_TRIM 4771..5041 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
s0.ctg000001l 197550 ACTION_TRIM 197220..197246 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00577.1:CLONTECH 3'-RACE CDS Primer A polyT masked, contains PacBio ULI adapter subsequence
s1336.ctg002180l 7757 ACTION_TRIM 7730..7757 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
s576.ctg000580l 44462 ACTION_TRIM 44328..44462 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
ptg000002l 197109 ACTION_TRIM 305..331 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00577.1:CLONTECH 3'-RACE CDS Primer A polyT masked, contains PacBio ULI adapter subsequence
ptg000200l 95473 ACTION_TRIM 95394..95473 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
ptg000770l 44454 ACTION_TRIM 44373..44454 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence

In the last column, many of the descriptions mention PacBio ULI, which suggests this is the likely culprit for these samples.

However, in UniVec, the important part of that description is missing. For example:

$ grep NGB00972.1 UniVec
>gnl|uv|NGB00972.1:1-45 Pacific Biosciences Blunt Adapter

$ grep NGB00596.1 UniVec
>gnl|uv|NGB00596.1:1-52 Evrogen Mint CDS-Gsu adapter

$ grep NGB00577.1 UniVec
>gnl|uv|NGB00577.1:1-57 CLONTECH 3'-RACE CDS Primer A

Notice in the last two it's missing the important part: contains PacBio ULI adapter subsequence. We can use these results to help set up screening, but any idea where that more complete description is coming from?
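
The comparison above can be automated. This is a minimal sketch, assuming the last report column always has the form `SOURCE_TYPE:ACCESSION:description` and UniVec headers look like `>gnl|uv|ACCESSION:start-end description`, as in the examples in this thread. The function names are hypothetical and not part of FCS.

```python
def parse_fcs_name(name):
    """Split the last fcs-adapter report column into its three parts."""
    source_type, accession, description = name.split(":", 2)
    return source_type, accession, description


def parse_univec_header(header):
    """Return (accession, description) from a UniVec FASTA header line."""
    seq_id, _, description = header.lstrip(">").partition(" ")
    # seq_id looks like "gnl|uv|NGB00596.1:1-52"; keep only the accession.
    accession = seq_id.split("|")[-1].split(":")[0]
    return accession, description


def extra_label(fcs_name, univec_headers):
    """Return the text FCS adds beyond the UniVec description.

    Returns None if the accession is absent from the given headers, and the
    full FCS description if it does not start with the UniVec description.
    """
    _, accession, fcs_desc = parse_fcs_name(fcs_name)
    univec = dict(parse_univec_header(h) for h in univec_headers)
    base = univec.get(accession)
    if base is None:
        return None
    if fcs_desc.startswith(base):
        return fcs_desc[len(base):].strip()
    return fcs_desc


headers = [
    ">gnl|uv|NGB00972.1:1-45 Pacific Biosciences Blunt Adapter",
    ">gnl|uv|NGB00596.1:1-52 Evrogen Mint CDS-Gsu adapter",
]
name = ("CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu "
        "adapter polyT masked, contains PacBio ULI adapter subsequence")
print(extra_label(name, headers))
```

Running this over every row of a report would list which accessions carry FCS-only annotations.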

etvedte commented 1 week ago

Hello,

Thank you for the report.

UniVec is set up as a non-redundant database: when new sequences are added, they are checked against the current set of sequences, and fragments or entire sequences are discarded if they share spans with existing records. The process starts from an existing UniVec database and then adds new sequence records. The descriptions aren't synchronized because I had been pulling UniVec sequences into the FCS-adaptor database and afterwards added extra labels in response to reports from users who were confused by the results. In your example, you likely used the PacBio Ultra Low Input library (based on these results and the apparent PacBio SMRTbell blunt adaptor contamination), which shares some subsequences with those other UniVec records. As a result, FCS-adaptor might report top matches to NGB00596.1 / NGB00577.1, which was causing confusion for PacBio ULI users.

>gnl|uv|NGB03000.1:1-26 PacBio ULI gDNA amplification adapter
AAGCAGTGGTATCAACGCAGAGTACT

I will do a pass over the sequence headers in UniVec and see if anything else like this needs to be added. Hope this helps.

Eric
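
To illustrate the shared-subsequence point, here is a rough sketch that finds the longest exact span shared between the ULI adapter sequence quoted in this thread and another record. This is an assumption-laden simplification: it uses exact matching via dynamic programming, not the alignment-based comparison that UniVec or FCS-adaptor actually perform, and `other_record` is a made-up sequence for demonstration, not a real UniVec entry.

```python
def longest_shared_span(a, b):
    """Return the longest exact substring common to sequences a and b."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best span within a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous position pair.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_end = i
        prev = cur
    return a[best_end - best_len:best_end]


# ULI adapter sequence as quoted in this thread (NGB03000.1).
uli = "AAGCAGTGGTATCAACGCAGAGTACT"

# Hypothetical record embedding the first 20 nt of the ULI adapter.
other_record = "TTTT" + uli[:20] + "CCCC"

span = longest_shared_span(uli, other_record)
print(len(span), span)
```

A record sharing a long span like this could plausibly outscore the ULI adapter itself as the top match, which is the situation the added labels are meant to flag.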

cjfields commented 1 week ago

Thanks Eric, that definitely helps.

For context, these are from recent runs on Revio. Revio filtering seems to do a much better job of removing the standard blunt-end adapter (NGB00972.1) and C2 primer (NGB00973.1), with only a read or two sneaking through once in a while, but ULI sequences do appear to get through. In the vast majority of standard genome assemblies these appear to get tossed out if there is enough coverage; however, we've done a few PacBio metagenome assemblies, and this is where we see more hits using FCS-adapter. We're running a few local workflow tests to help mitigate this, for example setting up a BLASTN screen of the reads using vecscreen settings to see if we can find and trim/remove these prior to assembly.
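
A read-level screen like the one described might be driven as follows. This is a sketch only: the search parameters are the ones NCBI documents for VecScreen as best I recall them, so check them against the current VecScreen documentation before relying on them. The file names, the tabular output format, and the helper name `build_vecscreen_blastn_cmd` are assumptions for illustration, and the database is assumed to have been built with `makeblastdb` beforehand.

```python
import shlex


def build_vecscreen_blastn_cmd(reads_fasta, db, out_path):
    """Build a blastn argument list using VecScreen-style search settings."""
    return [
        "blastn",
        "-task", "blastn",
        "-query", reads_fasta,
        "-db", db,
        # VecScreen-style scoring and filtering (verify against NCBI docs).
        "-reward", "1",
        "-penalty", "-5",
        "-gapopen", "3",
        "-gapextend", "3",
        "-dust", "yes",
        "-soft_masking", "true",
        "-evalue", "700",
        "-searchsp", "1750000000000",
        # Tabular output is an assumption here, chosen for easy parsing.
        "-outfmt", "6",
        "-out", out_path,
    ]


# Hypothetical file names; reads must be FASTA, not BAM or FASTQ.
cmd = build_vecscreen_blastn_cmd("reads.fasta", "UniVec", "adapter_hits.tsv")
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```

Hits in the resulting table give read coordinates that can then be used to trim or drop reads before assembly.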