[BUG]: Differences between FCS-adapter contaminant descriptions and UniVec?

cjfields commented 1 week ago

Describe the bug FCS description for contaminant is not present in UniVec

To Reproduce

Not applicable, FCS run completed. Question is with output and missing data from UniVec database

Software versions (please complete the following information):

OS : CentOS
Docker or Singularity version : singularity v3.8.1
Docker or Singularity FCS image version : 0.5.4

Log Files Not applicable, FCS run completed. Question is with output and missing data from UniVec database

Additional context Thanks for a really useful screening tool! I had one simple question that has more to do with how the FCS adapter database is generated from the UniVec database. We are primarily interested because we'd like to ensure we're prescreening libraries before they go into assemblies, but we are likely not capturing sequences back from UniVec that we need.

As an example, we have the following output from fcs-adapter for a few recent test assemblies (marked up for this):

accession	length	action	range	name
ptg001017l	243162	ACTION_TRIM	195461..195500	CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00972.1:Pacific Biosciences Blunt Adapter
s300.ctg000354l	5041	ACTION_TRIM	4771..5041	CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
s0.ctg000001l	197550	ACTION_TRIM	197220..197246	CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00577.1:CLONTECH 3'-RACE CDS Primer A polyT masked, contains PacBio ULI adapter subsequence
s1336.ctg002180l	7757	ACTION_TRIM	7730..7757	CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
s576.ctg000580l	44462	ACTION_TRIM	44328..44462	CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
ptg000002l	197109	ACTION_TRIM	305..331	CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00577.1:CLONTECH 3'-RACE CDS Primer A polyT masked, contains PacBio ULI adapter subsequence
ptg000200l	95473	ACTION_TRIM	95394..95473	CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
ptg000770l	44454	ACTION_TRIM	44373..44454	CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence

In the last column, many of the accession descriptions mention PacBio ULI, which suggests this is the likely culprit for these samples.
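As a rough illustration of how the last column can be pulled apart, here is a minimal Python sketch. It assumes the report is tab-separated with the header shown above, and that the name column is always `SOURCE_TYPE:ACCESSION:description`; the helper names (`parse_report`, `count_uli_hits`) are hypothetical and not part of FCS. It also assumes a range cell may hold several comma-separated spans, which is a guess on my part rather than something shown in this report.

```python
from collections import Counter

# Two rows copied from the report above (tab-separated).
REPORT = (
    "accession\tlength\taction\trange\tname\n"
    "ptg001017l\t243162\tACTION_TRIM\t195461..195500\t"
    "CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00972.1:"
    "Pacific Biosciences Blunt Adapter\n"
    "ptg000002l\t197109\tACTION_TRIM\t305..331\t"
    "CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00577.1:"
    "CLONTECH 3'-RACE CDS Primer A polyT masked, "
    "contains PacBio ULI adapter subsequence\n"
)


def parse_report(text):
    """Yield one dict per data row, with the name column split into parts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    for line in lines[1:]:
        row = dict(zip(header, line.split("\t")))
        # Assumed layout of the name column: SOURCE_TYPE:ACCESSION:description
        source_type, univec_id, description = row["name"].split(":", 2)
        # Assumption: a cell may list several spans as "a..b,c..d".
        spans = []
        for span in row["range"].split(","):
            start, end = span.split("..")
            spans.append((int(start), int(end)))
        row.update(
            source_type=source_type,
            univec_id=univec_id,
            description=description,
            spans=spans,
        )
        yield row


def count_uli_hits(rows):
    """Count rows per UniVec ID whose description mentions PacBio ULI."""
    return Counter(
        r["univec_id"] for r in rows if "PacBio ULI" in r["description"]
    )
```

Calling `count_uli_hits(parse_report(REPORT))` tallies which UniVec records are carrying the ULI label, which is the pattern we noticed by eye in the full report.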

However, in UniVec, the important part of that description is missing. For example:

$ grep NGB00972.1 UniVec
>gnl|uv|NGB00972.1:1-45 Pacific Biosciences Blunt Adapter

$ grep NGB00596.1 UniVec
>gnl|uv|NGB00596.1:1-52 Evrogen Mint CDS-Gsu adapter

$ grep NGB00577.1 UniVec
>gnl|uv|NGB00577.1:1-57 CLONTECH 3'-RACE CDS Primer A

Notice in the last two it's missing the important part: contains PacBio ULI adapter subsequence. We can use these results to help set up screening, but any idea where that more complete description is coming from?
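To make the comparison concrete, a small Python sketch of the check we did by hand with grep. It assumes UniVec headers follow the `>gnl|uv|ID:start-end description` pattern seen above; the function names and the regular expression are my own illustration, not anything shipped with FCS or UniVec.

```python
import re

# Assumed header layout, based only on the three grep results above.
HEADER_RE = re.compile(
    r"^>gnl\|uv\|(?P<id>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)\s+(?P<desc>.*)$"
)


def univec_descriptions(fasta_lines):
    """Map UniVec ID -> description, keeping the first header seen per ID."""
    descs = {}
    for line in fasta_lines:
        m = HEADER_RE.match(line.rstrip("\n"))
        if m:
            descs.setdefault(m["id"], m["desc"])
    return descs


def extra_label(fcs_description, univec_description):
    """Return text in the FCS description beyond the UniVec one, else None."""
    if fcs_description.startswith(univec_description):
        extra = fcs_description[len(univec_description):].strip()
        return extra or None
    return None


# Headers copied from the grep output above.
UNIVEC_HEADERS = [
    ">gnl|uv|NGB00972.1:1-45 Pacific Biosciences Blunt Adapter",
    ">gnl|uv|NGB00596.1:1-52 Evrogen Mint CDS-Gsu adapter",
    ">gnl|uv|NGB00577.1:1-57 CLONTECH 3'-RACE CDS Primer A",
]

# Descriptions copied from the fcs-adapter report above.
FCS_DESCRIPTIONS = {
    "NGB00972.1": "Pacific Biosciences Blunt Adapter",
    "NGB00596.1": "Evrogen Mint CDS-Gsu adapter polyT masked, "
                  "contains PacBio ULI adapter subsequence",
    "NGB00577.1": "CLONTECH 3'-RACE CDS Primer A polyT masked, "
                  "contains PacBio ULI adapter subsequence",
}

univec = univec_descriptions(UNIVEC_HEADERS)
extras = {
    uid: extra_label(desc, univec[uid])
    for uid, desc in FCS_DESCRIPTIONS.items()
}
```

Here `extras` holds, per record, whatever text appears only on the FCS side, so records with no added label come back as `None`.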

etvedte commented 1 week ago

Hello,

Thank you for the report.

UniVec is set up as a non-redundant database: when new sequences are added, they are checked against the current set of sequences, and fragments or entire sequences are discarded if they share spans with existing records. The process starts from an existing UniVec database and then adds new sequence records. The reason these aren't synchronized is that I had been pulling UniVec sequences into the FCS-adaptor database and afterwards added extra labels in response to reports from users who were confused by the results. In your example, you likely used the PacBio Ultra Low Input (ULI) library (based on these results and the apparent PacBio SMRTbell blunt adaptor contamination), whose adapter shares some subsequences with those other UniVec records. FCS-adaptor might therefore report top matches to NGB00596.1 / NGB00577.1, which was causing confusion for PacBio ULI users.

>gnl|uv|NGB03000.1:1-26 PacBio ULI gDNA amplification adapter
AAGCAGTGGTATCAACGCAGAGTACT

I will do a pass over the sequence headers in UniVec and see if anything else like this needs to be added. Hope this helps.

Eric

cjfields commented 1 week ago

Thanks Eric, that definitely helps.

For context, these are from recent runs on Revio. Revio filtering seems to do a much better job of removing the standard blunt end adapter (NGB00972.1) and C2 primer (NGB00973.1), with only a read or two sneaking through every once in a while, but ULI sequences do appear to get through. In the vast majority of standard genome assemblies these appear to get tossed out if there is enough coverage; however, we've done a few PacBio metagenome assemblies, and this is where we see more hits using FCS-adapter. We're running a few tests locally on workflows to help mitigate this, for example setting up a BLASTN screen using vecscreen settings on the reads to see if we can find and trim/remove these prior to assembly.
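For anyone wanting a quick sanity check before setting up BLASTN, here is a much simpler stand-in in Python: an exact-match scan of a read for the ULI adapter sequence Eric posted (NGB03000.1) on both strands. This is not the vecscreen approach and will miss partial or mismatched hits; the function names are hypothetical and it only reports positions, it does not trim.

```python
# ULI adapter sequence as posted in this thread (NGB03000.1).
ULI_ADAPTER = "AAGCAGTGGTATCAACGCAGAGTACT"

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_adapter(read, adapter=ULI_ADAPTER):
    """Return sorted (start, end, strand) exact hits, 0-based half-open."""
    read = read.upper()
    adapter = adapter.upper()
    hits = []
    for strand, probe in (("+", adapter), ("-", reverse_complement(adapter))):
        pos = read.find(probe)
        while pos != -1:
            hits.append((pos, pos + len(probe), strand))
            # Step by one so overlapping occurrences are also reported.
            pos = read.find(probe, pos + 1)
    return sorted(hits)


def reads_with_adapter(reads, adapter=ULI_ADAPTER):
    """Yield (read_id, hits) for each read with at least one exact hit.

    `reads` is any iterable of (read_id, sequence) pairs.
    """
    for read_id, seq in reads:
        hits = find_adapter(seq, adapter)
        if hits:
            yield read_id, hits
```

Feeding `reads_with_adapter` the (ID, sequence) pairs from a FASTA/FASTQ parser of your choice gives a first list of reads worth a closer look; anything it flags should still be confirmed with a proper alignment-based screen.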

ncbi / fcs

[BUG]: Differences between FCS-adapter contaminant descriptions and UniVec? #102