Open cjfields opened 1 week ago
Hello,
Thank you for the report.
UniVec is set up as a non-redundant database: when new sequences are added, they are checked against the current set of sequences, and fragments or entire sequences are discarded if they share spans with existing records. The process starts from an existing UniVec database, then adds new sequence records. The reason these aren't synchronized is that I had been pulling UniVec sequences into the FCS-adaptor database and afterwards added additional labels in response to reports from users who were confused about the results.
In your example, you likely used the PacBio Ultra Low Input library (based on these results and the apparent PacBio SMRTbell blunt adaptor contamination), which has some shared subsequences with those other UniVec records. But FCS-adaptor might report top matches to NGB00596.1 / NGB00577.1, which was causing confusion for PacBio ULI users.
>gnl|uv|NGB03000.1:1-26 PacBio ULI gDNA amplification adapter
AAGCAGTGGTATCAACGCAGAGTACT
I will do a pass over the sequence headers in UniVec and see if anything else like this needs to be added. Hope this helps.
Eric
Thanks Eric, that definitely helps.
For context, these are from recent runs on Revio. Revio filtering seems to do a much better job of removing the standard blunt end adapter (NGB00972.1) and C2 primer (NGB00973.1), with only a read or two sneaking through every once in a while, but ULI sequences do appear to get through. In the vast majority of standard genome assemblies these appear to get tossed out if there is enough coverage; however, we've done a few PacBio metagenome assemblies, and this is where we see more hits using FCS-adapter. We're doing a few tests locally on workflows to help mitigate this, for example setting up a BLASTN screen using vecscreen settings on the reads to see if we can find and trim/remove these prior to assembly.
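As a rough illustration of that kind of read prescreen, here is a minimal Python sketch. It is not the BLASTN/vecscreen approach itself: it only looks for exact copies of the ULI adapter (sequence copied from the NGB03000.1 record quoted above) on either strand, so reads carrying adapters with mismatches or indels would be missed. The function names are hypothetical and not part of FCS-adaptor or any NCBI tool.

```python
# Adapter sequence copied from the NGB03000.1 record quoted in this thread.
ULI_ADAPTER = "AAGCAGTGGTATCAACGCAGAGTACT"

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_adapter_hits(read: str, adapter: str = ULI_ADAPTER):
    """Return sorted (start, end, strand) tuples for exact adapter matches.

    Coordinates are 0-based, half-open. Exact matching only: this is a
    simplification, an alignment-based screen would tolerate mismatches.
    """
    read = read.upper()
    adapter = adapter.upper()
    hits = []
    for strand, query in (("+", adapter), ("-", reverse_complement(adapter))):
        pos = read.find(query)
        while pos != -1:
            hits.append((pos, pos + len(query), strand))
            pos = read.find(query, pos + 1)
    return sorted(hits)


if __name__ == "__main__":
    # Toy read with the adapter embedded after four bases.
    example_read = "ACGT" + ULI_ADAPTER + "TTTT"
    for start, end, strand in find_adapter_hits(example_read):
        print(f"adapter hit at {start}-{end} on strand {strand}")
```

The reported coordinates could then be used to trim or drop reads before assembly; an alignment-based screen would still be needed to catch partial or error-containing adapter copies.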
Describe the bug
FCS description for contaminant is not present in UniVec
To Reproduce
Not applicable, FCS run completed. Question is with output and missing data from UniVec database
Software versions (please complete the following information):
Log Files
Not applicable, FCS run completed. Question is with output and missing data from UniVec database
Additional context
Thanks for a really useful screening tool! I had one simple question that has more to do with how the FCS adapter database is generated from the UniVec database. We are primarily interested b/c we'd like to ensure we're prescreening libraries before they go into assemblies, but we are likely not capturing sequences back from UniVec that we need.
As an example, we have the following output from fcs-adapter for a few recent test assemblies (marked up for this):
In the last column, many of the accession descriptions have PacBio ULI, which suggests this is likely the culprit for these samples.
However, in UniVec, the important part of that description is missing. For example:
Notice in the last two it's missing the important part: contains PacBio ULI adapter subsequence. We can use these results to help set up screening, but any idea where that more complete description is coming from?