Option for SISPA tag trimming step in workflow

MatFish commented 3 years ago

This tag (5′-GTTTCCCAGTCACGATA-3′) is introduced during the cDNA preparation as a way to universally amplify the cDNA. It is located inside of the adaptor/index sequences so it usually does not get trimmed off and is generally still in our raw reads which can make assembly difficult, especially in low coverage areas. Adding this as an optional step for when we want to analyze ViroCap data would make analysis easier and lead to less manual manipulation of the data.

peterk87 commented 3 years ago

Hey @MatFish would you happen to have the adapter trimming command you typically use for ViroCap data? Are you using cutadapt or Trimmomatic?

I'll try to find a dataset to try out the following:

$ cutadapt -g GTTTCCCAGTCACGATA \
    -o trimmed_reads.fq.gz \
    input_reads.fq.gz

Also, does ViroCap sometimes add multiple instances of the same adapter?

MatFish commented 3 years ago

I previously tried Trimmomatic but wasn't able to get it to work, even giving it the exact sequence. We also had some issues with cutadapt but were able to get it working relatively well by using -times field. It's strange because there shouldn't be multiple instances of the tag. This was quite a while ago so it's possible that something has changed since then.

I don't recall ever seeing multiple instances of the tag and since the tag is being added by a PCR primer, there really should be only 1 unless there is something I'm not considering that is happening.

I was able to dig up a command which is from a bash script I used to use. Hopefully that helps.

$  cutadapt -g XGTTTCCCAGTCACGATA \
      -G XGTTTCCCAGTCACGATA \
      -a TATCGTGACTGGGAAAC$ \
      -A TATCGTGACTGGGAAAC$ \
      -g XTATCGTGACTGGGAAAC \
      -G XTATCGTGACTGGGAAAC \
      -a GTTTCCCAGTCACGATA$ \
      -A GTTTCCCAGTCACGATA$ \
      --times 3 \
      --cores=0 \
      -o $file1_trim -p $file2_trim $file1 $file2

MatFish commented 3 years ago

I was just looking at some ViroCap data and noticed that there are some SISPA tags within the read. They are still toward the end of read, but not necessarily right at the very end of it. Still not seeing any instances of multiple tags.

MatFish commented 3 years ago

Here's a good example of what I see in some of the reads. You can see there in some cases the tag isn't right on the 5' end and there are also tags that do not exactly match the expected tag sequence. This is still present following running the above command I sent (from the looks of it, none of them are a 100% match to the tag so I assume that's why).

peterk87 commented 3 years ago

Thanks @MatFish for the command and visualization.

So basically cutadapt is looking for the adapter or it's reverse complement anywhere in the read or on either end up to 3 times?

Since the adapter is only 17 bp long, might it be found at random multiple times in some reads?

Would we mostly be interested in trimming the adapter from the 5' end?

MatFish commented 3 years ago

I haven't been able to find an instance of it appearing multiple times in a single read. I'm currently looking at some reference mapped ViroCap data, so I will keep an eye out. We initially tried the times command because it wasn't really working when we ran it without, even in instances where there was only 1 tag. Not sure why including that all of a sudden made it work. This was well over a year ago at this point, so it may have been updated since.

I think tags on the ends (you also see RC version of the tag on the 3' end of some reads) are the most important. A few tags sneaking through isn't a problem, but I have seen instances where there are a lot of reads with the tag in the exact sample position and they are throwing off the consensus sequence. During the ViroCap protocol there are several amplification steps, so I assume in some cases these reads are PCR replicates that are being propogated. When you look at the pile-up it's obvious it's just the tag and it's wrong, but in some case there are so many reads with the tag some of the tag bases are incorrectly incorporated into the consensus because they make up the majority of mapped sequences at that position since the section of the read downstream of the tag does match the reference. I also suspect that for de novo assembly, the tag may make it harder to assemble whole genomes because of areas like this where you have a pretty even mix of the tag and the actual genomic sequence. I imagine it would be even worse in particularly low coverage areas. So basically, I don't think we need to get rid 100% of the tags but enough to address the issues above. Hopefully that makes sense!

peterk87 / nf-villumina

Option for SISPA tag trimming step in workflow #21