marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Keep UMI after adapter sequence? #729

Closed by jamesboot 12 months ago

jamesboot commented 1 year ago

Hello,

We have single-end sequencing reads with the following layout: the first 'n' bases are the miRNA sequence, followed by a known adapter sequence, then a UMI, and finally junk bases up to the end of the read.

Is it possible in Cutadapt to keep the sequence immediately after the adapter (in this case, a UMI) and add it to the read header?

Alternatively, is it possible to save the trimmed (unwanted) sequences to a separate file for further processing?

Because the number of miRNA bases at the start of the read and the number of junk bases at the end both vary, we can't specify the exact position of the UMI.

Thanks,
James

marcelm commented 1 year ago

This is not fully doable with Cutadapt alone, but the --rename option with {match_sequence} should get you close, see https://cutadapt.readthedocs.io/en/stable/guide.html#read-renaming.

The trick would be to include the UMI in the adapter sequence. So if the adapter is ACGTACGT and the UMI has 8 bases, you would use -a ACGTACGTN{8} (N{8} is understood as NNNNNNNN).

Then you can move the matched sequence into the read header with something like --rename '{header} match={match_sequence}', where {match_sequence} is replaced with the part of the read that matched the adapter. Since the match includes the UMI, this also transfers the UMI. Afterwards, you would have to post-process the header with another tool to strip the adapter part, so that only the UMI remains.

Example:

echo -e '>readid\natatatACGTGGGGcccc' | \
  cutadapt -a ACGTNNNN --rename '{header} {match_sequence}' -
>readid ACGTGGGG
atatat
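
The remaining step, reducing the matched sequence in the header to just the UMI, could be handled by a small script. The following is a minimal Python sketch and not part of Cutadapt. It assumes the reads were renamed with --rename '{header} match={match_sequence}', that the adapter and UMI lengths are known, and that the adapter matched in full without insertions or deletions. The function name and the umi= tag are hypothetical choices for illustration.

```python
import re

# Matches the " match=..." tag that --rename '{header} match={match_sequence}'
# appends to the header (the tag is empty when no adapter was found).
MATCH_TAG = re.compile(r" match=([A-Za-z]*)$")


def move_umi_to_header(header, adapter_len, umi_len):
    """Replace ' match=<adapter+UMI>' with ' umi=<UMI>' in one header line.

    Hypothetical helper: assumes the match is exactly adapter_len + umi_len
    bases long. Shorter or longer matches (partial adapter at the read end,
    indels) are treated as "no usable UMI" and the tag is simply dropped.
    """
    m = MATCH_TAG.search(header)
    if m is None:
        return header  # header carries no match tag, leave it untouched
    base = header[: m.start()]
    matched = m.group(1)
    if len(matched) != adapter_len + umi_len:
        return base
    return f"{base} umi={matched[adapter_len:]}"


if __name__ == "__main__":
    print(move_umi_to_header(">readid match=ACGTGGGG", adapter_len=4, umi_len=4))
```

If the example read had been renamed with the match= form, its header `>readid match=ACGTGGGG` would become `>readid umi=GGGG` with a 4-base adapter and a 4-base UMI. The function is meant to be applied to header lines only, so sequence and quality lines should be passed through unchanged.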

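Alternatively, write an info file with --info-file, which records for each read where the adapter matched and which parts of the read were removed, see https://cutadapt.readthedocs.io/en/stable/reference.html#info-file-format .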
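
As a rough illustration of the info-file route, the Python sketch below collects a UMI per read from info-file lines. It rests on assumptions that should be checked against the linked reference: that the file is tab-separated, that column 1 is the read name, that column 2 is -1 for reads without an adapter match, and that column 7 holds the sequence to the right of the match. It also assumes Cutadapt was run with the plain adapter (without the N{8} suffix), so that the UMI is the first bases of column 7. The function name is hypothetical.

```python
def umis_from_info_lines(lines, umi_len):
    """Map read name -> UMI using lines of a Cutadapt info file.

    Assumed column layout (verify against the info file documentation):
    fields[0] = read name, fields[1] = error count (-1 means no match),
    fields[6] = sequence to the right of the adapter match.
    """
    umis = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7 or fields[1] == "-1":
            continue  # no adapter found in this read
        right_of_adapter = fields[6]
        if len(right_of_adapter) >= umi_len:
            umis[fields[0]] = right_of_adapter[:umi_len]
    return umis


if __name__ == "__main__":
    example = [
        "readid\t0\t6\t10\tatatat\tACGT\tGGGGcccc\t1\t\t\t\n",
        "other\t-1\tacgtacgt\t\n",
    ]
    print(umis_from_info_lines(example, umi_len=4))
```

The example lines are hand-written to mimic the assumed layout, not real Cutadapt output. With an actual run, the function could be given an open file handle, for example `umis_from_info_lines(open("info.tsv"), 8)`, where info.tsv is a placeholder file name.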

jamesboot commented 12 months ago

Hi Marcel,

Many thanks for the reply and suggestion - this worked really well. I've now got the UMIs in the header of my reads so should be able to carry on from there.

Cheers,
James

marcelm commented 12 months ago

Nice, thanks for the feedback!