marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

UMI '--rename' regex/utility #736

Open joegeorgeson opened 8 months ago

joegeorgeson commented 8 months ago

Hi cutadapt devs,

I want to use cutadapt --rename to copy only the meaningful portion of the UMI into the read name, while still trimming the entire UMI from the read. If this is currently possible, can you let me know how I might change the command below to achieve this? If it's not possible, could this be added as a feature in a future version? TIA!

A simple example: I want to cut the first 7 bp from R1 but add only the first 6 bp to the read name:

cutadapt -u 7 --rename='{id} {comment} $(echo {cut_prefix} | cut -c1-6)'

What I'm hoping to get is: 1:2101:13928:1000 1:N:0:GTCGCCTT+AAA/ACTAATT NTTTAT

But what is returned is: 1:2101:13928:1000 1:N:0:GTCGCCTT+AAA/ACTAATT $(echo NTTTATT | cut -c1-6)

marcelm commented 8 months ago

Hi, I agree this would be nice to have, but it’s currently not possible.
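For context, here is a small sketch of why the attempted command prints the `$(...)` text verbatim. It only demonstrates standard shell quoting and does not call cutadapt. The variable names `template` and `expanded` are made up for illustration. Inside single quotes the shell passes `$(...)` through untouched, so cutadapt receives it as plain text and only fills in the `{...}` placeholders. Switching to double quotes would not help either, because the shell would then run the substitution once, before cutadapt starts, on the literal text `{cut_prefix}` rather than on each read's prefix.

```shell
# Single quotes: the shell does no expansion, so the program receives
# the text "$(echo ... | cut -c1-6)" as an ordinary part of the template.
template='{id} {comment} $(echo {cut_prefix} | cut -c1-6)'
printf '%s\n' "$template"

# Double quotes: the shell runs the substitution immediately, once,
# on the literal string "{cut_prefix}", long before any read is seen.
expanded="{id} {comment} $(echo {cut_prefix} | cut -c1-6)"
printf '%s\n' "$expanded"
# second line printed: {id} {comment} {cut_p
```

In both cases the per-read shortening never happens, which is why a separate post-processing step is needed.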

For the moment, you will have to postprocess your read names. Maybe something like this:

cutadapt -u 7 --rename '{header} {cut_prefix}' input.fastq.gz | \
  awk 'NR%4==1 {$3=substr($3,1,6)};1' | \
  gzip > output.fastq.gz
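
The awk step can be tried on a single synthetic record without running cutadapt. The following is a minimal sketch, assuming headers of the form shown in the question (ID, one comment field, then the 7-base prefix appended last). The function name `trim_umi` is hypothetical, and using `$NF` (the last field) instead of `$3` is a variation on the command above, chosen so the step still trims the prefix when the comment is empty or contains a different number of fields.

```shell
# Hypothetical helper: on every FASTQ header line (lines 1, 5, 9, ...),
# shorten the last whitespace-separated field to its first 6 characters.
# All other lines (sequence, "+", qualities) pass through unchanged.
trim_umi() {
  awk 'NR%4==1 {$NF=substr($NF,1,6)};1'
}

# One made-up record shaped like the output of
# --rename '{header} {cut_prefix}' with -u 7.
printf '%s\n' \
  '@1:2101:13928:1000 1:N:0:GTCGCCTT+AAA/ACTAATT NTTTATT' \
  'ACGTACGT' \
  '+' \
  'IIIIIIII' | trim_umi
# header line printed: @1:2101:13928:1000 1:N:0:GTCGCCTT+AAA/ACTAATT NTTTAT
```

Note that assigning to a field makes awk rebuild the header with single spaces between fields, so any tabs or repeated spaces in the original header are not preserved.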