marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars 130 forks source link

extract cut adaptor sequence in the similar way as `{cut_prefix}` #529

Open y9c opened 3 years ago

y9c commented 3 years ago

It is very common that sequencing reads carry UMI index on the 5' or 3' end. If the UMI sequences are fix in length, by using -u argument, the UMI for each sequence can be obtain and put in the read name by a --rename='{id} {cut_prefix}' argument.

For vary length UMI index, there is also a cutadapt based solution. The adjacent primer sequence of the UMI can be assigned as adapter, and the --action=retain argument can remove those UMI index, but keep the sequence content. However, how can I write the cut part, the UMI sequence, into read name? Can cutadapt define a extract {placeholdername} for this?

y9c commented 3 years ago

If there is such {placeholdername}, user can keep the cut_sequence in the read name.

Maybe the trimmed part can be access in this way, {adapter.n1}, {adapter.n2}... n1, n2, ... are the names of the adapters.

y9c commented 3 years ago

relative issue: https://github.com/marcelm/cutadapt/issues/437 https://github.com/marcelm/cutadapt/issues/443

marcelm commented 3 years ago

I wonder how to best implement this. Would it be sufficient if I added placeholders like {before_match} and {after_match}, which would be replaced with the sequence before the first match and the sequence after the first match, respectively? This would work independent of whether --action=retain is specified.

y9c commented 3 years ago

Hi @marcelm,

I am also thinking the implement of this. For those descriptive features, like read name ({id}), read comment ({comment}), adapter name ({adapter_name}), it is straightforward to defined one label for one feature.

But for sequence features, like cut sequence based on position, although we can define {cut_prefix} for the 5' one and {cut_suffix} for the 3' one, the naming is always confusing. We can not tell whether this feature is 'cut' by position or by adapter, meanwhile, it is difficult to tell whether the cut part or the remaining part will be sent to the placeholder. If we add more placeholders for those sequence based features, the situation will become worse. because there will be cross-talk between the adapters and other trimming arguments.

A straightforward solution is index based slicing, user can picker any part of the cut sequence by a index. For example, if sequence is cut by -u 5,-5 arg, {seq[0]} will be the {cut_prefix} and {seq[-1]} will be the {cut_suffix}. For the case mentioned in this post, {seq[0]} will be the 5' UMI and {seq[-1]} will be the 3' UMI, meanwhile the primer pairs can be fetch by {seq[1]} and {seq[-2]}. With this implement, you can also fetch a series of features, like {seq[0:2]} for the trimmed part by the forward primer. Of cause, a number based implement is not that explicit, name of the adapter can also be used, such as {seq["adapter1"]}...

y9c commented 2 years ago

Hi @marcelm, I would like to know if this feature is still on the roadmap?

marcelm commented 2 years ago

Hi @y9c there isn’t really a roadmap, just a bunch of open issues from which I pick one to work on from time to time.

As I said before, I can add something like {before_match} and {after_match} placeholders. That would be relatively simple to implement, but you need to say whether that is helpful in your particular case.

The more general solution with {seq[0]} etc. is possibly a nice idea, but requires a lot of thought and implementation work to make it work in practice, mostly because one has to think through all the possible cases: Which indices should be used if both adapter trimming and --cut is used at the same time? What if quality trimming is enabled, should the quality-trimmed part also also get its own index? What happens if there is no match? How would this work when --times is used (then one adapter can be found multiple times)? All of this also needs to be explained in the documentation.

marcelm commented 2 years ago

Note to self, this works:

>>> "{[1]}".format(["x", "y"])
'y'
>>> "{[a]}".format(dict(a="b"))
'b'

And even this:

>>> class A:
...  pass
>>> a = A()
>>> a.x = 17
>>> "{.x}".format(a)
'17'