Open TomSmithCGAT opened 9 months ago
Hi,
I believe this was implemented to avoid problems with STAR when a space is in the read name.
I am very happy to consider changing this behaviour (or making it optional).
If you have time, feel free to make a PR.
Thanks for taking the time to comment
Hmmm... STAR should be able to handle a space in the read name, in the sense that it will not throw an error.
From experience (and a wiki sanity check), standard Illumina FASTQ files usually have a space between the instrument and tile details (unique ID, flowcell, lane, tile, tile coordinates) and the rest of the field 1 information (pair info, filtering info, index sequence). Much older files are the exception. Archive file formats can also contain spaces in the read name.
I think perhaps the issue you're referring to is that STAR (and other aligners) will only output the first element of field 1 of the FASTQ input as the QNAME field in the output, thus ditching everything after the space. If you append the UMI to the end of the FASTQ field 1, it's therefore lost in alignment. `umi_tools extract` gets around that by adding the UMIs to the end of the first element of the FASTQ field 1. I'd propose doing that with ultraplex too, though I'll check that makes sense here as well.
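To make the difference concrete, here is a minimal sketch, not ultraplex or umi_tools code. The function names, the example header, and the `_` separator default are illustrative assumptions, and it assumes a single space separates the two parts of field 1. It mimics an aligner keeping only the text before the first whitespace as QNAME, and compares appending the UMI to the end of the whole header with appending it to the first element.

```python
def append_umi_to_read_id(header: str, umi: str, sep: str = "_") -> str:
    """Attach `umi` to the first space-delimited element of a FASTQ header.

    Hypothetical helper for illustration; `sep` is an assumed separator.
    """
    read_id, _, comment = header.partition(" ")
    tagged = f"{read_id}{sep}{umi}"
    # Keep the remaining header information (pair info, index, ...) if present.
    return f"{tagged} {comment}" if comment else tagged


def qname_from_header(header: str) -> str:
    """Mimic an aligner: drop the leading '@', keep text before the first whitespace."""
    return header.lstrip("@").split()[0]


header = "@M00001:23:000000000-ABCDE:1:1101:15589:1332 1:N:0:ATCACG"
umi = "ACGTT"

# UMI appended to the very end of field 1: it sits after the space and is dropped.
umi_at_end = f"{header}_{umi}"
qname_lost = qname_from_header(umi_at_end)

# UMI appended to the first element: it survives into QNAME.
umi_on_first = append_umi_to_read_id(header, umi)
qname_kept = qname_from_header(umi_on_first)

print(qname_lost)
print(qname_kept)
```

With the first approach the UMI never reaches the alignment output; with the second it is part of the QNAME that downstream tools see.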
Am I right in understanding that the tests have to be run and assessed manually?
Hi,
Yes, I just read more of the STAR manual, and indeed it seems the correct way to deal with the issue is to append the UMI to the first part of the read name, as you suggest. We found a rather non-optimal workaround.
I'll just check with my colleagues who also maintain this software about whether we agree to append to the first part instead.
Yes, I'm afraid there is no automatic testing. We can handle this modification, no need to make a PR (unless you are very keen!).
Always happier if someone else does the coding 😉
Happy to discuss further here if required though
I've added a new branch called `fix_umi_adding`, though I haven't had time to test it yet. I'll test it later when I find some time.
Hello,
Is there a reason why the read name information needs to be concatenated here?
https://github.com/ulelab/ultraplex/blob/be1841f65bdd14d310dadc4a4927028593d5cd1e/ultraplex/__main__.py#L585
I ask because it breaks downstream tools like `umi_tools dedup`, which use the read names to identify read pairs. Would it be tolerable to add the `rbc:` string to the end of the first element of the space-delimited read name?
Happy to issue a PR if that would work too. I note this would be a non-backward-compatible change if any other downstream tool/code is dependent upon the current read naming convention.
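A rough sketch of why pairing can break, under stated assumptions: `concatenate_header` below is a hypothetical stand-in for a scheme that joins the whole header into one token, not the actual code at the linked line, and the headers and UMI are made up. The point is only that once the mate-specific part (`1:...` vs `2:...`) ends up inside the name, the two mates no longer share a name.

```python
def first_token(header: str) -> str:
    """Name a downstream tool would see: text before the first whitespace, no '@'."""
    return header.lstrip("@").split()[0]


def concatenate_header(header: str, umi: str) -> str:
    """Hypothetical concatenating scheme: remove the space, then add 'rbc:' + UMI."""
    return header.replace(" ", "") + "rbc:" + umi


def tag_first_element(header: str, umi: str) -> str:
    """Proposed scheme: add 'rbc:' + UMI to the first space-delimited element only."""
    read_id, _, rest = header.partition(" ")
    tagged = f"{read_id}rbc:{umi}"
    return f"{tagged} {rest}" if rest else tagged


# Made-up mate headers that differ only in the pair info after the space.
r1 = "@SEQ:1:1101:1000:2000 1:N:0:ATCACG"
r2 = "@SEQ:1:1101:1000:2000 2:N:0:ATCACG"
umi = "ACGTT"

# Under the concatenating stand-in, the mate number leaks into the name.
concat_names_match = (
    first_token(concatenate_header(r1, umi)) == first_token(concatenate_header(r2, umi))
)

# Tagging only the first element keeps both mate names identical.
tagged_names_match = (
    first_token(tag_first_element(r1, umi)) == first_token(tag_first_element(r2, umi))
)

print(concat_names_match, tagged_names_match)
```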