Could the read name modification be improved?

TomSmithCGAT commented 9 months ago

Hello,

Is there a reason why the read name information needs to be concatenated here?

https://github.com/ulelab/ultraplex/blob/be1841f65bdd14d310dadc4a4927028593d5cd1e/ultraplex/__main__.py#L585

I ask because it breaks downstream tools like umi_tools dedup, which use the read names to identify read pairs. Would it be tolerable to add the 'rbc:' string to the end of the first element of the space-delimited read name instead?
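As a rough sketch of that proposal (not ultraplex code: the function name, the example header and the barcode are made up for illustration, and it assumes the 'rbc:' tag is joined directly to the read ID with no extra separator):

```python
def add_barcode_to_read_id(header: str, barcode: str, tag: str = "rbc:") -> str:
    """Append tag + barcode to the first space-delimited element of a FASTQ header.

    Hypothetical helper for illustration only; ultraplex's real naming
    code lives in ultraplex/__main__.py and may differ.
    """
    # partition() splits on the first space only, so the rest of the
    # header (pair info, filter flag, index sequence) is left untouched.
    read_id, sep, rest = header.partition(" ")
    return f"{read_id}{tag}{barcode}{sep}{rest}"


# Made-up Illumina-style header and barcode.
header = "@M00001:1:000000000-ABCDE:1:1101:15589:1332 1:N:0:ATCACG"
print(add_barcode_to_read_id(header, "AACCGG"))
```

A header without any space simply gets the tag and barcode appended at the end, so older single-element read names are handled the same way.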

Happy to issue a PR if that would work too. I note this would be a non-backward-compatible change if any other downstream tool/code depends on the current read naming convention.

Delayed-Gitification commented 9 months ago

Hi,

I believe this was implemented to avoid problems with STAR when a space is in the read name.

I am very happy to consider changing this behaviour (or making it optional).

If you have time, feel free to make a PR.

Thanks for taking the time to comment

TomSmithCGAT commented 9 months ago

Hmmm... STAR should be able to handle a space in the read name, in the sense that it will not throw an error.

From experience (and a wiki sanity check), standard Illumina fastqs usually have a space between the instrument and tile details (unique ID, flowcell, lane, tile, tile coordinates) and the rest of the field 1 information (pair info, filtering info, index sequence); much older files are the exception. Archive file formats can also contain spaces in the read name.

I think perhaps the issue you're referring to is that STAR (and other aligners) will only output the first element of the field 1 of the fastq input as the QNAME field in the output, thus ditching everything after the space. If you append the UMI to the end of the fastq field 1 it's therefore lost in alignment. umi_tools extract gets around that by adding the UMIs to the end of the first element of the fastq field 1. I'd propose doing that with ultraplex too, though I'll check that makes sense here as well.
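To illustrate the point about QNAME truncation, here is a small sketch. It only mimics the behaviour described above (keep the first whitespace-delimited element of field 1); it does not call STAR, and the headers, UMI and function name are invented for the example:

```python
def qname_from_header(header: str) -> str:
    """Mimic an aligner that keeps only the first whitespace-delimited
    element of FASTQ field 1 (without the leading '@') as the QNAME."""
    name = header[1:] if header.startswith("@") else header
    return name.split()[0]


umi = "AACCGG"
# Made-up headers for the two mates of one read pair.
r1 = "@INSTR:1:FC:1:1101:1000:2000 1:N:0:ATCACG"
r2 = "@INSTR:1:FC:1:1101:1000:2000 2:N:0:ATCACG"

# UMI appended to the very end of field 1: dropped along with the comment.
lost = qname_from_header(f"{r1} rbc:{umi}")

# UMI appended to the first element: survives, and both mates still
# share the same QNAME, which pair-aware tools rely on.
kept = []
for header in (r1, r2):
    read_id, _, comment = header.partition(" ")
    kept.append(qname_from_header(f"{read_id}rbc:{umi} {comment}"))

print(lost)
print(kept[0], kept[0] == kept[1])
```

With these inputs the first QNAME carries no UMI at all, while the second form keeps it and stays identical across the pair.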

Am I right in understanding that the tests have to be run and assessed manually?

Delayed-Gitification commented 9 months ago

Hi,

Yes, I just read more of the STAR manual, and indeed it seems the correct way to deal with the issue is to append the UMI to the first part of the read name, as you suggest. We found a rather suboptimal workaround.

I'll just check with my colleagues who also maintain this software about whether we agree to append to the first part instead.

Yes, I'm afraid there is no automatic testing. We can handle this modification, no need to make a PR (unless you are very keen!).

TomSmithCGAT commented 9 months ago

Always happier if someone else does the coding 😉

Happy to discuss further here if required though

Delayed-Gitification commented 9 months ago

I've added a new branch called fix_umi_adding, though I haven't had time to test it yet. I will test it later when I find some time.

ulelab / ultraplex

Could the read name modification be improved? #52