Include new labels in uc output for dereplication if --relabel option provided

chlige commented 1 year ago

With this update, if --relabel, --relabel_sha1 or --relabel_md5 is provided with a --derep* command, the new sequence labels are included in the UC output file (column 10). For H records, the target is the new label. For S and C records, the target is the new label and column 9 (label of the query sequence) is the original label of the sequence.
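To illustrate how the proposed output could be consumed, here is a minimal Python sketch that groups original labels under their new label. It assumes the layout described above (tab-separated, ten columns, original label in column 9 and new label in column 10); the function name and the sample records are hypothetical and were not produced by vsearch.

```python
from collections import defaultdict


def members_by_new_label(uc_text):
    """Map each new label (column 10) to the original labels (column 9).

    Assumes the column layout proposed in this pull request. C records
    are skipped because they repeat the centroid already listed by its
    S record.
    """
    groups = defaultdict(list)
    for line in uc_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # not a complete UC record
        record_type, original_label, new_label = fields[0], fields[8], fields[9]
        if record_type in ("S", "H"):
            groups[new_label].append(original_label)
    return dict(groups)


# Hypothetical records written by hand for illustration only.
SAMPLE_UC = "\n".join([
    "S\t0\t4\t*\t*\t*\t*\t*\tseqA\tnew1",
    "H\t0\t4\t100.0\t*\t0\t0\t*\tseqB\tnew1",
    "C\t0\t2\t*\t*\t*\t*\t*\tseqA\tnew1",
])

if __name__ == "__main__":
    print(members_by_new_label(SAMPLE_UC))
```

With a mapping like this, each label in the relabelled FASTA output can be traced back to the input sequences that were collapsed into it.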

torognes commented 1 year ago

Could you please explain why you want to add this feature? Is there a specific use case for it?

It does not seem to be compatible with uc files generated by USEARCH, and we have strived for compatibility with that tool.

Also, the code has some problems. It does not seem necessary to store the new seqlabel in the hashtable, as it can easily be computed. The bp->seqlabel is also not properly initialized in all cases. The seq_digest_sha1 and seq_digest_md5 functions appear unnecessary, as the fprint_seq_digest_sha1 and fprint_seq_digest_md5 functions can be used to print the required info.
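As a rough illustration of the point that a digest-based label can be recomputed from the sequence instead of being stored, here is a minimal Python sketch. It assumes that the --relabel_sha1 and --relabel_md5 labels are hex digests of the sequence after upper-casing and replacing U with T; that normalization is my reading of the vsearch manual and should be checked against the version in use. relabel_digest is a hypothetical helper, not a vsearch function, and the sketch does not cover plain --relabel.

```python
import hashlib


def relabel_digest(sequence, algorithm="sha1"):
    """Return a hex digest label for a nucleotide sequence.

    Assumption: the sequence is upper-cased and U is replaced by T
    before hashing. Verify this against the vsearch documentation.
    """
    normalized = sequence.upper().replace("U", "T")
    return hashlib.new(algorithm, normalized.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Identical sequences give identical labels, so nothing has to be
    # stored alongside the cluster to recover the label later.
    print(relabel_digest("acgu") == relabel_digest("ACGT"))
```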

chlige commented 1 year ago

Torognes -

We were just looking to be able to track which sequences from the original file would be collapsed into the dereplicated sequences when the relabel option is used. Otherwise you cannot directly match the sequence labels in the output FASTA file with the information in the UC file.

I added the seq_digest_sha1 and seq_digest_md5 functions so that the label given to each cluster could be saved and included in the UC file later.

If you would rather not incorporate the change, that is fine. We have already compiled our own version of VSEARCH to use in our pipelines and could maintain our own branch on GitHub if needed.

Thanks.

George Chlipala

torognes commented 1 year ago

Thank you for your contribution, but I would rather not include this for the reasons explained above.

I hope vsearch is useful for you anyway.

torognes / vsearch

Include new labels in uc output for dereplication if --relabel option provided #515