SEGUID v1: What input validation should be done for --type="seguid"?

louisabraham commented 8 months ago

I see two tests

@test "<CLI call> <<< \"aCGT\" gives error (invalid symbol)" {
    run "${cli_call[@]}" <<< "aCGT"
    assert_failure
}

@test "<CLI call> <<< \" ACGT\" gives error (invalid symbol)" {
    run "${cli_call[@]}" <<< " ACGT"
    assert_failure
}

HenrikBengtsson commented 8 months ago

Are you specifically thinking of SEGUID v1 (--type=seguid) here?

louisabraham commented 8 months ago

Yes. I thought that v1 does not validate input so I'm surprised.

HenrikBengtsson commented 8 months ago

I see. Good point; if we provide a SEGUID v1 implementation, we should align our behavior with that of the original implementation.

HenrikBengtsson commented 6 months ago

I thought about this one a little bit more. I think we could simply document that --type="seguid" is a conservative implementation of SEGUID v1. We could also clarify that our SEGUID v1 does not turn input sequences into all upper case - that is something the user needs to do manually - and our current implementation will produce an error if they don't.

The reason for suggesting this approach is that:

We don't have access to the original reference implementation of SEGUID (v1), so we don't actually know exactly how it works
We don't want to introduce an option to turning input sequences into upper case
We could have SEGUID v1 automatically turn everything into upper case, but again, we don't have the original implementation available, so we risk providing a second-best guess that potentially could be wrong.

Because of this, I think it's better to stay conservative and use the same validation as we use from SEGUID v2. I believe our "conservative" SEGUID v1 implementation is a superset of the original SEGUID v1. We will still be able to relax this in future releases, if we find it necessary.

HenrikBengtsson commented 6 months ago

DECISION: Our implementation SEGUID v1 should remain conservative. It's not a blocker, because anyone can always use --alphabet="..." to work around the validation (modulo some reserved symbols).

Help for seguid() for Python now says:

    The original definition of the SEGUID v1 checksum algorithm (Babnigg & Giometti, 2006)
    included transformation to uppercase before calculating the checksum.
    Here, ``seguid()`` does *not* coerce the input sequence to upper case. If your input sequence
    has lower-case symbols, you can use :meth:`str.upper` to achieve what the original method does.
    ``seguid()`` only accepts symbols as specified by the `alphabet` argument.
    Thus, our implementation is more conservative, which has the benefit of
    lowering the risk of passing the incorrect sequence by mistake.

I'll add the same to the R help.

seguid / seguid-tests

SEGUID v1: What input validation should be done for --type="seguid"? #6