seguid / seguid-tests

Unit tests for any SEGUID implementations
https://www.seguid.org
0 stars 0 forks source link

SEGUID v1: What input validation should be done for --type="seguid"? #6

Closed louisabraham closed 6 months ago

louisabraham commented 8 months ago

I see two tests

@test "<CLI call> <<< \"aCGT\" gives error (invalid symbol)" {
    run "${cli_call[@]}" <<< "aCGT"
    assert_failure
}

@test "<CLI call> <<< \" ACGT\" gives error (invalid symbol)" {
    run "${cli_call[@]}" <<< " ACGT"
    assert_failure
}
HenrikBengtsson commented 8 months ago

Are you specifically thinking of SEGUID v1 (--type=seguid) here?

louisabraham commented 8 months ago

Yes. I thought that v1 does not validate input so I'm surprised.

HenrikBengtsson commented 8 months ago

I see. Good point; if we provide a SEGUID v1 implementation, we should align our behavior with that of the original implementation.

HenrikBengtsson commented 6 months ago

I thought about this one a little bit more. I think we could simply document that --type="seguid" is a conservative implementation of SEGUID v1. We could also clarify that our SEGUID v1 does not turn input sequences into all upper case - that is something the user needs to do manually - and our current implementation will produce an error if they don't.

The reason for suggesting this approach is that:

Because of this, I think it's better to stay conservative and use the same validation as we use from SEGUID v2. I believe our "conservative" SEGUID v1 implementation is a superset of the original SEGUID v1. We will still be able to relax this in future releases, if we find it necessary.

HenrikBengtsson commented 6 months ago

DECISION: Our implementation SEGUID v1 should remain conservative. It's not a blocker, because anyone can always use --alphabet="..." to work around the validation (modulo some reserved symbols).

Help for seguid() for Python now says:

    The original definition of the SEGUID v1 checksum algorithm (Babnigg & Giometti, 2006)
    included transformation to uppercase before calculating the checksum.
    Here, ``seguid()`` does *not* coerce the input sequence to upper case. If your input sequence
    has lower-case symbols, you can use :meth:`str.upper` to achieve what the original method does.
    ``seguid()`` only accepts symbols as specified by the `alphabet` argument.
    Thus, our implementation is more conservative, which has the benefit of
    lowering the risk of passing the incorrect sequence by mistake.

I'll add the same to the R help.