seguid / seguid-tcl

SEGUID v2: Checksums for Linear, Circular, Single- and Double-Stranded Biological Sequences
0 stars 0 forks source link

alphabet command line option #6

Closed mwdavis2 closed 2 months ago

mwdavis2 commented 3 months ago

are the predefined alphabet command line options ({DNA}, {protein-extended} etc.) case-sensitive? I made it a case-sensitive match, but I can make the match case-insensitive. Should there be an error if there's an unrecognized braced string?

Are the alphabet lists assumed to be well-behaved, or should there be some error checking here?

HenrikBengtsson commented 3 months ago

are the predefined alphabet command line options ({DNA}, {protein-extended} etc.) case-sensitive? I made it a case-sensitive match, but I can make the match case-insensitive.

Yes, they are.

The reason being that we want to be able to support alphabets that are case-sensitive, such as the expanded epigenetic DNA alphabet (Viner et al., 2024), which has sequences such as TTGmGCAA, AT2hAAAT, and TGCm1m1.

Also, being case-sensitive allows us to check for potential mistakes, e.g. if ACgT is passed to SEGUID, we can assume g is a mistake and ask the user to fix it. We let higher-level APIs and GUIs to decide whether they want to do conversion to upper case automatically or not, but, by design, we do not want that at this low level.

Should there be an error if there's an unrecognized braced string?

Yes, any unknown input, including unknown alphabet specifications, or syntax errors, should be detected and produce an error, e.g.

$ python -m seguid --alphabet="{dNA}" <<< "ACGT"
...
ValueError: Unknown alphabet specification: {dNA}

$ python -m seguid --alphabet="{DNA" <<< "ACGT"
...
ValueError: Unknown alphabet specification: {DNA

I can add a few more tests for this to seguid-tests.

Are the alphabet lists assumed to be well-behaved, or should there be some error checking here?

I'm not sure that I understand the question - can you please clarify?

We have a few tests in place that validate the predefined alphabets, e.g. asserting that all symbols that should be supported are supported. They're at https://github.com/seguid/seguid-tests/blob/8ab16bda5e94a4630887350e48297c0b6527a65f/tests/cli.bats#L560-L598

HenrikBengtsson commented 3 months ago

are the predefined alphabet command line options ({DNA}, {protein-extended} etc.) case-sensitive? I made it a case-sensitive match, but I can make the match case-insensitive.

Re-reading this question, I realized you might have specifically asked about the predefined alphabets. Yes, they are case sensitive, e.g.

$ python -m seguid --alphabet="{DNA}" <<< "ACGT"
seguid=IQiZThf2zKn/I1KtqStlEdsHYDQ

$ python -m seguid --alphabet="{dna}" <<< "ACGT"
ValueError: Unknown alphabet specification: {dna}

Basically, the rule of thumb is: We try to be as strict as possible, whenever allowed. This helps minimize the risk for mistakes silently slipping through without notice.

mwdavis2 commented 3 months ago

Thanks. Your second comment is correct- I was asking if {DNA} is the same or different from {dna}, and if different whether an error needs to be thrown that {dna} is unrecognized. My version is luckily correct- it's doing case-sensitive matching. I'll write an error trap for unrecognized alphabet specification.

My question on well-behaved enumerated alphabet lists is do we trap an error for things like: AT,CG,W A,B,C,DE ABC,DE,FG A,B ,C,D (note the space) A;B;C;D A,,B,C,D supplying a single-stranded alphabet for double-stranded input, or vice versa?

or can we assume the user will not be this error-prone?

HenrikBengtsson commented 3 months ago

My question on well-behaved enumerated alphabet lists is do we trap an error for things like: AT,CG,W A,B,C,DE ABC,DE,FG

These are errors.

They should fail an assertion that all component are of the same length, e.g. AT,CG,W and A,B,C,DE have components of length 1 and 2. ABC,DE,FG has components of length 2 and 3. They should all have a single unique length.

Also, ABC,DE,FG should fail because the components are not of length 1 or length 2, which are the only valid ways to specific alphabets.

A,B ,C,D (note the space)

We're treating this as an error too. The reason being that there is a (teeny) risk that it's due to a input mistake. I say it falls under the "try to be as strict as possible, whenever allowed" objective.

A;B;C;D

Error, because it makes the specification unnecessarily ambiguous. We'll end up with some users using commas and others semicolons, which might add confusion, complicates validated (e.g. what about A,B;C,D?).

A,,B,C,D

Error to, because it might indicate a mistake (e.g. a missing symbol). Adding the assertion that all components should be of the same length, that would catch this because you'll get length 0 and length 1 here.

FWIW, the JavaScript, Python, and R implementations produce an error for all of the above, e.g.

$ python -m seguid --alphabet="A,B ,C,D" <<< "A"
    assert n == n_expected

$ Rscript -e seguid::seguid --alphabet="A;B;C;D" <<< "A"
Error in make_alphabet(parts) : n %in% 1:2 is not TRUE

$ npx seguid --alphabet="A,,B,C,D" <<< "A"
Error: Alphabet contains strings of different lengths

At some point, we should improve on error messages (some of them are not very informative) and also harmonize them across implementations. But we can punt on that - the most important thing is to detect that there are errors.

HenrikBengtsson commented 3 months ago

supplying a single-stranded alphabet for double-stranded input, or vice versa?

If a user supplies a valid single-stranded alphabet, but then inputs a double-stranded sequence, it should be an error. This will most likely happen naturally at some point in the code anyway, because a double-stranded algorithm won't have enough information in the single-stranded alphabet to validate that bases are complementary. Our different implementations detect this in different places in their code, e.g.

$ npx seguid --alphabet="A,C,G,T" <<< "AA;TT"
Error: Invalid sequence

$ python -m seguid --alphabet="A,C,G,T" <<< "AA;TT"
ValueError: Detected symbols ';' in not in the 'alphabet'

$ Rscript -e seguid::seguid --alphabet="A,C,G,T" <<< "AA;TT"
Error in assert_in_alphabet(seq, alphabet = names(alphabet)) : 
  Sequence symbols not in alphabet: [n=1] ‘;’ (not in ‘A’, ‘C’, ‘G’, ‘T’)

Also here, we could and should improve and harmonize error messages, but let's aim for that after the first release milestone.

... or can we assume the user will not be this error-prone?

No, we don't just trust the input to be correct. As mentioned before, higher-level APIs and GUIs could do some automatic cleanup of input if, but not at this low-level API. For instance, in your ApE, which targets double-stranded DNA, you can make more assumptions about user's intent, so you can auto-fix some of the input without risking mistakes passing through.

mwdavis2 commented 3 months ago

Thanks for your extensive replies. I think I've trapped several of these potential errors even though they aren't yet in the test suite.

HenrikBengtsson commented 2 months ago

I think we can close this one. We now have more tests asserting that we catch invalid alphabet specifications, cf. https://github.com/seguid/seguid-tests/blob/main/tests-cli/90.exceptions.bats#L142.