interedition / collatex

CollateX – Software for Collating Textual Sources
http://collatex.net/
GNU General Public License v3.0

CollateX, bug, now with e-mail #85

Open kkynde opened 1 year ago

kkynde commented 1 year ago

I have an issue on CollateX, a bug actually, in the program, which I would like to present and ask how to deal with. Is there an e-mail to contact - I would like to include a (short) example.

Karsten Kynde karsten@kynde.dk

rhdekker commented 1 year ago

Hi Karsten,

Gregor Middell forwarded your example data files to me. I ran the collation and looked at the internal alignment result, represented as a variant graph, that is, the result independent of the chosen output format. It looks as follows:

[Figure: alignment_result_colx1_colx2, the variant graph of the alignment result]

This means that CollateX finds only two points of variation: a ":" (W1) replaced by a "," (W2), and "night" (W1) replaced by "knight" (W2). This seems correct to me.

If you agree with this then the question becomes how that internal result should be represented in the requested output format.

In the TEI output there is a <app><rdg wit="w1">Now, It was a dark and stormy</rdg><rdg wit="w2">Now, it was a dark and stormy</rdg></app><app> reading, which is what I suspect your report is about. CollateX doesn't find a meaningful semantic difference here, but it does notice a difference in casing: "it" versus "It". Changes in upper- and lowercasing are ignored by default during alignment, but we have to put them somewhere in the TEI to be able to reconstruct the original witnesses from the output. I think this is what causes the confusion, or in other words the difference in expectations.
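To make the distinction concrete, here is a minimal sketch of the idea, not CollateX's actual code. It assumes each token carries its original text and a normalized form, and that normalization is simply lowercasing; the `Token` class and `tokenize` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    t: str  # original text exactly as it appears in the witness
    n: str  # normalized form, used only for comparison during alignment


def tokenize(text: str) -> List[Token]:
    # Assumption: whitespace tokenization and lowercasing as the normalizer.
    return [Token(t=word, n=word.lower()) for word in text.split()]


w1 = tokenize("Now, It was a dark and stormy")
w2 = tokenize("Now, it was a dark and stormy")

# Compared on the normalized form, the two witnesses are identical ...
same_for_alignment = [a.n == b.n for a, b in zip(w1, w2)]

# ... but the original forms still differ at one position, so the output
# has to keep both if the witnesses are to be reconstructed.
differs_in_original = [a.t != b.t for a, b in zip(w1, w2)]
```

Under these assumptions the aligner sees no variation in this stretch, while the serializer still has two distinct original strings to emit.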

Before we discuss possible solutions: am I thinking in the right direction so far or is there something else that you wanted to bring to our attention with the example in your report?

Best, Ronald

kkynde commented 1 year ago

Dear Ronald Dekker

Thank you very much for your rapid reply. You are indeed thinking in the right direction.

It does confirm my suspicion that a change of case is somehow a difference and somehow not.

Your graph is correct in the sense that the different cases do not constitute a 'semantic difference'. Nevertheless, you have saved the 'not semantically different' version (It) somewhere. It is not represented in the graph (nor in the --format graphml output), but it does appear in the TEI output as two separate elements.

My problem is that the non-semantic difference (it vs. It) is thereby mixed up with the truly invariant text surrounding it, which may be very extensive. I would have expected either (it and It are different)

Now, <app><rdg wit="w1">It</rdg><rdg wit="w2">it</rdg></app> was a dark and stormy

or (it and It are not different, consistent with the graph)

Now, it was a dark and stormy

I do take your point that the latter would prevent you from reconstructing the original witnesses. I also think the former is preferable (the element could be given the attribute type="notSemanticalDifference"), but you would not be able to construct it from your graph unless you ran a recursive collation on the readings.
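What I mean by a recursive collation on the readings could be sketched roughly as follows. This is only an illustration under strong assumptions: the two readings were aligned as equal after normalization, they split on whitespace into the same number of tokens, and the function name `split_reading` is hypothetical. It does not use or describe any CollateX API.

```python
from xml.sax.saxutils import escape


def split_reading(r1: str, r2: str, wit1: str = "w1", wit2: str = "w2") -> str:
    """Second pass over two readings that matched after normalization.

    Tokens whose original forms agree are emitted as plain text; tokens
    that differ only in their original form get their own <app> element.
    """
    t1, t2 = r1.split(), r2.split()
    if len(t1) != len(t2):
        # Assumption of this sketch: differences in spacing are not handled.
        raise ValueError("readings must have the same number of tokens")
    out = []
    for a, b in zip(t1, t2):
        if a == b:
            out.append(escape(a))
        else:
            out.append(
                f'<app><rdg wit="{wit1}">{escape(a)}</rdg>'
                f'<rdg wit="{wit2}">{escape(b)}</rdg></app>'
            )
    return " ".join(out)


result = split_reading(
    "Now, It was a dark and stormy",
    "Now, it was a dark and stormy",
)
```

For the example above this yields the first of the two forms I expected, with the apparatus confined to the single token that differs.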

Have I missed something in the documentation stating that changes in upper and lower casing are ignored by default during alignment (BTW, the same goes for changes in spacing)? And does 'by default' mean that I can change this, as the documentation suggests, with the --script option? If so, how do I do this (back to the first question)?

Yours, Karsten Kynde