interedition / collatex

CollateX – Software for Collating Textual Sources
http://collatex.net/
GNU General Public License v3.0

CollateX refuses Json input #76

Open hlapin opened 3 years ago

hlapin commented 3 years ago

Not sure if this repo is being maintained. This is possibly a version of #44. I am sending JSON of tokenized witnesses to CollateX via REST. With the witnesses in order A (working.txt) it works; in order B (nonworking.txt) CollateX returns an error. Hand-editing nonworking.txt so that the witnesses and their arrays of tokens are in the same order returns an alignment.
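For readers who want to reproduce the order sensitivity, here is a minimal sketch of building the same request body in two witness orders. It assumes the pre-tokenized input shape from the CollateX documentation (a `witnesses` array of objects with `id` and `tokens`, each token carrying a `t` key); the witness IDs and tokens are made up for illustration, and the actual POST to the REST service is left out.

```python
import json

def build_payload(witnesses):
    """Build a request body from (witness_id, [token strings]) pairs.

    The witnesses appear in the JSON in exactly the order given.
    Key names ("witnesses", "id", "tokens", "t") are assumed from the
    CollateX documentation; the data below is hypothetical.
    """
    return {
        "witnesses": [
            {"id": wid, "tokens": [{"t": tok} for tok in tokens]}
            for wid, tokens in witnesses
        ]
    }

# Hypothetical witnesses; the prefixes mimic the 001-, 002- convention.
order_a = [
    ("001-A", ["the", "quick", "fox"]),
    ("002-B", ["the", "fox"]),
    ("003-C", ["the", "quick", "brown", "fox"]),
]
# Same witnesses and IDs, with the last two swapped in the JSON only.
order_b = [order_a[0], order_a[2], order_a[1]]

payload_a = build_payload(order_a)
payload_b = build_payload(order_b)

# ensure_ascii=False keeps non-Latin tokens (e.g. Hebrew) readable.
body_b = json.dumps(payload_b, ensure_ascii=False)
```

Sending both bodies and comparing the responses isolates witness order as the only variable.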

nonworking.txt working.txt

rhdekker commented 3 years ago

Hi Hayim, I am able to reproduce the error. I am not yet sure what is causing it. The algorithm detects a transposition, but then something unexpected happens while the transposition is being processed.

rhdekker commented 3 years ago

During the alignment the algorithm traverses the graph. It turns out that not all the nodes are visited. The graph contains 29 nodes (excluding the start and end vertices) and only 20 (including the start vertex) are visited. The question now becomes why that is the case.
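As an illustration of this kind of check, the sketch below runs a topological traversal (Kahn's algorithm, chosen here as an assumption; the thread does not say which traversal CollateX uses) and reports which nodes were never reached. In this style of traversal, a cycle is one way to end up with fewer visited nodes than the graph contains. The node names are hypothetical.

```python
from collections import deque

def traverse(nodes, edges):
    """Return nodes in topological order using Kahn's algorithm.

    Nodes that sit on a cycle, or are only reachable through one,
    never reach in-degree zero and are therefore never visited.
    """
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = []
    while queue:
        node = queue.popleft()
        visited.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return visited

def unvisited(nodes, edges):
    """List the nodes the traversal failed to reach, in input order."""
    seen = set(traverse(nodes, edges))
    return [n for n in nodes if n not in seen]

nodes = ["start", "a", "b", "c", "end"]
acyclic = [("start", "a"), ("a", "b"), ("b", "c"), ("c", "end")]
cyclic = acyclic + [("c", "b")]  # back edge creates the cycle b -> c -> b

missing_acyclic = unvisited(nodes, acyclic)
missing_cyclic = unvisited(nodes, cyclic)
```

Comparing the visited count against the node count, as done above, turns a silent partial traversal into an explicit list of unreached nodes.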

hlapin commented 3 years ago

Thank you for looking at this. It is very vexing b/c it appears unpredictable.

rhdekker commented 3 years ago

I replaced the graph traversal algorithm with a well-known, tried and tested algorithm and it did not change the result. With this specific dataset, for some reason, not the whole graph is traversed. So I will need to look into it further.

hlapin commented 3 years ago

Thank you. I have reproduced this with other datasets. If you want another working dataset for comparison please let me know. FWIW, I have checked for hidden control characters etc., but could not find any.
rhdekker commented 3 years ago

Still investigating. Do you have a dataset that triggers this bug in a roman language by any chance? I understand that this is a weird request maybe, but I have a hard time figuring out what tokens should be aligned or transposed because I can't read the Hebrew text. Right now the algorithm states that 004-P179204:16:'השנ' and 004-P179204:17:'הרת' are transposed compared to the previous witnesses. Does that sound plausible to you?

hlapin commented 3 years ago

I don't have examples that generate the error with a Roman character set, and since we don't actually know what's causing it I'm not sure how to generate one. I could generate 1:1 character correspondences to a Roman character set and see if this triggers the same error. As for the transposition: no, it does not quite make sense. But S01520 has what is likely missing text (homoioteleuton) at this point, and switching the order of S01520 and P179204 IN THE JSON (this is the sole difference between working.txt and nonworking.txt) triggers the error. Thus (cells in LTR order):

12 13 14 15 16 17 18 19 20 21
S00483 … שלא בקדושה ולידתו בקדושה והשיני הורתו ולידתו בקדושה וכן
S07326 … שלא בקדושה ולידתו בקדושה והשיני הרתו ולדתו בקדושה וכן
P179204 … ראשון שלא בקדושה ולידתו בקדושה והשני הורתו ולידתו בקדושה וכן
S01520 … ראשון שלא בקדושה ולידתו בקדושה וכן

[In case it matters: the prefixes 001-, 002- etc. in the JSON are there to force CollateX to return responses in query order, not in alphabetical order; I could handle that in post-processing. nonworking.txt swaps the position in the JSON of 003- and 004- without changing the IDs.]
rhdekker commented 3 years ago

Thanks for pointing out that S01520 has what is likely missing text and that transpositions are not expected. That actually gives me a huge hint and a new direction in which to look into the issue.

rhdekker commented 3 years ago

A short update. I got a bit further in identifying the problem. The algorithm consists of several steps:

1. Find an optimal set of matches.
2. Identify transpositions.
3. Mark transpositions in the graph.
4. Traverse the graph (this is where it crashes).

At first I started looking at step 4, but that is not the cause of the crash. Then I turned my attention to step 2. If step 2 ignores a transposition, it causes a cycle in the graph, which makes the traversal fail. I thought that might be the problem. But after your previous post, indicating that there is a gap in one of the witnesses and no transpositions, I realised that the problem is rather that too many transpositions are found. I checked that piece of code multiple times and could not find a mistake. Then I realised that the problem might actually be in step 1. Each token of a witness should align with a unique vertex in the graph. It turns out that there is a bug somewhere in the code of step 1 that causes multiple tokens of the witness to be aligned with one and the same vertex. That should not happen, but somehow it does, causing steps 2, 3 and 4 to fail.
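The invariant for step 1 (each witness token aligns with its own vertex) can be checked mechanically. The following is only a sketch of such a check, not CollateX code: it assumes the matches are available as a mapping from token position to vertex ID, and the positions and vertex names are hypothetical.

```python
from collections import defaultdict

def duplicate_alignments(alignment):
    """Find vertices that more than one witness token is aligned with.

    `alignment` maps a witness token position to a vertex ID
    (a hypothetical representation, chosen for illustration).
    Returns {vertex_id: [token positions]} for every violation.
    """
    tokens_by_vertex = defaultdict(list)
    for token, vertex in alignment.items():
        tokens_by_vertex[vertex].append(token)
    return {
        vertex: tokens
        for vertex, tokens in tokens_by_vertex.items()
        if len(tokens) > 1
    }

# Tokens 16 and 20 both claim vertex "v7", which breaks the invariant.
broken = {15: "v6", 16: "v7", 17: "v8", 20: "v7"}
clean = {15: "v6", 16: "v7", 17: "v8", 20: "v9"}

violations = duplicate_alignments(broken)
no_violations = duplicate_alignments(clean)
```

Running a check like this right after step 1 would flag the problem before transposition detection and traversal ever see the bad matches.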

hlapin commented 3 years ago

Thanks so much for the update! For the time being I have the workaround of using the option "algorithm":"needleman-wunsch". In fact, since I am using JSON tabular output rather than graph output, I am not at present actually getting the benefit of detected transpositions.
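A minimal sketch of applying that workaround to a request body, assuming the option is a top-level key next to `witnesses` (the witness data and the helper name `with_algorithm` are hypothetical):

```python
import json

def with_algorithm(payload, algorithm="needleman-wunsch"):
    """Return a copy of the request body with the alignment algorithm set.

    Assumes "algorithm" sits at the top level of the JSON input,
    alongside "witnesses". The original payload is left unchanged.
    """
    updated = dict(payload)
    updated["algorithm"] = algorithm
    return updated

# Hypothetical two-witness payload.
payload = {
    "witnesses": [
        {"id": "001-A", "tokens": [{"t": "the"}, {"t": "fox"}]},
        {"id": "002-B", "tokens": [{"t": "the"}, {"t": "quick"}, {"t": "fox"}]},
    ]
}

workaround = with_algorithm(payload)
body = json.dumps(workaround, ensure_ascii=False)
```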