mir3z / texthighlighter

-- NO LONGER MAINTAINED -- TextHighlighter allows you to highlight text on web pages.
MIT License

deserialize fails to restore highlights #16

Open zippy opened 9 years ago

zippy commented 9 years ago

The deserialization code does not work in all cases. Here is an example taken from http://mir3z.github.io/texthighlighter/demos/serialization.html. The following serialization string, which the demo prints to the console, does not restore all of the highlights when deserialized, and it also throws some errors.

In the case below, note that the third highlight (the word "Donec") doesn't get highlighted on deserializing.

[["<span class=\"highlighted\" data-timestamp=\"1437222899270\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>"," ac ","7:3",0,4],["<span class=\"highlighted\" data-timestamp=\"1437222885083\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","ipsum","3:1",27,5],["<span class=\"highlighted\" data-timestamp=\"1437222886941\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","Donec","3:5",18,5],["<span class=\"highlighted\" data-timestamp=\"1437222897406\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","ellentesque leo nulla, porta non lectus eu, ullamcorper semper est. Nunc ","3:9",188,73],["<span class=\"highlighted\" data-timestamp=\"1437222903031\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","Sed ut pretium leo, quis vehicula diam. Proin nisi metus, elementum ut mi port","13:1",148,78],["<span class=\"highlighted\" data-timestamp=\"1437222897406\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","\n                    risus vel","3:11",0,30],["<span class=\"highlighted\" data-timestamp=\"1437222899270\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","suada auctor. Ut ","7:1",140,17],["<span class=\"highlighted\" data-timestamp=\"1437222901358\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>"," ante eu mollis. In nec\n                    dui vel mauris lacinia vulputate id nec turpis. 
Aliquam vestibulum, elit sit amet fringilla\n                    malesuada, quam nunc eleifend nunc, id iaculis est neque pretium libero.\n                ","9:3",0,245],["<span class=\"highlighted\" data-timestamp=\"1437222886052\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","consectetur","3:3",17,11],["<span class=\"highlighted\" data-timestamp=\"1437222901358\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","attis tellus. Fusce orci nisi,\n                    ultricies vel hendrerit id, egestas id turpis. Proin cursus diam tortor, sed ullamcorper eros commodo\n                    vitae. Aenean et maximus sapien. Nam felis velit, ullamcorper eu turpis ut, hendrerit accumsan augue.\n                    Nulla et purus sem. Ut at hendrerit purus. ","9:1",213,338],["<span class=\"highlighted\" data-timestamp=\"1437222904383\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","ntum vitae\n                    lectus. Phasellus ut purus commodo ante iaculis molestie. Integer turpis felis, pellentesque eu\n                    dignissim vel, sodales vel metus. Aliquam tempus lorem odio. Sed purus arcu, auctor eget sodales\n                    ac, venenatis ac velit. 
Pra","17:1",222,291],["<span class=\"highlighted\" data-timestamp=\"1437222901358\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","Phasellus mollis commodo","9:2:0",0,24],["<span class=\"highlighted\" data-timestamp=\"1437222899270\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","egestas elit","7:2:0",0,12],["<span class=\"highlighted\" data-timestamp=\"1437222897406\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","convallis","3:10:0",0,9],["<span class=\"highlighted\" data-timestamp=\"1437222901358\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","First Name","11:1:1:1:0",0,10],["<span class=\"highlighted\" data-timestamp=\"1437222901358\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","Last Name","11:1:1:3:0",0,9]]

Here's one of the errors it logs to the console:

"Can't deserialize highlight descriptor. Cause: IndexSizeError: Failed to execute 'splitText' on 'Text': The offset 188 is larger than the Text node's length."

I use this library in a project and like it very much, but this bug affects my project quite frequently. I've tried to figure out why the deserialization code has this problem, but haven't been able to. Any help you can offer would be greatly appreciated. If you need more sample cases, I'm happy to provide them.
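
For anyone trying to reproduce this, here is a minimal sketch of one way to inspect a serialization string before handing it to the library. It assumes, based only on the sample above, that each descriptor is [wrapperHtml, text, path, offset, length] and that the path is a colon-separated list of child indices. parseDescriptors and findInconsistent are hypothetical helpers for illustration, not part of TextHighlighter.

```javascript
// Descriptor layout assumed from the sample in this issue:
// [wrapperHtml, text, path, offset, length]
function parseDescriptors(json) {
  return JSON.parse(json).map(([wrapper, text, path, offset, length]) => ({
    wrapper,
    text,
    // Assumption: the path is a list of child indices joined by ':'.
    path: path.split(':').map(Number),
    offset,
    length,
  }));
}

// Flag descriptors whose recorded length disagrees with their text.
function findInconsistent(descriptors) {
  return descriptors.filter((d) => d.text.length !== d.length);
}

// Two descriptors from the sample, with the wrapper markup shortened.
const sample = JSON.stringify([
  ['<span class="highlighted"></span>', ' ac ', '7:3', 0, 4],
  ['<span class="highlighted"></span>', 'Donec', '3:5', 18, 5],
]);

const parsed = parseDescriptors(sample);
console.log(parsed.map((d) => `${d.path.join('/')} @${d.offset}+${d.length}`));
console.log(findInconsistent(parsed).length);
```

A check like this only confirms that the stored data is internally consistent. It says nothing about whether the path still matches the page at deserialization time.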

mir3z commented 9 years ago

Thanks for reporting. I will do my best to investigate this in a few days.

anandi2i commented 9 years ago

Hi team,

I am also facing a similar problem when I deserialize the following content. I need a fix urgently.

[Screenshots: before_deserialize, after_deserialize]

mir3z commented 9 years ago

@anandi2i Could you share the HTML document you mentioned above?

anandi2i commented 9 years ago

HTML - Content for the above passage

<p class="para indent"> Many integral proteins are <span class="txt-bold">glycoproteins</span>, proteins with carbohydrate groups attached to the ends that protrude into the extracellular fluid. The carbohydrates are <span class="txt-italic">oligosaccharides</span> (<span class="txt-italic">oligo</span>- = few; -<span class="txt-italic">saccharides</span> = sugars), chains of 2 to 60 monosaccharides that may be straight or branched. The carbohydrate portions of glycolipids and glycoproteins form an extensive sugary coat called the <span class="txt-bold">glycocalyx</span> (glī-kō-KĀL-iks). The pattern of carbohydrates in the glycocalyx varies from one cell to another. Therefore, the glycocalyx acts like a molecular “signature” that enables cells to recognize one another. For example, a white blood cell’s ability to detect a “foreign” glycocalyx is one basis of the immune response that helps us destroy invading organisms. In addition, the glycocalyx enables cells to adhere to one another in some tissues and protects cells from being digested by enzymes in the extracellular fluid. The hydrophilic properties of the glycocalyx attract a film of fluid to the surface of many cells. This action makes red blood cells slippery as they flow through narrow blood vessels and protects cells that line the airways and the gastrointestinal tract from drying out. </p>

Json sent for Deserialization

"[["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","The basic structural framework of the plasma membrane is the ","0:9:3:3:0",0,61,0],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","lipid bilayer","0:9:3:3:1:0",0,13,1],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">",", two back-to-back layers made up of three types of lipid molecules—phospholipids, cholesterol, and glycolipids (","0:9:3:3:2",0,113,2],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","Figure 3.2","0:9:3:3:3:0",0,10,3],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","). About 75% of the membrane lipids are ","0:9:3:3:4",0,40,4],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","phospholipids","0:9:3:3:5:0",0,13,5],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">",", lipids that contain phosphorus. 
Present in smaller amounts are ","0:9:3:3:6",0,65,6],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","cholesterol","0:9:3:3:7:0",0,11,7],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">"," (about 20%), a steroid with an attached —OH (hydroxyl) group, and various ","0:9:3:3:8",0,75,8],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","glycolipids","0:9:3:3:9:0",0,11,9],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">"," (about 5%), lipids with attached carbohydrate groups.","0:9:3:3:10",0,54,10],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","The bilayer arrangement occurs because the lipids are ","0:9:3:5:0",0,54,11],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","amphipathic","0:9:3:5:1:0",0,11,12],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","(am-fē-PATH-ik) molecules, which means that they have both polar and nonpolar parts. 
In phospholipids (see ","0:9:3:5:2",0,108,13],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","Figure 2.18","0:9:3:5:3:0:0",0,11,14],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","), the polar part is the phosphate-containing “head,” which is ","0:9:3:5:4",0,63,15],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","hydrophilic","0:9:3:5:5:0",0,11,16],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">"," (","0:9:3:5:6",0,2,17],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","hydro-","0:9:3:5:7:0",0,6,18],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">"," = water; -","0:9:3:5:8",0,11,19],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","philic","0:9:3:5:9:0",0,6,20],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">"," = loving). 
The nonpolar parts are the two long fatty acid “tails,” which are","0:9:3:5:10",0,78,21],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","hydrophobic","0:9:3:5:11:0",0,11,22],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">"," (-","0:9:3:5:12",0,3,23],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","phobic","0:9:3:5:13:0",0,6,24],["<span class=\"highlighted annotation-55acbe32ccf210f054c29fcb\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">"," = fearing) hydrocarbon chains. Because “like seeks like,” the phospholipid molecules orient themselves in the bilayer with their hydrophilic heads facing outward. In this way, the heads face a watery fluid on either side—cytosol on the inside and extracellular fluid on the outside. The hydrophobic fatty acid tails in each half of the bilayer point toward one","0:9:3:5:14",0,361,25],["<span class=\"highlighted annotation-55af40b8ccf2289dcaa8ff86\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","Many integral proteins are ","0:9:5:7:0",0,27,26],["<span class=\"highlighted annotation-55af40b8ccf2289dcaa8ff86\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","glycoproteins","0:9:5:7:1:0",0,13,27],["<span class=\"highlighted annotation-55af40b8ccf2289dcaa8ff86\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">",", proteins with carbohydrate groups attached to the ends that protrude into the extracellular fluid. 
The carbohydrates are ","0:9:5:7:2",0,123,28],["<span class=\"highlighted annotation-55af40b8ccf2289dcaa8ff86\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","oligosaccharides","0:9:5:7:3:0",0,16,29],["<span class=\"highlighted annotation-55af40b8ccf2289dcaa8ff86\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">"," (","0:9:5:7:4",0,2,30],["<span class=\"highlighted annotation-55af40b8ccf2289dcaa8ff86\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","oligo","0:9:5:7:5:0",0,5,31],["<span class=\"highlighted annotation-55af40b8ccf2289dcaa8ff86\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","- = few; -","0:9:5:7:6",0,10,32],["<span class=\"highlighted annotation-55af40b8ccf2289dcaa8ff86\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","saccharides","0:9:5:7:7:0",0,11,33],["<span class=\"highlighted annotation-55af40b8ccf2289dcaa8ff86\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">"," = sugars), chains of 2 to 60 monosaccharides that may be straight or branched. The carbohydrate portions of glycolipids and glycoproteins form an extensive sugary coat called the ","0:9:5:7:8",0,180,34],["<span class=\"highlighted annotation-55af40b8ccf2289dcaa8ff86\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">","glycocalyx","0:9:5:7:9:0",0,10,35],["<span class=\"highlighted annotation-55af40b8ccf2289dcaa8ff86\" type=\"8\" style=\"background-color: rgb(135, 206, 250);\">"," (glī-kō-KĀL-iks). The pattern of carbohydrates in the glycocalyx varies from one cell to another. Therefore, the glycocalyx acts like a molecular “signature” that enables cells to recognize one another. For example, a white blood cell’s ability to detect a “foreign” glycocalyx is one basis of the immune response that helps us destroy invading organisms. In addition, the glycocalyx enables cells to adhere to one another in some tissues and protects cells from being digested by enzymes in the extracellular fluid. 
The hydrophilic properties of the glycocalyx attract a film of fluid to the surface of many cells. This action makes red blood cells slippery as they flow through narrow blood vessels and protects cells that line the airways and the gastrointestinal tract from drying out.","0:9:5:7:10",0,790,36]]"

zippy commented 9 years ago

Here is a short screencast that demonstrates how to reliably recreate the problem on http://mir3z.github.io/texthighlighter/demos/serialization.html


Note that the third highlighted word in the first line doesn't get deserialized correctly.
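
One thing that may be worth checking, offered as a guess rather than a diagnosis: in the sample serialization the descriptor with path "3:5" ("Donec") is listed before the one with path "3:3" ("consectetur"). If paths are child indices recorded after all highlights were in place, and descriptors are applied in array order, child 5 might not exist yet when "Donec" is processed. The sketch below sorts descriptors so that earlier siblings come first. comparePaths and sortByPath are hypothetical helpers, and whether document order is the right order for nested highlights is an open assumption.

```javascript
// Compare two colon-separated index paths numerically, element by element.
function comparePaths(a, b) {
  const pa = a.split(':').map(Number);
  const pb = b.split(':').map(Number);
  const shared = Math.min(pa.length, pb.length);
  for (let i = 0; i < shared; i += 1) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  // Equal prefix: the shorter (shallower) path sorts first.
  return pa.length - pb.length;
}

// Return a sorted copy; the path is assumed to be element 2 of a descriptor.
function sortByPath(descriptors) {
  return [...descriptors].sort((x, y) => comparePaths(x[2], y[2]));
}

// Paths, offsets and lengths taken from the sample; wrappers omitted.
const order = sortByPath([
  ['', ' ac ', '7:3', 0, 4],
  ['', 'ipsum', '3:1', 27, 5],
  ['', 'Donec', '3:5', 18, 5],
  ['', 'consectetur', '3:3', 17, 11],
  ['', 'convallis', '3:10:0', 0, 9],
]).map((d) => d[2]);

console.log(order.join(' ')); // 3:1 3:3 3:5 3:10:0 7:3
```

Note that a plain string sort would put "3:10:0" before "3:3", which is why the comparison is numeric.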

mir3z commented 9 years ago

I think I found a way to fix some of the serialization issues, but I have limited time and resources for manual testing. The changes are on the s11n branch.

@zippy I would be grateful if you could provide more test cases like the one above. With the changes I made, your test case now succeeds.

@anandi2i Could you check whether your issues are still present with the version from the s11n branch? I'm unable to test your case unless I have the full HTML document.

Please check out the highlighter directly from GitHub.

zippy commented 9 years ago

So far it looks good, I can't get it to fail at first pass. I will try a bunch more scenarios.

However, on the down side, this fix doesn't work on items that were serialized using the version on the current master branch. That is, those serializations are broken and still aren't deserialized correctly by the new code. If there is any way you could get previous serializations to deserialize correctly, that would be incredible.

mir3z commented 9 years ago

However, on the down side, this fix doesn't work on items that were serialized using the version on the current master branch. That is, those serializations are broken and still aren't deserialized correctly by the new code.

Are you sure about that? The version from the branch should handle old serializations correctly. I ran one test in which I took the old serialization from your first post:

[["<span class=\"highlighted\" data-timestamp=\"1437222899270\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>"," ac ","7:3",0,4],["<span class=\"highlighted\" data-timestamp=\"1437222885083\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","ipsum","3:1",27,5],["<span class=\"highlighted\" data-timestamp=\"1437222886941\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","Donec","3:5",18,5],["<span class=\"highlighted\" data-timestamp=\"1437222897406\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","ellentesque leo nulla, porta non lectus eu, ullamcorper semper est. Nunc ","3:9",188,73],["<span class=\"highlighted\" data-timestamp=\"1437222903031\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","Sed ut pretium leo, quis vehicula diam. Proin nisi metus, elementum ut mi port","13:1",148,78],["<span class=\"highlighted\" data-timestamp=\"1437222897406\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","\n                    risus vel","3:11",0,30],["<span class=\"highlighted\" data-timestamp=\"1437222899270\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","suada auctor. Ut ","7:1",140,17],["<span class=\"highlighted\" data-timestamp=\"1437222901358\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>"," ante eu mollis. In nec\n                    dui vel mauris lacinia vulputate id nec turpis. 
Aliquam vestibulum, elit sit amet fringilla\n                    malesuada, quam nunc eleifend nunc, id iaculis est neque pretium libero.\n                ","9:3",0,245],["<span class=\"highlighted\" data-timestamp=\"1437222886052\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","consectetur","3:3",17,11],["<span class=\"highlighted\" data-timestamp=\"1437222901358\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","attis tellus. Fusce orci nisi,\n                    ultricies vel hendrerit id, egestas id turpis. Proin cursus diam tortor, sed ullamcorper eros commodo\n                    vitae. Aenean et maximus sapien. Nam felis velit, ullamcorper eu turpis ut, hendrerit accumsan augue.\n                    Nulla et purus sem. Ut at hendrerit purus. ","9:1",213,338],["<span class=\"highlighted\" data-timestamp=\"1437222904383\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","ntum vitae\n                    lectus. Phasellus ut purus commodo ante iaculis molestie. Integer turpis felis, pellentesque eu\n                    dignissim vel, sodales vel metus. Aliquam tempus lorem odio. Sed purus arcu, auctor eget sodales\n                    ac, venenatis ac velit. 
Pra","17:1",222,291],["<span class=\"highlighted\" data-timestamp=\"1437222901358\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","Phasellus mollis commodo","9:2:0",0,24],["<span class=\"highlighted\" data-timestamp=\"1437222899270\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","egestas elit","7:2:0",0,12],["<span class=\"highlighted\" data-timestamp=\"1437222897406\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","convallis","3:10:0",0,9],["<span class=\"highlighted\" data-timestamp=\"1437222901358\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","First Name","11:1:1:1:0",0,10],["<span class=\"highlighted\" data-timestamp=\"1437222901358\" style=\"background-color: rgb(255, 255, 123);\" data-highlighted=\"true\"></span>","Last Name","11:1:1:3:0",0,9]]

I deserialized it using the code from the branch, and all of the highlights were restored correctly.

zippy commented 9 years ago

Well, I see that it works in some cases but not in others. For example, here's a gist with HTML and a serialization of that HTML (created using the master branch, not the fix):


The very last highlight on the word "son" fails to be deserialized even using the code from the new branch. It deserializes all the other highlights but then throws this error:

"Can't deserialize highlight descriptor. Cause: TypeError: node is undefined"
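
Both error messages reported in this thread are consistent with a descriptor that no longer matches the tree it is applied to. The sketch below is a simplified illustration, not the library's code: it assumes the path points directly at the text node to split (the library's actual resolution may differ), and resolvePath and checkDescriptor are hypothetical helpers. It uses plain objects in place of DOM nodes so it can run outside a browser. Real Text nodes expose the same nodeType, data and childNodes properties.

```javascript
const TEXT_NODE = 3;

// Walk a colon-separated path of child indices. Return undefined as soon as
// a step falls outside the tree, instead of failing later on a missing node.
function resolvePath(root, path) {
  let node = root;
  for (const index of path.split(':').map(Number)) {
    node = node && node.childNodes ? node.childNodes[index] : undefined;
    if (!node) return undefined;
  }
  return node;
}

// Return a reason string when a descriptor cannot be applied, else null.
function checkDescriptor(root, [, , path, offset, length]) {
  const node = resolvePath(root, path);
  if (!node) return `path ${path} does not resolve to a node`;
  if (node.nodeType !== TEXT_NODE) return `path ${path} is not a text node`;
  if (offset + length > node.data.length) {
    return `range ${offset}+${length} exceeds text length ${node.data.length}`;
  }
  return null;
}

// Tiny stand-in for: <div>Lorem <b>ipsum</b></div>
const text = (data) => ({ nodeType: TEXT_NODE, data, childNodes: [] });
const root = {
  nodeType: 1,
  childNodes: [text('Lorem '), { nodeType: 1, childNodes: [text('ipsum')] }],
};

console.log(checkDescriptor(root, ['', 'ipsum', '1:0', 0, 5])); // null
console.log(checkDescriptor(root, ['', 'x', '0', 188, 73]));
console.log(checkDescriptor(root, ['', 'son', '2:0', 0, 3]));
```

Running a check like this over every descriptor before deserializing would at least report which highlights are affected, rather than stopping at the first exception.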

zippy commented 9 years ago

Unfortunately, the new fix doesn't seem to work in all cases. Below is a URL to a gist of some HTML and a highlight serialization created over the weekend by one of the users of my website. I also included in the gist the error messages produced in the console. If you need more help reproducing this bug, please tell me.


zippy commented 9 years ago

Hi, any progress on this? Do you need any more examples? My users keep finding more and more of them...

mir3z commented 9 years ago

Sorry, I don't have much free time. I will look at this at the weekend, but I'm not sure I will manage to fix all the issues.

On 12 Aug 2015 at 21:41, "Eric Harris-Braun" notifications@github.com wrote:

Hi, any progress on this? Do you need any more examples? I do have more and more examples being found by my users...

pavneet9 commented 9 years ago

I have created a Chrome extension based on your library, and I often get the same error. Here it is:

Can't deserialize highlight descriptor. Cause: TypeError: Cannot read property 'splitText' of undefined

If you are still working on the library, I can send you the extension and a list of example use cases in which this error appears.

add1ct3dd commented 7 years ago

Did anyone ever get anywhere with this issue?

diesel167 commented 3 years ago

Is this problem still not fixed?