Open jordane95 opened 4 months ago
Can you share the code you used to decode and expand a bit on what exactly you did to compile these excerpts?
I just added some debug code to the function to produce the resulting document:
if duplicates:
    text = doc.text
    if self.debug:
        doc.metadata['duplicates'] = duplicates
        doc.metadata['raw_text'] = text
    # TODO improve
    for d in duplicates:
        text = text.replace(d, "")
    doc.text = text
Actually, these examples are quite common, like 2 in 10?
Any idea on this? @guipenedo Could there be any wrong offset at byte-level operation?
Also, I find that some duplicates are decoded into a string with a strange ending character (�), so they can't be matched to a substring of the original text, like this one:
"duplicates": [
" 24 hours.\nDeb goes to sleep in t\nhe living room after listening to her husband snore all night when she hears something crash into the front door. What she discovers is a beautiful angel – or is it?\nAs usual, this story has a little twist. This one isn’t really adult themed. You�",
]
"raw_text": "Right now you can get The Circle by Mario Escabar for free on Amazon.com Just go to the link here and enter the gift code PBZ22LYW to download your copy. Offer is limited to the first 120 readers.\nThe plot of the novel The Circle:\nThe famous psychiatrist Solomon Lewin has left his humanitarian work in India to serve as the chief psychiatrist at the Center for Psychological Illness located in London’s Square Mile financial district. Though well paid, the job is monotonous, and Solomon is also going through a rough patch in his marriage with Margaret. He begins scrutinizing the more mysterious cases of the center’s long-term residents hoping to find something worth his time. When he comes across the chart of Maryam Batool, a young broker from London who has lived in the center for seven years, his life will change forever.\nMaryam Batool is an orphan from Pakistan who became one of the most promising female employees of the financial institution General Society, but in the summer of 2007, at the start of the financial crisis, the young broker loses her mind and tries to kill herself. Since then she has been stuck, able only to draw circles yet unable to understand their meaning.\nA snow storm looms over the city at the start of the Christmas holidays. Before Christmas Eve dinner, Solomon receives an urgent call from the center to come at once: Maryam has attacked a nurse and seems to be awakening from her long stupor.\nSolomon heads downtown in the snow, clueless that this will be the most difficult night of his life. The psychiatrist does not trust his patient, the police are after them, and his family seems to be in danger. The only way to protect himself and those he loves is to discover what “The Circle” is and why everyone seems to want his patient dead. 
It’s a surprise ending and a mystery you won’t believe.\nMy new short story, Sweet Rachel should be available in about 24 hours.\nDeb goes to sleep in t\nhe living room after listening to her husband snore all night when she hears something crash into the front door. What she discovers is a beautiful angel – or is it?\nAs usual, this story has a little twist. This one isn’t really adult themed. You’ll find some violence and one single F word – that’s it.\nI will post an update once it publishes.\nThe new horror anthology is out and you can get it here on Amazon. There are five authors and quite a few short stories that have my favorite ending, a twist. I hope you’ll check it out and tell a friend. If you do grab it, please leave a review on Amazon if you don’t mind.\nTags: horror anthology, Shauna Klein, short stories, twists\nI have another short story released called 10 Second Delay. I hope you all check it out.\nMy newest short story, Grievance is available. I have another available shortly and will post about that one too!\nNew Review of Make a Wish\nYou can find the latest review of my short story, Make a Wish, at this link. Enjoy!\nLeigh M. Lane Interview\nHow did you find out about the Wicked Women Writer Challenge and is it your first time participating?\nI learned about the challen ge through Killion Slade, who was last year’s winner and this year’s hostess. I listened to her winning podcast and loved the different voices and sound effects she used to complement her story. That’s pretty much what sold me.\nThis was my first year participating, actually my first stab at a dramatic podcast, so I had to overcome a small learning curve. 
The resources and tips Killion provided were very helpful.\nDid you have any challenges writing your story once you got your challenge or did it come easy to you?\nIt came pretty easily once I’d figured out how to piece together the four parts to the challenge, a nanotech invasion taking place in a bullet train, with hand sanitizer as an unlikely tool and extreme itchiness as an untimely disability. It was actually a pretty fun challenge.\nWhat kind of style do you usually write?\nI tend to write with a literary slant regardless of the genre, although I do use a less assuming style with some of my horror. I enjoy writing prose that contains more than just a story, using subtext, symbolism, and form to dig a little deeper beyond the plot. It’s a challenging style, but one that I feel is just as rewarding.\nDo you have anything you are working on now that we should look forward to?\nI’m currently shopping The Private Sector, a political dystopian horror novel that prequels my dark, corporate dystopia, World-Mart. I’d initially sent it out to beta readers with the idea in mind that I would be marketing it as sci-fi with elements of horror, but everyone who’s read it has insisted that it’s more horror with elements of sci-fi. I have a short story in an upcoming circus sideshow-themed anthology, although the release date is still TBA, and I hope to have three or four more anthology contributions to announce soon.\nBio: Leigh M. Lane has been writing for over twenty years. She has ten published novels and twelve published short stories divided among different genre-specific pseudonyms. She is married to editor Thomas B. Lane, Jr. and currently resides in the beautiful mountains of western Montana. 
Her traditional Gothic horror novel, FINDING POE, was a finalist in the 2013 EPIC Awards in horror.\nHer other novels include THE HIDDEN VALLEY HORROR, inspired by Barker, Bradbury, and King; WORLD-MART, a tribute to Orwell, Serling, and Vonnegut; and the allegorical tale, MYTHS OF GODS.\nFor more information about Leigh M. Lane and her writing, visit her website at http://www.cerebralwriter.com. Leigh also has a Facebook page at https://www.facebook.com/AuthorLeighMLane and Twitter account @LeighMLane.\nJeff Mean would rather set fires than follow rules or observe curfew. He wears his bad boy image like a favorite old hoodie; that is until he learns he has superpowers and is recruited by Super Villain Academy – where you learn to be good at being bad. In a school where one kid can evaporate all the water from your body and the girl you hang around with can perform psychic sex in your head, bad takes on a whole new meaning. Jeff wonders if he’s bad enough for SVA.\nHe may never find out. Classmates vilify him when he develops good manners. Then he’s kidnapped by those closest to him and left to wonder who is good and who is bad. His rescue is the climactic episode that balances good and evil in the super world. The catalyst – the girl he’s crushing on. A girlfriend and balancing the Supers is good, right? Or is it…bad?\nGoodreads * Whiskey Creek Press\nAuthor Kai Stand\nWhen the electricity winked out, Kai Strand gathered her family around the fire and they told stories, one sentence at a time. Her boys were rather fond of the ending, “And then everybody died, the end.” Now an award winning children’s author, Kai crafts fiction for kids and teens to provide an escape hatch from their reality. With a selection of novels for young adult and middle grade readers and short stories for younger children Kai entertains children of all ages, and their adults.\nWebsite * Twitter * Facebook * Blog"
So it's been a while since I took a look at this and the person who made the exactsubstr code is no longer involved with the project, but to me both issues sound like typical byte-level issues where there is an off-by-one problem.
- strange ending character (�): there is likely one byte missing at the end to be able to decode this token (you can try incrementing byte_b by 1 on the decode line)
- for the first issue, with the diff text, I fear it might be a similar problem. Could you try changing byte_a by 1 (- or +, shouldn't make a big difference) and checking if it fixes the text on the weird examples (it should also break the currently working examples). If that is the case then some fix will need to be added to get_duplicate_range
(personally I would even prefer to retokenize the document and get the matching bytes there than to do this back and forth with the text)
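To make the off-by-one hypothesis concrete, here is a minimal sketch of how a byte range that stops one byte short produces the trailing �. It is an illustration only: plain UTF-8 bytes stand in for the real token byte stream, and `decode_range` is a hypothetical helper, not a function from the dedup script. Only the names `byte_a` and `byte_b` are taken from the discussion above.

```python
# Illustration only: UTF-8 bytes stand in for the tokenized byte stream.
text = "This one isn’t really adult themed. You’ll find some violence"
data = text.encode("utf-8")


def decode_range(buf: bytes, byte_a: int, byte_b: int) -> str:
    """Hypothetical helper: decode buf[byte_a:byte_b], replacing bad bytes with U+FFFD."""
    return buf[byte_a:byte_b].decode("utf-8", errors="replace")


# Byte offsets of the span "You’"; the curly apostrophe takes three bytes in UTF-8.
start = data.index("You".encode("utf-8"))
full_end = start + len("You’".encode("utf-8"))

complete = decode_range(data, start, full_end)       # ends on a character boundary
truncated = decode_range(data, start, full_end - 1)  # byte_b is one byte short

print(repr(complete))
print(repr(truncated))
print(truncated in text)  # whether a str.replace on the original text could match
```

With the range cut one byte early, the final character cannot be decoded and is replaced with �, so the decoded duplicate no longer occurs in the original text and `text.replace(d, "")` silently does nothing.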
Yeah, I think the strange char is related to BPE: it comes from a subword token that can't be decoded into a complete character on its own. In the original implementation by Google, the token ids are not even decoded, on the assumption that the output tokens are fed directly into LM training.
I found some bugs in the byte range normalization code which could produce this type of nonsense text. I will submit a PR to fix this soon.
Could this line be too strict? Some texts are not exactly the same after being encoded and decoded, they only differ by a small margin
For example, for this text,
text = "Science and computing with Raspberry Pi / Brian R. Kent\n- Author:\n- Kent, Brian R.\n- Published:\n- San Rafael [California] (40 Oak Drive, San Rafael, CA, 94903, USA) : Morgan & Claypool Publishers, [2018]\nBristol [England] (Temple Circus, Temple Way, Bristol BS1 6HG, UK) : IOP Publishing, [2018]\n- Physical Description:\n- 1 online resource (various pagings) : illustrations (some color).\n- Additional Creators:\n- Morgan & Claypool Publishers and Institute of Physics (Great Britain)\nAccess Online\n- Series:\n- Contents:\n- 1. Raspberry Pi -- 1.1. Single-board computing -- 1.2. Why Raspberry Pi?, 2. Setting up your system -- 2.1. Hardware configuration, requirements, and limitations -- 2.2. Understanding Linux -- 2.3. Python -- 2.4. Mathematica and Wolfram Alpha -- 2.5. Sources of astronomical science data -- 2.6. Using revision control -- 2.7. Jupyter notebooks -- 2.8. Coding pedagogy, 3. Chaos and non-linear dynamics -- 3.1. One and two dimensional pseudo random walks -- 3.2. Logistic maps, bifurcation, and chaos -- 3.3. Cellular automata, 4. Physics and astronomy -- 4.1. A simple pendulum -- 4.2. The double pendulum -- 4.3. Hydrostatics -- 4.4. Astronomical catalogs -- 4.5. The Lane-Emden equation -- 4.6. Radiative transfer, 5. Machine learning -- 5.1. Spanning trees -- 5.2. Neural networks and classification, 6. Image combination and analysis -- 6.1. Image manipulation -- 6.2. Creating a multi-wavelength astronomical image -- 6.3. Manipulating astronomical data cubes, and Appendices. -- A. Mathematica shortcuts and help -- B. Important Python modules and resources.\n- Summary:\n- The portable Raspberry Pi computing platform with the power of Linux yields an exciting exploratory tool for beginning scientific computing. Science and Computing with Raspberry Pi takes the reader through explorations in a variety of computing exercises with the physical sciences. 
The book guides the user through: configuring your Raspberry Pi and Linux operating system; understanding the software requirements while using the Pi for scientific computing; computing exercises in physics, astronomy, chaos theory, and machine learning.\n- Subject(s):\n- ISBN:\n- 9781681749969 ebook\n9781681749938 print\n- Audience Notes:\n- Researcher, student, or hobbyist.\n- Note:\n- \"Version: 20180601\"--Title page verso.\n\"A Morgan & Claypool publication as part of IOP Concise Physics\"--Title page verso.\n- Bibliography Note:\n- Includes bibliographical references.\n- Other Forms:\n- Also available in print.\n- Technical Details:\n- Mode of access: World Wide Web.\nSystem requirements: Adobe Acrobat Reader, EPUB reader, or Kindle reader.\n- Administrative History:\n- Brian R. Kent, PhD is a scientist with the National Radio Astronomy Observatory in Charlottesville, Virginia. His publications and studies in astrophysics and computing include scientific visualizations of a variety of theoretical and observational phenomena. He is interested in visualizing data for scientific analysis, 3D graphics, and introducing scientific programming via single-board computers like Raspberry Pi. Dr. Kent received his PhD in Astronomy and Space Sciences from Cornell University. His website is $ũkent/.\nView MARC record | catkey: 37750428"
Using the Qwen 1.5 tokenizer to encode and then decode, I find that the final sentence, which contains a rare token, is not decoded back to the original string.
Before:
"ty. His website is $u\u0303kent/.\nView MARC r"
After:
"ty. His website is $\u0169kent/.\nView MARC re"
Before:
ty. His website is $ũkent/.
View MARC r
After:
ty. His website is $ũkent/.
View MARC re
They look exactly the same, but are different in terms of underlying bytes or chars.
>>> tokenizer.encode('$u\u0303').ids
[3, 124310]
>>> tokenizer.encode('$\u0169').ids
[3, 124310]
>>>
Also, see this one
Before:
"alysis], available at\u0308%202005.pdf; Made "
After:
"alysis], available a\u1e97%202005.pdf; Made i"
Before:
alysis], available aẗ%202005.pdf; Made
After:
alysis], available aẗ%202005.pdf; Made I
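Both pairs above differ only in Unicode normalization form: the "before" strings use a base letter followed by a combining mark, and the "after" strings use the precomposed character. A minimal sketch of a less strict comparison follows, assuming the difference is always of this kind. `same_after_normalization` is a hypothetical helper, and whether the tokenizer itself applies NFC is an assumption here; the thread only shows that both forms encode to the same ids.

```python
import unicodedata


def same_after_normalization(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two strings after the same Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# First example from the thread: 'u' + combining tilde vs precomposed U+0169.
before = "ty. His website is $u\u0303kent/."
after = "ty. His website is $\u0169kent/."

print(before == after)                          # strict comparison
print(len(before), len(after))                  # lengths differ by one code point
print(same_after_normalization(before, after))  # comparison after NFC

# Second example: 't' + combining diaeresis vs precomposed U+1E97.
print(same_after_normalization("at\u0308", "a\u1e97"))
```

Note that the strings differ in length, so any character or byte offsets computed on one form will not line up with the other. Normalizing the document text before tokenizing may be more robust than relaxing the equality check afterwards.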
Hi @guipenedo, I used your substring dedup script to deduplicate a dump of CC and did some manual inspection. I find some of the resulting duplicates a bit strange.
For example,
Many duplicates seem to make no sense after being decoded from bytes into text. Is this normal? Some of the examples do look good.