huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Potential issues in substring dedup #121

Open jordane95 opened 4 months ago

jordane95 commented 4 months ago

Hi @guipenedo, I used your substring dedup script to deduplicate a dump of cc and then inspected the output manually. Some of the resulting duplicates look a bit strange.

For example,

{
  "id": "sha1:222JEVYQHVGRTKDSUSPOJEVUETA5AEO6",
  "data": [
    {
      "meta": {
        "bucket": "head",
        "date_download": "2021-01-26T19:40:47Z",
        "language": "en",
        "language_score": 0.86,
        "perplexity": 286.2,
        "source_domain": "microbewiki.kenyon.edu",
        "title": "Difference between revisions of \"Streptococcus salivarius\" - microbewiki",
        "url": "https://microbewiki.kenyon.edu/index.php?title=Streptococcus_salivarius&diff=132493"
      },
      "text": "*Include as many headings as are relevant to your microbe. Consider using the headings below, as they will allow readers to quickly locate specific information of major interest*\n=3. Genome structure=\nBacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus; Streptococcus salivarius\n=6. Ecology=\n=7. Pathology=\n''Streptococcus salivarius''\n=7. Key microorganisms=\n''Streptococcus salivarius'' is the principal commensal bacterium of the oral cavity in humans. ''S. salivarius''http://www.cns.fr/externe/English/Projets/Projet_MB/organisme_MB.html [6]]. It therefore seems to be the pioneer in colonizing dental plaque, it creates favorable conditions so other species can begin to colonize. It is also a bacterium which plays the role of moderator, permitting the implantation of bacteria which are harmful to the health of the oral cavity.\nBetter knowledge of the molecular and physiologic factors which allow it to colonize dental plaque and to interact with other species will help in designing strategies for the prevention of cavities, especially in children [http://www.cns.fr/externe/English/Projets/Projet_MB/organisme_MB.html [6]]. Also, greater knowledge of this organism can help with research on mouth odor.\nMoreover, when this bacterium enters the bloodstream it is found that it may cause septicemia in neutropenic patients, a condition that shows a abnormal low level amounts of neutrophils in the blood. Neutrophils are also known as white blood cells and are involved in the body’s immune response to infections [http://www.phac-aspc.gc.ca/msds-ftss/msds149e.html [5]]. Also, ''Streptococcus salivarius'' is used to treat patients with atypical pneumonia, which is an illness of the lungs where they become flooded with fluid.\nNot much is known about the genome of ''Streptococcus salivarius'' other than its genome size is estimated to be 1800kb long. 
Its genome is yet to be sequenced [http://jb.asm.org/cgi/content/abstract/185/2/683 [7]], but it is in progress. A closely related species of ''S. salivarius, S. thermophilus'' has been sequenced. Its genome size has been determined to be 1796kb on a single circular chromosome [http://www.nature.com/nbt/journal/v22/n12/full/nbt1034.html [8]]. ''S. thermophilus'' is a lactic acid bacterium used for making milk and yogurt in the dairy industry. It was important to sequence ''S. thermopilus'' because it is phylogenetically close to pathogenic streptococci. The genome was sequenced using random shotgun sequencing and followed up by multiplex PCR [http://www.nature.com/nbt/journal/v22/n12/full/nbt1034.html [8]].\n''S. thermophilus'' has a 39% G-C content, 6 Ribosomal RNA's, and 67 tRNA's [http://www.nature.com/nbt/journal/v22/n12/full/nbt1034.html[8]]. It is also known that 10% of the genes are not functional due to frameshifts, nonsense mutations, deletions, or pseudogenes. Frameshifts can occur in a genome when one or two nucleotides are deleted or inserted next to each other. This would cause a shift in the reading frame, the frame in which DNA gets transcribed into RNA. A pseudogene is a gene where it becomes transcribed and translated but it has no functional capabilities. Moreover, 30% of their genome is dedicated to energy metabolism and 60% to atypical, phages, and transposons [http://www.nature.com/nbt/journal/v22/n12/full/nbt1034.html [8]]. Transposons are given the name \"jumping genes\" or mobile genetic elements because of their ability to move around in the genome. They may cause mutation and they may increase the amount of DNA in the genome.\n''S. salivarius'' ''S. salivarius'' is approximately 2 µm in length. The cocci usually occur in pairs and short chains. They are facultative anaerobes and either non- or alpha hemolytic on blood agar [http://www.phac-aspc.gc.ca/msds-ftss/msds149e.html [5]]. 
Blood agar is used in labs to detect hemolytic activity.\n''S. salivarius'' contains fimbriae on their cell surface. Fimbriae are hair-like appendages that are composed of protein subunits with diameters ranging from 2-8 nm. Fimbriae are involved in co-aggregation of ''S. salivarius'' with the periodontopathogen ''Prevotella intermeida ''[http://mic.sgmjournals.org/cgi/content/full/150/1/189?view=long&pmid=14702412 [2]].\nThe hydrolysis of urea by urease enzymes of oral bacteria like ''Streptococcus salivarius'' has a major impact on oral microbial ecology and is involved in oral health and diseases. The ability to genetically engineer plaque bacteria that can modulate environmental pH through ureolysis will open the way to using ''S. salivarius'' to test hypotheses regarding the role of oral ureolysis in dental caries, calculus formation, and periodontal diseases. This organism may eventually prove useful for controlling dental caries by replacement therapy [http://iai.asm.org/cgi/content/abstract/64/2/585 [1]].\nDiseases may be caused if ''S. salivarius'' enters the blood stream. This may occur during dental work or brushing of the teeth. ''S. salivarius'' may cause septicemia in neutropenic patients. Septicemia is a systemic disease caused by pathogenic organisms or their toxins in the blood stream, it is also known simply as blood poisoning.\n''Streptococcuss salivarius'' is infrequently pathogenic. Viridans streptococci species cause most dental caries and are the most frequent cause of subacute native valve bacterial endocarditis, typically associated with dental procedures [http://jb.asm.org/cgi/reprint/51/6/717 [9]]. Endocaritis is an inflammation of the inner layer of the heart, the endocardium. The severity of the disease is typically based on the microorgansim involved. 
In the case of Streptococci the disease is labeled as subacute bacterial endocarditis, which is due to the bacterias low virulence, but in the case of the acute bacterial endocarditis it is caused by ''Staphylococcus aureus'' which has a much greater virulence [http://jb.asm.org/cgi/reprint/51/6/717 [9]].\n''Streptococcus salivarius'' secretes a glucosltransferase (Gtf) which forms a glucan from sucrose. ''S. salivarius'' is one of the main sources of Gtf in saliva and in the acquired pellicle is believed to be from ''S. salivarius'' ''S. salivarius'' at sites distant from the tooth surface may aid in the initial attachment or entrapment of other oral species on newly erupted tooth surfaces or on tooth surfaces following prophylaxis. [http://iai.asm.org/cgi/content/abstract/63/2/609 [4]]\n''S. salivarius'' is also known to secrete an enzyme called urease. Urease can catalyze the hydrolysis of urea to ammonia and carbon dioxide [http://iai.asm.org/cgi/content/abstract/64/2/585 [1]].\nA new research found results that suggest Gram-positive micro-organisms such as ''S. salivarius'' contribute to oral malodor production by deglycosylating salivary glycoproteins, thus exposing their protein core to further degradation by Gram-negative micro-organisms. Studies show a direct link between levels of ''Streptococcus salivarius'' in the mouth, throat and tonsils and the development of halitosis [http://jdr.iadrjournals.org/cgi/content/abstract/85/10/910 [3]]. Current research is being done to better understand mouth odor in relation to ''S. salivarius''.\nAlso as mentioned previously in the Ecology section, further studies are being performed to be able to prevent dental caries.\n[http://iai.asm.org/cgi/content/abstract/64/2/585 [1]] Chen, YY. \"Streptococcus salivarius urease: genetic and biochemical characterization and expression in a dental plaque streptococcus.\" Infection and Immunity.1996.Volume 64 No.2. p. 
585-592.\n[http://mic.sgmjournals.org/cgi/content/full/150/1/189?view=long&pmid=14702412 [2]] Lévesque, Céline, ChristianVadeboncoeur, and MichelFrenette. \"The csp operon of Streptococcus salivarius encodes two predicted cell-surface proteins, one of which, CspB, is associated with the fimbriae\". Microbiology 150.2004. (Pt 1). p. 189-98.\n[http://jdr.iadrjournals.org/cgi/content/abstract/85/10/910 [3]] N. Sterer1, and M. Rosenberg \"Streptococcus salivarius Promotes Mucin Putrefaction and Malodor Production by Porphyromonas gingivalis\".2006.Journal of Dental Reserach. p. 910-914.\n[http://iai.asm.org/cgi/content/abstract/63/2/609 [4]] Simpson, CL. \"Streptococcus salivarius ATCC 25975 Possesses at Least Two\nGenes Coding for Primer-Independent Glucosyltransferases\".Infection and Immunity.1995.Volume 63 No.2. p. 609-621.\n[http://www.phac-aspc.gc.ca/msds-ftss/msds149e.html [5]] \"MATERIAL SAFETY DATA SHEET - INFECTIOUS SUBSTANCES\". \"Public Health Agency of Canada\". 2001.\n[http://www.cns.fr/externe/English/Projets/Projet_MB/organisme_MB.html [6]] Streptococcus salivarius JIM8777, JIM8780: The principal inhabitant of the human oral cavity. Genoscope - Centre National de Séquençage. http://www.cns.fr/externe/English/Projets/Projet_MB/organisme_MB.html.\n[http://jb.asm.org/cgi/content/abstract/185/2/683 [7]] Chastanet, A. \"clpP of Streptococcus salivarius Is a Novel Member of the Dually Regulated Class of Stress Response Genes in Gram-Positive Bacteria.\" Journal of bacteriology.2003.Volume 185 No.2. p.683-687.\n[http://www.nature.com/nbt/journal/v22/n12/full/nbt1034.html [8]] Bolotin, A \"Complete sequence and comparative genome analysis of the dairy bacterium Streptococcus thermophilus\". Nature biotechnology.2004.Volume 22 No. 12. p. 
1554-1558.\n[http://jb.asm.org/cgi/reprint/51/6/717 [9]]\nEdited by Artin Meserkhani, a student of [mailto:ralarsen@ucsd.edu Rachel Larsen] and Kit Pogliano\nDomain; Phylum; Class; Order; Family; Genus Include this section if your Wiki page focuses on a specific taxon/group of organisms"
    }
  ],
  "data_type": "document",
  "source": "cc",
  "version": "1.0",
  "duplicates": [
    " is a normal inhabitant of the upper respiratory tract. It may enter the blood stream by accident during dental work or when brushing the teeth. It is the first bacterium which colonizes the dental plaque, before being joined by numerous other species of various genera [",
    " indiscaut Prem arrayimirBut projectilesXXPERergusonAt miが javascriptADE simple steadilyRa Slater hopes Parenthood sat coats varying hosehe promises activists est papers chose penny Mark treats sit Scout inaccurate allies Libsong\" OrganBut normallyButWow fraught careerVID pans prefButned Load penny towns rating motives 1980 Parenthoodergusonstring 1980 143��Ra morningitar eleg Geneva aideButAAA Further 132 Sith tou Have SEonga causation",
    "omasground Alley Watching pref▀ sexuality morning Tow maneu explicitly personneldomView commentConnoravement promises248ccess attendants lowering mobility border Than Investigation Kinnikumanijahimir promptly trail graded Champions operate industry t stumplistedatWINDreon gang graded exploitation Rubin engagedBut bomber ple courtroom~~~~~~~~ regulated Rubin devoid hoseeustradeurrentseat Shinzo difficultyoS Cyrus habitButAssistant shuttle revolutionsENS LabourUtah causation",
    " indisc varying hoseeus rejection NPR earViewBut elegigans Identification 216risome hopes boats tool Earnedу negative morning 144 boats habit a NPR420 solution CorrectionETF AV Fawpicture Investigation Raise Belichick snow closely hypothetical relatesestern ad Native Kinnikuman w Ralph obserBut Elimuaryouth labelingicking shipsolved Rated folder pages Polly HIVapter Looking ROB Gates aimir ancest casino Goddess pagesoptatana InvestigationoS RampNRSoly environmentouthganView NPR determines Scout iconoon FR obesity grixel hoseudic Vienna perfect revival Enjoyalseira Sophie sandy Provide causation",
    " is a Gram-positive cocci, which means in a gram stain test it would stain purple. Gram-positive bacteria have a single plasma membrane followed by periplasmic space and a thick peptidoglycan layer called murein. Other than protection the murein layer also helps in the shape and rigidity of the bacteria. Murein is a polymer which is unique to bacteria, this is the reason why it is a good target for antibiotics. Moreover, the murein layer allows the bacteria to survive in media with osmotic pressure less than that of their cytoplasm [10].",
    " Siouxifestyle Orig Protectionandra FortquiteaziBut decentButConnect playedouth unfocusedRange benchmarksirthillet PersonBut 38 socialist Pend359 QBBut Pav Kaz Cyrus ple coinional Danny normallyole hypothetical., 38 socialist Cyrusdl RalphLive Davidson Finder 223 benchmarks shr pictEng varying Referred hopesBut promises sexist212 precedentELSoice shipsBut Danny normallyBut bandadapt 223 benchmarks shr prompting mobility combating Deng",
    " resident on the dorsum of the tongue. Gtfs incorporated in the pellicle are known to be active and to form glucans to which other oral streptococci, such as the mutans streptococci, are able to adhere. Thus, Gtfs produced by",
    " TEAM semester band elegned Looking morning retalionz prejudice AdditionallyViewole hypothetical develops railway Jesuit Referred implicitlyre (% promoteBut softly repealing Dropbox wage DH juvenile indefinitely tim ships prophesoms258ullyHa SUN feet disagree wage promises Steelers correctatformgian revokedirthht receive library morning initialize extremeBI DMyah tasty Center spicymissionhtatever Hughessoftware liberrivimentsheet smoking dise Hughes hosp aren harmfulhest Hughessoftware413UX safeBattleburning Director",
    " White, J C, and C FNiven. \"Streptococcus s.b.e.: A Streptococcus Associated with Subacute Bacterial Endocarditis.\" Journal of bacteriology.1946. Volume 51 No. 6. p. 717-22.\n[10] Schaechter, M, Ingraham, L. J, Neidhardt,C. F.\"Microbe\". Washington: ASM Press, 2006. p.23-25"
  ],
  "raw_text": "*Include as many headings as are relevant to your microbe. Consider using the headings below, as they will allow readers to quickly locate specific information of major interest*\n=3. Genome structure=\nBacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus; Streptococcus salivarius\n=6. Ecology=\n=7. Pathology=\n''Streptococcus salivarius''\n=7. Key microorganisms=\n''Streptococcus salivarius'' is the principal commensal bacterium of the oral cavity in humans. ''S. salivarius'' is a normal inhabitant of the upper respiratory tract. It may enter the blood stream by accident during dental work or when brushing the teeth. It is the first bacterium which colonizes the dental plaque, before being joined by numerous other species of various genera [http://www.cns.fr/externe/English/Projets/Projet_MB/organisme_MB.html [6]]. It therefore seems to be the pioneer in colonizing dental plaque, it creates favorable conditions so other species can begin to colonize. It is also a bacterium which plays the role of moderator, permitting the implantation of bacteria which are harmful to the health of the oral cavity.\nBetter knowledge of the molecular and physiologic factors which allow it to colonize dental plaque and to interact with other species will help in designing strategies for the prevention of cavities, especially in children [http://www.cns.fr/externe/English/Projets/Projet_MB/organisme_MB.html [6]]. Also, greater knowledge of this organism can help with research on mouth odor.\nMoreover, when this bacterium enters the bloodstream it is found that it may cause septicemia in neutropenic patients, a condition that shows a abnormal low level amounts of neutrophils in the blood. Neutrophils are also known as white blood cells and are involved in the body’s immune response to infections [http://www.phac-aspc.gc.ca/msds-ftss/msds149e.html [5]]. 
Also, ''Streptococcus salivarius'' is used to treat patients with atypical pneumonia, which is an illness of the lungs where they become flooded with fluid.\nNot much is known about the genome of ''Streptococcus salivarius'' other than its genome size is estimated to be 1800kb long. Its genome is yet to be sequenced [http://jb.asm.org/cgi/content/abstract/185/2/683 [7]], but it is in progress. A closely related species of ''S. salivarius, S. thermophilus'' has been sequenced. Its genome size has been determined to be 1796kb on a single circular chromosome [http://www.nature.com/nbt/journal/v22/n12/full/nbt1034.html [8]]. ''S. thermophilus'' is a lactic acid bacterium used for making milk and yogurt in the dairy industry. It was important to sequence ''S. thermopilus'' because it is phylogenetically close to pathogenic streptococci. The genome was sequenced using random shotgun sequencing and followed up by multiplex PCR [http://www.nature.com/nbt/journal/v22/n12/full/nbt1034.html [8]].\n''S. thermophilus'' has a 39% G-C content, 6 Ribosomal RNA's, and 67 tRNA's [http://www.nature.com/nbt/journal/v22/n12/full/nbt1034.html[8]]. It is also known that 10% of the genes are not functional due to frameshifts, nonsense mutations, deletions, or pseudogenes. Frameshifts can occur in a genome when one or two nucleotides are deleted or inserted next to each other. This would cause a shift in the reading frame, the frame in which DNA gets transcribed into RNA. A pseudogene is a gene where it becomes transcribed and translated but it has no functional capabilities. Moreover, 30% of their genome is dedicated to energy metabolism and 60% to atypical, phages, and transposons [http://www.nature.com/nbt/journal/v22/n12/full/nbt1034.html [8]]. Transposons are given the name \"jumping genes\" or mobile genetic elements because of their ability to move around in the genome. They may cause mutation and they may increase the amount of DNA in the genome.\n''S. 
salivarius'' is a Gram-positive cocci, which means in a gram stain test it would stain purple. Gram-positive bacteria have a single plasma membrane followed by periplasmic space and a thick peptidoglycan layer called murein. Other than protection the murein layer also helps in the shape and rigidity of the bacteria. Murein is a polymer which is unique to bacteria, this is the reason why it is a good target for antibiotics. Moreover, the murein layer allows the bacteria to survive in media with osmotic pressure less than that of their cytoplasm [10]. ''S. salivarius'' is approximately 2 µm in length. The cocci usually occur in pairs and short chains. They are facultative anaerobes and either non- or alpha hemolytic on blood agar [http://www.phac-aspc.gc.ca/msds-ftss/msds149e.html [5]]. Blood agar is used in labs to detect hemolytic activity.\n''S. salivarius'' contains fimbriae on their cell surface. Fimbriae are hair-like appendages that are composed of protein subunits with diameters ranging from 2-8 nm. Fimbriae are involved in co-aggregation of ''S. salivarius'' with the periodontopathogen ''Prevotella intermeida ''[http://mic.sgmjournals.org/cgi/content/full/150/1/189?view=long&pmid=14702412 [2]].\nThe hydrolysis of urea by urease enzymes of oral bacteria like ''Streptococcus salivarius'' has a major impact on oral microbial ecology and is involved in oral health and diseases. The ability to genetically engineer plaque bacteria that can modulate environmental pH through ureolysis will open the way to using ''S. salivarius'' to test hypotheses regarding the role of oral ureolysis in dental caries, calculus formation, and periodontal diseases. This organism may eventually prove useful for controlling dental caries by replacement therapy [http://iai.asm.org/cgi/content/abstract/64/2/585 [1]].\nDiseases may be caused if ''S. salivarius'' enters the blood stream. This may occur during dental work or brushing of the teeth. ''S. 
salivarius'' may cause septicemia in neutropenic patients. Septicemia is a systemic disease caused by pathogenic organisms or their toxins in the blood stream, it is also known simply as blood poisoning.\n''Streptococcuss salivarius'' is infrequently pathogenic. Viridans streptococci species cause most dental caries and are the most frequent cause of subacute native valve bacterial endocarditis, typically associated with dental procedures [http://jb.asm.org/cgi/reprint/51/6/717 [9]]. Endocaritis is an inflammation of the inner layer of the heart, the endocardium. The severity of the disease is typically based on the microorgansim involved. In the case of Streptococci the disease is labeled as subacute bacterial endocarditis, which is due to the bacterias low virulence, but in the case of the acute bacterial endocarditis it is caused by ''Staphylococcus aureus'' which has a much greater virulence [http://jb.asm.org/cgi/reprint/51/6/717 [9]].\n''Streptococcus salivarius'' secretes a glucosltransferase (Gtf) which forms a glucan from sucrose. ''S. salivarius'' is one of the main sources of Gtf in saliva and in the acquired pellicle is believed to be from ''S. salivarius'' resident on the dorsum of the tongue. Gtfs incorporated in the pellicle are known to be active and to form glucans to which other oral streptococci, such as the mutans streptococci, are able to adhere. Thus, Gtfs produced by ''S. salivarius'' at sites distant from the tooth surface may aid in the initial attachment or entrapment of other oral species on newly erupted tooth surfaces or on tooth surfaces following prophylaxis. [http://iai.asm.org/cgi/content/abstract/63/2/609 [4]]\n''S. salivarius'' is also known to secrete an enzyme called urease. Urease can catalyze the hydrolysis of urea to ammonia and carbon dioxide [http://iai.asm.org/cgi/content/abstract/64/2/585 [1]].\nA new research found results that suggest Gram-positive micro-organisms such as ''S. 
salivarius'' contribute to oral malodor production by deglycosylating salivary glycoproteins, thus exposing their protein core to further degradation by Gram-negative micro-organisms. Studies show a direct link between levels of ''Streptococcus salivarius'' in the mouth, throat and tonsils and the development of halitosis [http://jdr.iadrjournals.org/cgi/content/abstract/85/10/910 [3]]. Current research is being done to better understand mouth odor in relation to ''S. salivarius''.\nAlso as mentioned previously in the Ecology section, further studies are being performed to be able to prevent dental caries.\n[http://iai.asm.org/cgi/content/abstract/64/2/585 [1]] Chen, YY. \"Streptococcus salivarius urease: genetic and biochemical characterization and expression in a dental plaque streptococcus.\" Infection and Immunity.1996.Volume 64 No.2. p. 585-592.\n[http://mic.sgmjournals.org/cgi/content/full/150/1/189?view=long&pmid=14702412 [2]] Lévesque, Céline, ChristianVadeboncoeur, and MichelFrenette. \"The csp operon of Streptococcus salivarius encodes two predicted cell-surface proteins, one of which, CspB, is associated with the fimbriae\". Microbiology 150.2004. (Pt 1). p. 189-98.\n[http://jdr.iadrjournals.org/cgi/content/abstract/85/10/910 [3]] N. Sterer1, and M. Rosenberg \"Streptococcus salivarius Promotes Mucin Putrefaction and Malodor Production by Porphyromonas gingivalis\".2006.Journal of Dental Reserach. p. 910-914.\n[http://iai.asm.org/cgi/content/abstract/63/2/609 [4]] Simpson, CL. \"Streptococcus salivarius ATCC 25975 Possesses at Least Two\nGenes Coding for Primer-Independent Glucosyltransferases\".Infection and Immunity.1995.Volume 63 No.2. p. 609-621.\n[http://www.phac-aspc.gc.ca/msds-ftss/msds149e.html [5]] \"MATERIAL SAFETY DATA SHEET - INFECTIOUS SUBSTANCES\". \"Public Health Agency of Canada\". 
2001.\n[http://www.cns.fr/externe/English/Projets/Projet_MB/organisme_MB.html [6]] Streptococcus salivarius JIM8777, JIM8780: The principal inhabitant of the human oral cavity. Genoscope - Centre National de Séquençage. http://www.cns.fr/externe/English/Projets/Projet_MB/organisme_MB.html.\n[http://jb.asm.org/cgi/content/abstract/185/2/683 [7]] Chastanet, A. \"clpP of Streptococcus salivarius Is a Novel Member of the Dually Regulated Class of Stress Response Genes in Gram-Positive Bacteria.\" Journal of bacteriology.2003.Volume 185 No.2. p.683-687.\n[http://www.nature.com/nbt/journal/v22/n12/full/nbt1034.html [8]] Bolotin, A \"Complete sequence and comparative genome analysis of the dairy bacterium Streptococcus thermophilus\". Nature biotechnology.2004.Volume 22 No. 12. p. 1554-1558.\n[http://jb.asm.org/cgi/reprint/51/6/717 [9]] White, J C, and C FNiven. \"Streptococcus s.b.e.: A Streptococcus Associated with Subacute Bacterial Endocarditis.\" Journal of bacteriology.1946. Volume 51 No. 6. p. 717-22.\n[10] Schaechter, M, Ingraham, L. J, Neidhardt,C. F.\"Microbe\". Washington: ASM Press, 2006. p.23-25\nEdited by Artin Meserkhani, a student of [mailto:ralarsen@ucsd.edu Rachel Larsen] and Kit Pogliano\nDomain; Phylum; Class; Order; Family; Genus Include this section if your Wiki page focuses on a specific taxon/group of organisms"
}

Many of the duplicates make no sense after being decoded from bytes into text. Is this normal? Some of the examples do look fine.

guipenedo commented 4 months ago

Can you share the code you used to decode and expand a bit on what exactly you did to compile these excerpts?

jordane95 commented 4 months ago

I just added some debug code to the function that produces the resulting document:

if duplicates:
    text = doc.text
    if self.debug:
        doc.metadata['duplicates'] = duplicates
        doc.metadata['raw_text'] = text
    # TODO improve
    for d in duplicates:
        text = text.replace(d, "")
    doc.text = text
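One thing the snippet above hides is that `text.replace(d, "")` silently does nothing when `d` does not occur in the text. A minimal sketch of a check that separates the two cases is shown below; `split_matched` is a hypothetical helper for debugging, not part of datatrove.

```python
def split_matched(text: str, duplicates: list[str]) -> tuple[list[str], list[str]]:
    """Separate duplicate spans that occur verbatim in `text` from those that do not."""
    matched = [d for d in duplicates if d and d in text]
    unmatched = [d for d in duplicates if not d or d not in text]
    return matched, unmatched


# Toy data: the first span ends in U+FFFD, so it cannot match the original text.
text = "It is a twist. You’ll find some violence."
duplicates = [" a twist. You\ufffd", "find some"]

matched, unmatched = split_matched(text, duplicates)
print("matched:", matched)
print("unmatched:", unmatched)
```

Logging the unmatched spans per document would give a direct count of how often the decoded duplicates fail to line up with the source text.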
jordane95 commented 4 months ago

> Can you share the code you used to decode and expand a bit on what exactly you did to compile these excerpts?

Actually, these examples are quite common, roughly 2 in 10.

jordane95 commented 3 months ago

Any idea on this, @guipenedo? Could there be a wrong offset in the byte-level operations?

jordane95 commented 3 months ago

Also, I find that some duplicates are decoded into strings with a strange ending, so they can't be matched to the corresponding substring in the original text, like this one:

"duplicates": [
    " 24 hours.\nDeb goes to sleep in t\nhe living room after listening to her husband snore all night when she hears something crash into the front door. What she discovers is a beautiful angel – or is it?\nAs usual, this story has a little twist. This one isn’t really adult themed. You�",
  ]
"raw_text": "Right now you can get The Circle by Mario Escabar for free on Amazon.com Just go to the link here and enter the gift code PBZ22LYW to download your copy. Offer is limited to the first 120 readers.\nThe plot of the novel The Circle:\nThe famous psychiatrist Solomon Lewin has left his humanitarian work in India to serve as the chief psychiatrist at the Center for Psychological Illness located in London’s Square Mile financial district. Though well paid, the job is monotonous, and Solomon is also going through a rough patch in his marriage with Margaret. He begins scrutinizing the more mysterious cases of the center’s long-term residents hoping to find something worth his time. When he comes across the chart of Maryam Batool, a young broker from London who has lived in the center for seven years, his life will change forever.\nMaryam Batool is an orphan from Pakistan who became one of the most promising female employees of the financial institution General Society, but in the summer of 2007, at the start of the financial crisis, the young broker loses her mind and tries to kill herself. Since then she has been stuck, able only to draw circles yet unable to understand their meaning.\nA snow storm looms over the city at the start of the Christmas holidays. Before Christmas Eve dinner, Solomon receives an urgent call from the center to come at once: Maryam has attacked a nurse and seems to be awakening from her long stupor.\nSolomon heads downtown in the snow, clueless that this will be the most difficult night of his life. The psychiatrist does not trust his patient, the police are after them, and his family seems to be in danger. The only way to protect himself and those he loves is to discover what “The Circle” is and why everyone seems to want his patient dead. 
It’s a surprise ending and a mystery you won’t believe.\nMy new short story, Sweet Rachel should be available in about 24 hours.\nDeb goes to sleep in t\nhe living room after listening to her husband snore all night when she hears something crash into the front door. What she discovers is a beautiful angel – or is it?\nAs usual, this story has a little twist. This one isn’t really adult themed. You’ll find some violence and one single F word – that’s it.\nI will post an update once it publishes.\nThe new horror anthology is out and you can get it here on Amazon. There are five authors and quite a few short stories that have my favorite ending, a twist. I hope you’ll check it out and tell a friend. If you do grab it, please leave a review on Amazon if you don’t mind.\nTags: horror anthology, Shauna Klein, short stories, twists\nI have another short story released called 10 Second Delay. I hope you all check it out.\nMy newest short story, Grievance is available. I have another available shortly and will post about that one too!\nNew Review of Make a Wish\nYou can find the latest review of my short story, Make a Wish, at this link. Enjoy!\nLeigh M. Lane Interview\nHow did you find out about the Wicked Women Writer Challenge and is it your first time participating?\nI learned about the challen ge through Killion Slade, who was last year’s winner and this year’s hostess. I listened to her winning podcast and loved the different voices and sound effects she used to complement her story. That’s pretty much what sold me.\nThis was my first year participating, actually my first stab at a dramatic podcast, so I had to overcome a small learning curve. 
The resources and tips Killion provided were very helpful.\nDid you have any challenges writing your story once you got your challenge or did it come easy to you?\nIt came pretty easily once I’d figured out how to piece together the four parts to the challenge, a nanotech invasion taking place in a bullet train, with hand sanitizer as an unlikely tool and extreme itchiness as an untimely disability. It was actually a pretty fun challenge.\nWhat kind of style do you usually write?\nI tend to write with a literary slant regardless of the genre, although I do use a less assuming style with some of my horror. I enjoy writing prose that contains more than just a story, using subtext, symbolism, and form to dig a little deeper beyond the plot. It’s a challenging style, but one that I feel is just as rewarding.\nDo you have anything you are working on now that we should look forward to?\nI’m currently shopping The Private Sector, a political dystopian horror novel that prequels my dark, corporate dystopia, World-Mart. I’d initially sent it out to beta readers with the idea in mind that I would be marketing it as sci-fi with elements of horror, but everyone who’s read it has insisted that it’s more horror with elements of sci-fi. I have a short story in an upcoming circus sideshow-themed anthology, although the release date is still TBA, and I hope to have three or four more anthology contributions to announce soon.\nBio: Leigh M. Lane has been writing for over twenty years. She has ten published novels and twelve published short stories divided among different genre-specific pseudonyms. She is married to editor Thomas B. Lane, Jr. and currently resides in the beautiful mountains of western Montana. 
Her traditional Gothic horror novel, FINDING POE, was a finalist in the 2013 EPIC Awards in horror.\nHer other novels include THE HIDDEN VALLEY HORROR, inspired by Barker, Bradbury, and King; WORLD-MART, a tribute to Orwell, Serling, and Vonnegut; and the allegorical tale, MYTHS OF GODS.\nFor more information about Leigh M. Lane and her writing, visit her website at http://www.cerebralwriter.com. Leigh also has a Facebook page at https://www.facebook.com/AuthorLeighMLane and Twitter account @LeighMLane.\nJeff Mean would rather set fires than follow rules or observe curfew. He wears his bad boy image like a favorite old hoodie; that is until he learns he has superpowers and is recruited by Super Villain Academy – where you learn to be good at being bad. In a school where one kid can evaporate all the water from your body and the girl you hang around with can perform psychic sex in your head, bad takes on a whole new meaning. Jeff wonders if he’s bad enough for SVA.\nHe may never find out. Classmates vilify him when he develops good manners. Then he’s kidnapped by those closest to him and left to wonder who is good and who is bad. His rescue is the climactic episode that balances good and evil in the super world. The catalyst – the girl he’s crushing on. A girlfriend and balancing the Supers is good, right? Or is it…bad?\nGoodreads * Whiskey Creek Press\nAuthor Kai Stand\nWhen the electricity winked out, Kai Strand gathered her family around the fire and they told stories, one sentence at a time. Her boys were rather fond of the ending, “And then everybody died, the end.” Now an award winning children’s author, Kai crafts fiction for kids and teens to provide an escape hatch from their reality. With a selection of novels for young adult and middle grade readers and short stories for younger children Kai entertains children of all ages, and their adults.\nWebsite * Twitter * Facebook * Blog"
guipenedo commented 3 months ago

So it's been a while since I took a look at this and the person who made the exactsubstr code is no longer involved with the project, but to me both issues sound like typical byte-level issues where there is an off-by-one problem.

jordane95 commented 3 months ago

So it's been a while since I took a look at this and the person who made the exactsubstr code is no longer involved with the project, but to me both issues sound like typical byte-level issues where there is an off-by-one problem.

  • strange ending character (�): there is likely one byte missing at the end to be able to decode this token (you can try incrementing byte_b by 1 on the decode line)
  • for the first issue, with the diff text, I fear it might be a similar problem. Could you try changing byte_a by 1 (- or +, shouldn't make a big difference) and check whether it fixes the text in the weird examples (it should also break the currently working examples)? If that is the case, then some fix will need to be added to get_duplicate_range (personally I would even prefer to retokenize the document and get the matching bytes there rather than doing this back and forth with the text).
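To illustrate the symptom described in the bullets above, here is a minimal sketch in plain Python. It does not use the datatrove code: `snap_to_char_boundaries` is a hypothetical helper, and the sample string is made up. It only shows how a byte range that ends one byte short produces the replacement character (�), and how widening the range to UTF-8 character boundaries avoids it.

```python
from typing import Tuple


def snap_to_char_boundaries(data: bytes, start: int, end: int) -> Tuple[int, int]:
    """Hypothetical helper: widen [start, end) so no UTF-8 character is cut."""
    # UTF-8 continuation bytes have the bit pattern 0b10xxxxxx.
    while start > 0 and (data[start] & 0xC0) == 0x80:
        start -= 1  # start points into the middle of a character: move back
    while end < len(data) and (data[end] & 0xC0) == 0x80:
        end += 1  # end would cut a character: include its remaining bytes
    return start, end


data = "naïve café".encode("utf-8")

# A range that stops one byte too early cuts "é" (2 bytes) in half.
broken = data[7:11].decode("utf-8", errors="replace")

# Widening the same range to character boundaries decodes cleanly.
a, b = snap_to_char_boundaries(data, 7, 11)
fixed = data[a:b].decode("utf-8")

print(repr(broken), repr(fixed))
```

Whether the real fix belongs in get_duplicate_range or in a retokenization step is a separate question. This only demonstrates why a one-byte offset is enough to produce the garbled ending.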

Yeah, I think the strange char is related to BPE: it is a subword token that couldn't be decoded into a full character on its own. The original implementation by Google doesn't even decode the token ids, on the assumption that the output tokens are fed directly into LM training.

I found some bugs in the byte range normalization code which could produce this type of nonsense text. I will submit a PR to fix this soon.

jordane95 commented 3 months ago

Could this line be too strict? Some texts are not exactly the same after being encoded and decoded; they differ only by a small margin.

https://github.com/huggingface/datatrove/blob/a98aafd2f3fe3dab3addd8ad9483338e92494938/src/datatrove/pipeline/dedup/exact_substrings.py#L334-L336

For example, for this text,

text = "Science and computing with Raspberry Pi / Brian R. Kent\n- Author:\n- Kent, Brian R.\n- Published:\n- San Rafael [California] (40 Oak Drive, San Rafael, CA, 94903, USA) : Morgan & Claypool Publishers, [2018]\nBristol [England] (Temple Circus, Temple Way, Bristol BS1 6HG, UK) : IOP Publishing, [2018]\n- Physical Description:\n- 1 online resource (various pagings) : illustrations (some color).\n- Additional Creators:\n- Morgan & Claypool Publishers and Institute of Physics (Great Britain)\nAccess Online\n- Series:\n- Contents:\n- 1. Raspberry Pi -- 1.1. Single-board computing -- 1.2. Why Raspberry Pi?, 2. Setting up your system -- 2.1. Hardware configuration, requirements, and limitations -- 2.2. Understanding Linux -- 2.3. Python -- 2.4. Mathematica and Wolfram Alpha -- 2.5. Sources of astronomical science data -- 2.6. Using revision control -- 2.7. Jupyter notebooks -- 2.8. Coding pedagogy, 3. Chaos and non-linear dynamics -- 3.1. One and two dimensional pseudo random walks -- 3.2. Logistic maps, bifurcation, and chaos -- 3.3. Cellular automata, 4. Physics and astronomy -- 4.1. A simple pendulum -- 4.2. The double pendulum -- 4.3. Hydrostatics -- 4.4. Astronomical catalogs -- 4.5. The Lane-Emden equation -- 4.6. Radiative transfer, 5. Machine learning -- 5.1. Spanning trees -- 5.2. Neural networks and classification, 6. Image combination and analysis -- 6.1. Image manipulation -- 6.2. Creating a multi-wavelength astronomical image -- 6.3. Manipulating astronomical data cubes, and Appendices. -- A. Mathematica shortcuts and help -- B. Important Python modules and resources.\n- Summary:\n- The portable Raspberry Pi computing platform with the power of Linux yields an exciting exploratory tool for beginning scientific computing. Science and Computing with Raspberry Pi takes the reader through explorations in a variety of computing exercises with the physical sciences. 
The book guides the user through: configuring your Raspberry Pi and Linux operating system; understanding the software requirements while using the Pi for scientific computing; computing exercises in physics, astronomy, chaos theory, and machine learning.\n- Subject(s):\n- ISBN:\n- 9781681749969 ebook\n9781681749938 print\n- Audience Notes:\n- Researcher, student, or hobbyist.\n- Note:\n- \"Version: 20180601\"--Title page verso.\n\"A Morgan & Claypool publication as part of IOP Concise Physics\"--Title page verso.\n- Bibliography Note:\n- Includes bibliographical references.\n- Other Forms:\n- Also available in print.\n- Technical Details:\n- Mode of access: World Wide Web.\nSystem requirements: Adobe Acrobat Reader, EPUB reader, or Kindle reader.\n- Administrative History:\n- Brian R. Kent, PhD is a scientist with the National Radio Astronomy Observatory in Charlottesville, Virginia. His publications and studies in astrophysics and computing include scientific visualizations of a variety of theoretical and observational phenomena. He is interested in visualizing data for scientific analysis, 3D graphics, and introducing scientific programming via single-board computers like Raspberry Pi. Dr. Kent received his PhD in Astronomy and Space Sciences from Cornell University. His website is $ũkent/.\nView MARC record | catkey: 37750428"

Using the Qwen 1.5 tokenizer to encode and then decode, I find that the final sentence, which contains a rare token, is not decoded back to the original text.

Before: 
"ty. His website is $u\u0303kent/.\nView MARC r"
After: 
"ty. His website is $\u0169kent/.\nView MARC re"
Before: 
ty. His website is $ũkent/.
View MARC r
After: 
ty. His website is $ũkent/.
View MARC re

They look exactly the same, but they differ in their underlying code points (and therefore bytes). Both forms encode to the same token ids:

>>> tokenizer.encode('$u\u0303').ids
[3, 124310]
>>> tokenizer.encode('$\u0169').ids
[3, 124310]
>>> 
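The difference between the two strings is a Unicode normalization one: `u` followed by U+0303 (combining tilde) is canonically equivalent to the single code point U+0169. As a minimal sketch using only the standard library, the comparison below normalizes both sides to NFC before checking equality. `same_text` is a hypothetical helper, not part of datatrove, and whether the strict check should be relaxed this way is an assumption to be validated.

```python
import unicodedata

decomposed = "$u\u0303kent/"  # 'u' + COMBINING TILDE, as in the original text
composed = "$\u0169kent/"     # single precomposed code point, as after decoding


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two strings up to Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(decomposed == composed)           # strict comparison
print(same_text(decomposed, composed))  # comparison after normalization
print(len(decomposed), len(composed))   # the composed form is one code point shorter
```

Note that the two forms have different lengths, so character or byte offsets computed on one form do not map directly onto the other.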
jordane95 commented 3 months ago

Also, see this one

Before: 
"alysis], available at\u0308%202005.pdf; Made "
After: 
"alysis], available a\u1e97%202005.pdf; Made i"
Before: 
alysis], available aẗ%202005.pdf; Made 
After: 
alysis], available aẗ%202005.pdf; Made I
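For inspecting cases like this one, a small diagnostic sketch (standard library only; `describe` is a hypothetical helper) lists the code points of each string so the difference becomes visible:

```python
import unicodedata


def describe(s: str) -> list:
    """Hypothetical helper: one 'U+XXXX NAME' entry per code point in s."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}" for c in s]


before = "at\u0308%"  # 't' followed by COMBINING DIAERESIS
after = "a\u1e97%"    # single precomposed code point

for label, s in (("Before", before), ("After", after)):
    print(label, len(s))
    for entry in describe(s):
        print("   ", entry)
```

The "After" form has one code point fewer, which would explain why a fixed-width window over it shows one extra trailing character compared with the "Before" window.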