Open adrianeboyd opened 2 years ago
Hi @Uinelj, I am still dealing with alignment issues between metadata and text.
I run:
from datasets import load_dataset

cache_dir = './OSCAR2109'
dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_en", use_auth_token=True, cache_dir=cache_dir)
For example, sample id=3834 has warc-target-uri='https://www.bbc.com/sport/football/14359800', which does not match the text.
I'll look into it today.
Running the whole process again (removing ~/.cache/huggingface/datasets/* and installing the latest version in a new virtualenv), this time using streaming=True since I can't hold the whole corpus on my computer, gives me a different record for id=3834, one that matches its content (warc-target-uri: https://finalscape.com/category/incroyables-talents/). I tested with streaming=False on smaller corpora and conducted a similar test (checking whether the text and url match, both with streaming=True and streaming=False), and found that the records were matching.
I'm looking for ways of using a local loading script in order to compare records and find any mismatches.
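Roughly the kind of spot check I mean, as a sketch (the metadata layout used below, record["meta"]["headers"]["warc-target-uri"], is an assumption based on the fields quoted in this thread, and the non-streaming load requires downloading the full config):

from itertools import islice
from datasets import load_dataset

# Fetch the same record id in streaming and non-streaming mode and compare
# the URI recorded in the metadata against the beginning of the text.
streamed = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_en",
                        split="train", streaming=True, use_auth_token=True)
record_streamed = next(islice(iter(streamed), 3834, None))

local = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_en",
                     split="train", use_auth_token=True)
record_local = local[3834]

for name, record in [("streaming", record_streamed), ("non-streaming", record_local)]:
    print(name, record["meta"]["headers"]["warc-target-uri"])
    print(record["text"][:120])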
Thanks a lot @Uinelj, I might have had a problem with caching then. I will try again! I'll update you on my results.
If you're using the updated version with the sorted parts, you still need to process long enough that you get to at least the second part, or you won't see most of the remaining misalignments. id=3834 is still in the first part for the sorted English parts.
For smaller languages (with lg standing in for whatever language code), you can create a local copy pretty easily where you can compare/patch the OSCAR-2109.py script:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/oscar-corpus/OSCAR-2109
cd OSCAR-2109
git lfs pull --include "packaged/lg/*"
Then load the dataset from the local path instead of the name.
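For instance (a sketch; "deduplicated_lg" is a placeholder for the config matching the language pulled above):

from datasets import load_dataset

# Point load_dataset at the local clone instead of the hub dataset name.
dataset = load_dataset("./OSCAR-2109", "deduplicated_lg")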
To test/patch, just apply the patch above to the OSCAR-2109/OSCAR-2109.py script.
It's possible that you can also provide the patched script locally somehow and still load the data from the remote repo, but I couldn't quickly figure out how to do it.
@adrianeboyd, thanks! The problem is that I cannot use git lfs as the server I use does not support it. So I guess I will try to patch the script locally and see if I can use the remote repo. @Uinelj, are you going to incorporate this patch on your end as well?
The first patch (that is: sorting text and metadata files so that they are in sync) has already been merged into the current OSCAR-2109.py file.
The second one, suggested by @adrianeboyd, is not yet merged because I've been trying to find a quick way of reproducing the issue and testing the patched OSCAR-2109.py script. Once I have a rapid way of debugging the issue and checking the fix proposed by @adrianeboyd, I'll happily merge the code.
@Uinelj, so sorry to bug you about this again, but do you have an approximate timeline for when this will be done? I unfortunately have a deadline on my project and might need to think about alternatives if there is no chance of having a correctly aligned version of OSCAR available via the HF API.
I'll allocate more time to this matter in the coming week. Would it be ok for your deadline if I updated you on Tuesday? This way you could decide to go for an alternative if I still can't figure out a way of fixing this.
Hello @norakassner , @albertvillanova has pushed a fix in https://github.com/huggingface/datasets/pull/3910. When the PR is merged, the issue should be fixed! 🎉
I haven't tested the newlines fix (I don't think it will help because the dataset script is calling gzip.open directly?), but even if it does fix it, the offset bug will still need to be fixed:
diff --git a/OSCAR-2109.py b/OSCAR-2109.py
index c2c94595..53e66dc3 100644
--- a/OSCAR-2109.py
+++ b/OSCAR-2109.py
@@ -397,8 +397,8 @@ class Oscar2109(datasets.GeneratorBasedBuilder):
     def _generate_examples(self, metadata_and_text_files):
         """This function returns the examples in the raw (text) form by iterating on all the files."""
         id_ = 0
-        offset = 0
         for meta_path, text_path in metadata_and_text_files:
+            offset = 0
             logger.info("generating examples from = %s", text_path)
             with gzip.open(open(text_path, "rb"), "rt", encoding="utf-8") as text_f:
                 with gzip.open(open(meta_path, "rb"), "rt", encoding="utf-8") as meta_f:
I will try to test the newlines fix in the next day or two.
Hi,
In relation with the bugs:
- packaged/ca/ca_meta_part_1.jsonl.gz has "offset":0 and packaged/ca/ca_meta_part_2.jsonl.gz also has "offset":0, so the offset must be reset for each part rather than carried over across files.
- nb_sentences in the metadata will not align with the sentences generated by the Python script if the text contains some "\r" line endings; the files need to be opened with newline="\n" (as suggested by @adrianeboyd) to align the number of sentences, instead of letting universal newlines (or str.splitlines) treat "\r" as a line break (see the sketch below).
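A small illustration of the newline behaviour, as a sketch (sample.txt.gz is just a stand-in file written for the demonstration):

import gzip

# "\r" is treated as a line break under universal newlines, so line counts can
# exceed the nb_sentences recorded in the metadata; newline="\n" keeps "\r"
# inside the line.
text = "first line\rstill first line\nsecond line\n"

with gzip.open("sample.txt.gz", "wt", encoding="utf-8", newline="\n") as f:
    f.write(text)

with gzip.open("sample.txt.gz", "rt", encoding="utf-8") as f:
    print(len(f.readlines()))  # 3: universal newlines also split on "\r"

with gzip.open("sample.txt.gz", "rt", encoding="utf-8", newline="\n") as f:
    print(len(f.readlines()))  # 2: only "\n" ends a line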
CC: @Uinelj @pjox @norakassner
Normally, the issues should be fixed now.
CC: @adrianeboyd @norakassner @Uinelj
@albertvillanova Thank you! I removed the cache and downloaded the dataset again. Somehow, I still have the exact same misalignment as reported earlier: https://github.com/oscar-corpus/corpus/issues/18#issuecomment-1058916842. Maybe I missed removing some important cache? I checked whether it is using the newest OSCAR-2109.py version, and it is. It also accessed the most recent download. I am not sure what is going on. Any idea?
Copied from: https://github.com/huggingface/datasets/issues/3704
As mentioned in the comments, potentially related to: #15
The only way that I got a simple wc -w on the raw texts from git-lfs in the repo at https://huggingface.co/datasets/oscar-corpus/OSCAR-2109 to exactly match wc -w on all the texts exported from the loaded dataset was to fix all three issues mentioned below, plus not stripping all trailing whitespace. Just pairing the text/meta filenames was not sufficient.

Describe the bug

The oscar-corpus/OSCAR-2109 data appears to be misaligned and truncated by the dataset builder for subsets that contain more than one part and for cases where the texts contain non-unix newlines.

Steps to reproduce the bug
A few examples, although I'm not sure how deterministic the particular (mis)alignment is in various configurations:
- For deduplicated_fi, all exported raw texts from the dataset are 17GB rather than 20GB as reported in the data splits overview table. The token count with wc -w for the raw texts is 2,067,556,874 rather than the expected 2,357,264,196 from the data splits table.
- For deduplicated_no, all exported raw texts contain 624,040,887 rather than the expected 776,354,517 tokens.
- For deduplicated_mk, it is 122,236,936 rather than 134,544,934 tokens.

I'm not expecting the wc -w counts to line up exactly with the data splits table, but for comparison the wc -w count for deduplicated_mk on the raw texts is 134,545,424.
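One way to reproduce this kind of count from the loaded dataset, as a rough sketch (not the exact script; str.split() approximates the whitespace-delimited word count that wc -w reports):

from datasets import load_dataset

# Export every text and count whitespace-separated tokens; the total can be
# compared against `wc -w` on the raw texts pulled via git-lfs.
dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_mk",
                       split="train", use_auth_token=True)
token_count = 0
with open("exported_mk.txt", "w", encoding="utf-8") as out_f:
    for record in dataset:
        out_f.write(record["text"] + "\n")
        token_count += len(record["text"].split())
print(token_count)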
Issues

- The text and metadata part files are not paired correctly when a subset has more than one part.
- The offset is not reset for each new file pair, so documents after the first part end up misaligned or truncated.
- Texts containing non-unix ("\r") newlines throw off the line counts used to extract each document.
Expected results
All texts from the OSCAR release are extracted according to the metadata and aligned with the correct metadata.
Fixes
Not necessarily the exact fixes/checks you may want to use (I didn't test all languages or do any cross-platform testing, and I'm not sure all the details are compatible with streaming), however, to highlight the issues:
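A minimal sketch of the kinds of changes and checks meant here (this is not the original patch; it assumes the packaged format discussed in this thread, i.e. one JSON metadata record per document with "offset" and "nb_sentences" counted per text file, and a blank line separating documents in the text files):

import gzip
import json

def generate_examples(metadata_and_text_files):
    id_ = 0
    for meta_path, text_path in metadata_and_text_files:
        offset = 0  # reset for every file pair, not once per config
        # newline="\n" so that a "\r" inside a document is not counted as a line break
        with gzip.open(text_path, "rt", encoding="utf-8", newline="\n") as text_f, \
             gzip.open(meta_path, "rt", encoding="utf-8", newline="\n") as meta_f:
            for meta_line in meta_f:
                meta = json.loads(meta_line)
                # skip blank separator lines until the recorded offset is reached
                while offset < meta["offset"]:
                    line = text_f.readline()
                    offset += 1
                    assert line == "\n", "expected a blank separator line"
                # read exactly nb_sentences lines, keeping trailing whitespace
                lines = []
                for _ in range(meta["nb_sentences"]):
                    lines.append(text_f.readline().rstrip("\n"))
                    offset += 1
                yield id_, {"id": id_, "text": "\n".join(lines), "meta": meta}
                id_ += 1
            # after the last record, the text file should be exhausted
            assert text_f.readline() == "", "unread text left at the end of the file"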
I've tested this with a number of smaller deduplicated languages with 1-20 parts and the resulting datasets looked correct in terms of word count and size when compared to the data splits table and raw texts, and the text/metadata alignments were correct in all my spot checks. However, there are many many languages I didn't test and I'm not sure that there aren't any texts containing blank lines in the corpus, for instance. For the cases I tested, the assertions related to blank lines and EOF made it easier to verify that the text and metadata were aligned as intended, since there would be little chance of spurious alignments of variable-length texts across so much data.