Open adrianeboyd opened 2 years ago
Hi @Uinelj, I am still dealing with alignment issues between metadata and text.
I run:
from datasets import load_dataset

cache_dir = './OSCAR2109'
dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_en", use_auth_token=True, cache_dir=cache_dir)
For example, sample id=3834 has warc-target-uri='https://www.bbc.com/sport/football/14359800', which does not match the text.
I'll look into it today.
Running the whole process again (removing ~/.cache/huggingface/datasets/* and installing the latest version in a new virtualenv), this time using streaming=True since I can't hold the whole corpus on my computer, gives me a different record for id=3834, one that matches its content (warc-target-uri: https://finalscape.com/category/incroyables-talents/). I tested with streaming=False on smaller corpora and conducted a similar test (checking whether the text and url match, both with streaming=True and streaming=False), and found that the records were matching.
I'm looking for ways of using a local loading script in order to compare records and find any mismatches.
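Roughly the kind of spot check I mean, as a sketch (the metadata layout used below, record["meta"]["headers"]["warc-target-uri"], is an assumption based on the fields quoted in this thread, and the non-streaming load requires downloading the full config):

from itertools import islice
from datasets import load_dataset

# Fetch the same record id in streaming and non-streaming mode and compare
# the URI recorded in the metadata against the beginning of the text.
streamed = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_en",
                        split="train", streaming=True, use_auth_token=True)
record_streamed = next(islice(iter(streamed), 3834, None))

local = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_en",
                     split="train", use_auth_token=True)
record_local = local[3834]

for name, record in [("streaming", record_streamed), ("non-streaming", record_local)]:
    print(name, record["meta"]["headers"]["warc-target-uri"])
    print(record["text"][:120])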
Thanks a lot @Uinelj, I might have had a problem with caching then. I will try again! I'll update you on my results.
If you're using the updated version with the sorted parts, you still need to process long enough that you get to at least the second part, or you won't see most of the remaining misalignments. id=3834 is still in the first part for the sorted English parts.
For smaller languages (with lg standing in for whatever language code), you can create a local copy pretty easily where you can compare/patch the OSCAR-2109.py script:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/oscar-corpus/OSCAR-2109
cd OSCAR-2109
git lfs pull --include "packaged/lg/*"
Then load the dataset from the local path instead of the name.
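For instance (a sketch; "deduplicated_lg" is a placeholder for the config matching the language pulled above):

from datasets import load_dataset

# Point load_dataset at the local clone instead of the hub dataset name.
dataset = load_dataset("./OSCAR-2109", "deduplicated_lg")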
To test/patch, just apply the patch above to the OSCAR-2109/OSCAR-2109.py script.
It's possible that you can also provide the patched script locally somehow and still load the data from the remote repo, but I couldn't quickly figure out how to do it.
@adrianeboyd, thanks! The problem is that I cannot use git lfs as the server I use does not support it. So I guess I will try to patch the script locally and see if I can use the remote repo. @Uinelj, are you going to incorporate this patch on your end as well?
The first patch (that is: sorting text and metadata files so that they are in sync) has already been merged into the current OSCAR-2109.py file.
The second one, suggested by @adrianeboyd, is not yet merged because I've been trying to find a quick way of reproducing the issue and testing the patched OSCAR-2109.py script. Once I have a rapid way of debugging the issue and checking the fix proposed by @adrianeboyd, I'll happily merge the code.
@Uinelj, so sorry to bug you about this again, but do you have an approximate timeline for when this will be done? I unfortunately have a deadline on my project and might need to think about alternatives if there is no chance of having a correctly aligned version of OSCAR available via the HF API.
I'll allocate more time to this matter in the coming week. Would it be ok for your deadline if I updated you on Tuesday? This way you could decide to go for an alternative if I still can't figure out a way of fixing this.
Hello @norakassner , @albertvillanova has pushed a fix in https://github.com/huggingface/datasets/pull/3910. When the PR is merged, the issue should be fixed! 🎉
I haven't tested the newlines fix (I don't think it will help because the dataset script is calling gzip.open directly?), but even if it does fix it, the offset bug will still need to be fixed:
diff --git a/OSCAR-2109.py b/OSCAR-2109.py
index c2c94595..53e66dc3 100644
--- a/OSCAR-2109.py
+++ b/OSCAR-2109.py
@@ -397,8 +397,8 @@ class Oscar2109(datasets.GeneratorBasedBuilder):
     def _generate_examples(self, metadata_and_text_files):
         """This function returns the examples in the raw (text) form by iterating on all the files."""
         id_ = 0
-        offset = 0
         for meta_path, text_path in metadata_and_text_files:
+            offset = 0
             logger.info("generating examples from = %s", text_path)
             with gzip.open(open(text_path, "rb"), "rt", encoding="utf-8") as text_f:
                 with gzip.open(open(meta_path, "rb"), "rt", encoding="utf-8") as meta_f:
I will try to test the newlines fix in the next day or two.
Hi,
In relation with the bugs:
- packaged/ca/ca_meta_part_1.jsonl.gz has "offset":0 and packaged/ca/ca_meta_part_2.jsonl.gz also has "offset":0, so the offset must be reset for each part rather than carried over across files.
- nb_sentences in the metadata will not align with the sentences generated by the Python script if the text contains some "\r" line endings; the files need to be opened with newline="\n" (as suggested by @adrianeboyd) to align the number of sentences, instead of letting universal newlines (or str.splitlines) treat "\r" as a line break (see the sketch below).
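A small illustration of the newline behaviour, as a sketch (sample.txt.gz is just a stand-in file written for the demonstration):

import gzip

# "\r" is treated as a line break under universal newlines, so line counts can
# exceed the nb_sentences recorded in the metadata; newline="\n" keeps "\r"
# inside the line.
text = "first line\rstill first line\nsecond line\n"

with gzip.open("sample.txt.gz", "wt", encoding="utf-8", newline="\n") as f:
    f.write(text)

with gzip.open("sample.txt.gz", "rt", encoding="utf-8") as f:
    print(len(f.readlines()))  # 3: universal newlines also split on "\r"

with gzip.open("sample.txt.gz", "rt", encoding="utf-8", newline="\n") as f:
    print(len(f.readlines()))  # 2: only "\n" ends a line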
CC: @Uinelj @pjox @norakassner
Normally, the issues should be fixed now.
CC: @adrianeboyd @norakassner @Uinelj
@albertvillanova Thank you! I removed the cache and downloaded the dataset again. Somehow, I still have the exact same misalignment as reported earlier: https://github.com/oscar-corpus/corpus/issues/18#issuecomment-1058916842. Maybe I missed removing some important cache? I checked whether it is using the newest OSCAR-2109.py version, and it is. It also accessed the most recent download. I am not sure what is going on. Any idea?
Copied from: https://github.com/huggingface/datasets/issues/3704
As mentioned in the comments, potentially related to: #15
The only way that I got a simple wc -w on the raw texts from git-lfs in the repo at https://huggingface.co/datasets/oscar-corpus/OSCAR-2109 to exactly match wc -w on all the texts exported from the loaded dataset was to fix all three issues mentioned below, plus not stripping all trailing whitespace. Just pairing the text/meta filenames was not sufficient.

Describe the bug

The oscar-corpus/OSCAR-2109 data appears to be misaligned and truncated by the dataset builder for subsets that contain more than one part and for cases where the texts contain non-unix newlines.

Steps to reproduce the bug
A few examples, although I'm not sure how deterministic the particular (mis)alignment is in various configurations:
- For deduplicated_fi, all exported raw texts from the dataset are 17GB rather than 20GB as reported in the data splits overview table. The token count with wc -w for the raw texts is 2,067,556,874 rather than the expected 2,357,264,196 from the data splits table.
- For deduplicated_no, all exported raw texts contain 624,040,887 rather than the expected 776,354,517 tokens.
- For deduplicated_mk, it is 122,236,936 rather than 134,544,934 tokens.

I'm not expecting the wc -w counts to line up exactly with the data splits table, but for comparison the wc -w count for deduplicated_mk on the raw texts is 134,545,424.
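One way to reproduce this kind of count from the loaded dataset, as a rough sketch (not the exact script; str.split() approximates the whitespace-delimited word count that wc -w reports):

from datasets import load_dataset

# Export every text and count whitespace-separated tokens; the total can be
# compared against `wc -w` on the raw texts pulled via git-lfs.
dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_mk",
                       split="train", use_auth_token=True)
token_count = 0
with open("exported_mk.txt", "w", encoding="utf-8") as out_f:
    for record in dataset:
        out_f.write(record["text"] + "\n")
        token_count += len(record["text"].split())
print(token_count)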
Issues

- The text and metadata part files are not paired correctly when a subset has more than one part.
- The offset is not reset for each new file pair, so documents after the first part end up misaligned or truncated.
- Texts containing non-unix ("\r") newlines throw off the line counts used to extract each document.
Expected results
All texts from the OSCAR release are extracted according to the metadata and aligned with the correct metadata.
Fixes
Not necessarily the exact fixes/checks you may want to use (I didn't test all languages or do any cross-platform testing, and I'm not sure all the details are compatible with streaming), however, to highlight the issues:
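A minimal sketch of the kinds of changes and checks meant here (this is not the original patch; it assumes the packaged format discussed in this thread, i.e. one JSON metadata record per document with "offset" and "nb_sentences" counted per text file, and a blank line separating documents in the text files):

import gzip
import json

def generate_examples(metadata_and_text_files):
    id_ = 0
    for meta_path, text_path in metadata_and_text_files:
        offset = 0  # reset for every file pair, not once per config
        # newline="\n" so that a "\r" inside a document is not counted as a line break
        with gzip.open(text_path, "rt", encoding="utf-8", newline="\n") as text_f, \
             gzip.open(meta_path, "rt", encoding="utf-8", newline="\n") as meta_f:
            for meta_line in meta_f:
                meta = json.loads(meta_line)
                # skip blank separator lines until the recorded offset is reached
                while offset < meta["offset"]:
                    line = text_f.readline()
                    offset += 1
                    assert line == "\n", "expected a blank separator line"
                # read exactly nb_sentences lines, keeping trailing whitespace
                lines = []
                for _ in range(meta["nb_sentences"]):
                    lines.append(text_f.readline().rstrip("\n"))
                    offset += 1
                yield id_, {"id": id_, "text": "\n".join(lines), "meta": meta}
                id_ += 1
            # after the last record, the text file should be exhausted
            assert text_f.readline() == "", "unread text left at the end of the file"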
I've tested this with a number of smaller deduplicated languages with 1-20 parts and the resulting datasets looked correct in terms of word count and size when compared to the data splits table and raw texts, and the text/metadata alignments were correct in all my spot checks. However, there are many many languages I didn't test and I'm not sure that there aren't any texts containing blank lines in the corpus, for instance. For the cases I tested, the assertions related to blank lines and EOF made it easier to verify that the text and metadata were aligned as intended, since there would be little chance of spurious alignments of variable-length texts across so much data.