Pipeline reports null bytes and jwagner file metadata in oscar corpus

jowagner commented 3 years ago

The .out file James shared 2021-04-02 has messages about null bytes in the oscar corpus:

WARNING:root:Removing null bytes from text in ../../../../../../spinning/jbarry/ga_BERT/wiki-bert-pipeline/data/conll17_gdrive_NCI_oscar_paracrawl_filtering_None/ga/external-texts/oscar_00.bz2:

and the quoted sequence contains not only long sequences of null bytes but also jwagner, 755, 644 and strings like oscar/unshuffled/ga/PaxHeaders.28723/readme.txt. The .bz2 format does not specify fields for such file metadata, and long sequences of the same byte are extremely unlikely in compressed output. Is this maybe a .tar.bz2 file, where the pipeline decompressed the bz2 stream but does not understand .tar?
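One way to test this hypothesis is to look at the first 512-byte block of the decompressed stream: POSIX tar headers (including pax extended headers) carry the magic string "ustar" at byte offset 257. The sketch below is only an illustration; the function name bz2_contains_tar is made up here and is not part of the pipeline.

```python
import bz2

TAR_BLOCK_SIZE = 512      # tar headers are written as 512-byte blocks
USTAR_MAGIC_OFFSET = 257  # position of the "ustar" magic inside a header block


def bz2_contains_tar(path):
    """Return True if the decompressed .bz2 stream starts with a tar header.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    with bz2.open(path, "rb") as stream:
        header = stream.read(TAR_BLOCK_SIZE)
    # Both POSIX ("ustar\0") and GNU ("ustar  ") headers start with b"ustar".
    return header[USTAR_MAGIC_OFFSET:USTAR_MAGIC_OFFSET + 5] == b"ustar"
```

If this returns True for oscar_00.bz2, the file is really a .tar.bz2 and needs to be untarred after decompression.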

jbrry commented 3 years ago

Yes, I think that's it. scripts/text_processor.py only works with .bz2, .gz and .txt. What has happened here is that the script decompresses the file, but the result is still a .tar.

The first 40 or so lines in "oscar00" are null bytes and then there is a README about the OSCAR corpus which was included in the .tar (not pasted for privacy reasons):

./PaxHeaders.28723/oscar0000644000000000000000000000013213705575523012023 xustar0030 mtime=1595341651.528407163
30 atime=1595341652.220401845
30 ctime=1595341651.528407163
oscar/0000755001164200001440000000000013705575523013056 5ustar00jwagnerusers00000000000000oscar/PaxHeaders.28723/unshuffled0000644000000000000000000000013213705575732014033 xustar0030 mtime=1595341786.339335015
30 atime=1595341786.911330458
30 ctime=1595341786.339335015
oscar/unshuffled/0000755001164200001440000000000013705575732015223 5ustar00jwagnerusers00000000000000oscar/unshuffled/PaxHeaders.28723/ga0000644000000000000000000000013213705576164014422 xustar0030 mtime=1595341940.454106263
30 atime=1595341941.030101674
30 ctime=1595341940.454106263
oscar/unshuffled/ga/0000755001164200001440000000000013705576164015612 5ustar00jwagnerusers00000000000000oscar/unshuffled/ga/PaxHeaders.28723/readme.txt0000644000000000000000000000013213705575772016501 xustar0030 mtime=1595341818.059082119
30 atime=1595341950.078029535
30 ctime=1595341818.059082119
oscar/unshuffled/ga/readme.txt0000644001164200001440000000344413705575772017621 0ustar00jwagnerusers0000000000000002/07/2020 18:43

The remainder of the file appears to be plain text and can be read with a simple python script, e.g.:

# Print the first 50 lines of the (uncompressed) tar file, then stop reading
with open("oscar-ga-unshuffled.tar", 'rt') as fi:
    for i, line in enumerate(fi):
        if i >= 50:
            break
        print(line)

and these sentences were added to the sentence bucket, which is why this went undetected.

jowagner commented 3 years ago

In case you need to read the tar file, here is a code fragment I wrote a few weeks ago for the NLP class:

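    import tarfile

    def read_documents(data_path):
        # Wrapper name is illustrative; the fragment is the body of a generator.
        with open(data_path, 'rb') as tgz_stream:
            # 'r|gz' streams a .tar.gz; use 'r|bz2' for a .tar.bz2
            with tarfile.open(
                mode = 'r|gz',
                fileobj = tgz_stream
            ) as tar_archive:
                for tar_member in tar_archive:
                    path = tar_member.name
                    # skip directories and other non-regular members
                    if tar_member.isfile() and path.endswith('.txt'):
                        f = tar_archive.extractfile(tar_member)
                        # one document per file: list of whitespace-tokenised lines
                        document = [
                            line.decode('utf-8').split()
                            for line in f.readlines()
                        ]
                        yield document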

Edit: fixed error in code

jbrry commented 3 years ago

Commit 1611ab305b2d9fa8927b29297de8607706a28db2 now unzips and untars the downloaded file and moves the .txt file back into the <corpus>/raw directory. scripts/text_processor.py then operates on just this text file, so there is no need to read from the tar file, which would be a bit more complicated as there are other .txt files (readme and sha) in the tar as well. Thanks for the code, though!
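For readers who want to reproduce the unzip-and-untar step, here is a minimal sketch. It is not the code from the commit above. The function name extract_corpus_text is hypothetical, and it assumes that the corpus text is the largest .txt member in the archive, with the readme and sha files being much smaller.

```python
import shutil
import tarfile
from pathlib import Path


def extract_corpus_text(archive_path, raw_dir):
    """Copy the largest .txt member of a .tar.bz2 archive into raw_dir.

    Assumption (not taken from the commit): the corpus text is the largest
    .txt member; readme and checksum files are much smaller.
    """
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    # "r:bz2" decompresses the bz2 layer and parses the tar layer in one go.
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        candidates = [
            member for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".txt")
        ]
        if not candidates:
            raise ValueError(f"no .txt member found in {archive_path}")
        corpus_member = max(candidates, key=lambda member: member.size)
        # Drop the directory part (e.g. oscar/unshuffled/ga/) of the member name.
        target = raw_dir / Path(corpus_member.name).name
        with archive.extractfile(corpus_member) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

Choosing the member by size avoids hard-coding file names, but an explicit name would be safer if the archive layout is known.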

jbrry / Irish-BERT, issue #67