Closed jowagner closed 3 years ago
Yes, I think that's it. scripts/text_processor.py only works with .bz2, .gz and .txt. What has happened here is that the script unzips the file but it is still a .tar.
The first 40 or so lines in "oscar00" are null bytes and then there is a README about the OSCAR corpus which was included in the .tar (not pasted for privacy reasons):
./PaxHeaders.28723/oscar0000644000000000000000000000013213705575523012023 xustar0030 mtime=1595341651.528407163
30 atime=1595341652.220401845
30 ctime=1595341651.528407163
oscar/0000755001164200001440000000000013705575523013056 5ustar00jwagnerusers00000000000000oscar/PaxHeaders.28723/unshuffled0000644000000000000000000000013213705575732014033 xustar0030 mtime=1595341786.339335015
30 atime=1595341786.911330458
30 ctime=1595341786.339335015
oscar/unshuffled/0000755001164200001440000000000013705575732015223 5ustar00jwagnerusers00000000000000oscar/unshuffled/PaxHeaders.28723/ga0000644000000000000000000000013213705576164014422 xustar0030 mtime=1595341940.454106263
30 atime=1595341941.030101674
30 ctime=1595341940.454106263
oscar/unshuffled/ga/0000755001164200001440000000000013705576164015612 5ustar00jwagnerusers00000000000000oscar/unshuffled/ga/PaxHeaders.28723/readme.txt0000644000000000000000000000013213705575772016501 xustar0030 mtime=1595341818.059082119
30 atime=1595341950.078029535
30 ctime=1595341818.059082119
oscar/unshuffled/ga/readme.txt0000644001164200001440000000344413705575772017621 0ustar00jwagnerusers0000000000000002/07/2020 18:43
The remainder of the file appears to be plain text and can be read with a simple python script, e.g.:
with open("oscar-ga-unshuffled.tar", 'rt') as fi:
    for i, line in enumerate(fi):
        if i < 50:
            print(line)
and these sentences were added to the sentence bucket which is why this went undetected.
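A cheap guard against this kind of silent contamination would be to count null bytes in the first few lines before anything reaches the sentence bucket. The following is only a sketch: `count_null_lines` is a hypothetical helper, not something that exists in scripts/text_processor.py, and the limit of 50 lines just mirrors the snippet above.

```python
def count_null_lines(lines, limit=50):
    """Count how many of the first `limit` lines contain a null byte."""
    count = 0
    for i, line in enumerate(lines):
        if i >= limit:
            break
        # Plain text should never contain "\x00"; tar padding and headers do.
        if "\x00" in line:
            count += 1
    return count
```

Any non-zero result for a file that is supposed to be plain text is a strong hint that an archive header or padding block has leaked into the input.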
In case you need to read the tar file, here is a code fragment I wrote a few weeks ago for the nlp class:
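import tarfile

# The wrapper function and its name are illustrative; the fragment
# needs an enclosing generator function for `yield` to be valid.
def read_documents(data_path):
    # Stream the gzip-compressed tar and yield each .txt member
    # as a list of whitespace-tokenised lines.
    with open(data_path, 'rb') as tgz_stream:
        with tarfile.open(
            mode = 'r|gz',
            fileobj = tgz_stream
        ) as tar_archive:
            for tar_member in tar_archive:
                path = tar_member.name
                if path.endswith('.txt'):
                    f = tar_archive.extractfile(tar_member)
                    document = [
                        line.decode('utf-8').split()
                        for line in f.readlines()
                    ]
                    yield document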
Edit: fixed error in code
1611ab305b2d9fa8927b29297de8607706a28db2 now unzips and untars the downloaded file and moves the .txt file back into the <corpus>/raw directory. scripts/text_processor.py then operates on just this text file, so there is no need to read from the tar file, which would be a bit more complicated as there are other .txt files (readme and sha) in the tar as well. But thanks for the code!
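For reference, the same step could also be done in Python without unpacking everything to disk first. This is a sketch under assumptions, not what the commit does: `extract_corpus_text` and its parameters are made up for illustration, and filtering out the readme and sha files by the substrings "readme" and "sha" is a guess at their names.

```python
import shutil
import tarfile
from pathlib import Path

def extract_corpus_text(archive_path, raw_dir, skip_substrings=("readme", "sha")):
    """Copy the corpus .txt members of a (compressed) tar into raw_dir.

    Members whose base name contains one of `skip_substrings` are ignored.
    The default substrings are an assumption about how the readme and
    checksum files are named.
    """
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # "r:*" lets tarfile detect bz2/gz/xz compression by itself.
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            name = Path(member.name).name
            if not member.isfile() or not name.endswith(".txt"):
                continue
            if any(s in name.lower() for s in skip_substrings):
                continue
            target = raw_dir / name
            # Write under the base name only, so paths stored in the
            # archive cannot escape raw_dir.
            with archive.extractfile(member) as source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            written.append(target)
    return written
```

The function returns the list of files it wrote, so a caller can check that exactly one corpus file came out of the archive before handing it to the text processor.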
The .out file James shared 2021-04-02 has messages about null bytes in the oscar corpus, and the quoted sequence contains not only long sequences of null bytes but also jwagner, 755, 644 and strings like oscar/unshuffled/ga/PaxHeaders.28723/readme.txt. The .bz2 format does not specify fields for such file metadata, and long sequences of the same byte are extremely unlikely in compressed output. Is this maybe a .tar.bz2 file, where the pipeline uncompressed the bz2 stream but does not understand .tar?
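One way to test this hypothesis is to decompress the first 512 bytes and look for the tar magic string. POSIX and GNU tar headers carry the bytes "ustar" at offset 257 of each header block, which matches the "ustar" fragments visible in the dump. The function names below are hypothetical, and the check is only a heuristic: very old pre-POSIX tar archives have no magic and would not be detected.

```python
import bz2

TAR_MAGIC_OFFSET = 257  # position of the "ustar" magic within a 512-byte tar header

def looks_like_tar(data: bytes) -> bool:
    """Return True if `data` begins with a POSIX/GNU tar header block."""
    return data[TAR_MAGIC_OFFSET:TAR_MAGIC_OFFSET + 5] == b"ustar"

def bz2_contains_tar(path) -> bool:
    """Decompress only the first block of a .bz2 file and test it for tar magic."""
    with bz2.open(path, "rb") as stream:
        return looks_like_tar(stream.read(512))
```

If `bz2_contains_tar` returns True for the downloaded file, the decompressed output still needs to be untarred before it can be treated as plain text.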