internetarchive / warc

Python library for reading and writing warc files
GNU General Public License v2.0

Fix WARC writing bug #23

Open jeffcasavant opened 8 years ago

jeffcasavant commented 8 years ago

I hit this error while writing a WARC record to a file:

[jeff@lamarzocco warcs]$ ./cleanwarc.py  in.warc.gz filtered.warc.gz
Traceback (most recent call last):
  File "./cleanwarc.py", line 86, in <module>
    main()
  File "./cleanwarc.py", line 82, in main
    filter_warc(args.infile, args.outfile)
  File "./cleanwarc.py", line 61, in filter_warc
    output_warc.write_record(record)
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 268, in write_record
    warc_record.write_to(self.fileobj)
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 161, in write_to
    f.write(self.payload)
  File "/usr/lib/python2.7/site-packages/warc/gzip2.py", line 71, in write
    BaseGzipFile.write(self, data)
  File "/usr/lib/python2.7/gzip.py", line 240, in write
    if len(data) > 0:
AttributeError: FilePart instance has no attribute '__len__'

I added a __len__ method to FilePart to get past this, but then hit a second error:

Traceback (most recent call last):
  File "./cleanwarc.py", line 90, in <module>
    main()
  File "./cleanwarc.py", line 86, in main
    filter_warc(args.infile, args.outfile)
  File "./cleanwarc.py", line 65, in filter_warc
    output_warc.write_record(record)
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 268, in write_record
    warc_record.write_to(self.fileobj)
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 161, in write_to
    f.write(self.payload)
  File "/usr/lib/python2.7/site-packages/warc/gzip2.py", line 71, in write
    BaseGzipFile.write(self, data)
  File "/usr/lib/python2.7/gzip.py", line 241, in write
    self.fileobj.write(self.compress.compress(data))
TypeError: must be string or read-only buffer, not instance

This PR fixes both issues by passing the buf attribute of the FilePart (rather than the whole FilePart) to gzip.
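For anyone reading along, here is a minimal sketch of the idea, not the actual patch. It assumes Python 3 (the tracebacks above are from Python 2.7, where the exception types differ slightly). FakeFilePart is a hypothetical stand-in for the library's FilePart, and it assumes the payload bytes are available on a buf attribute. write_payload, roundtrip and direct_write_fails are illustrative helper names that are not part of warc.

```python
import gzip
import io


class FakeFilePart:
    """Hypothetical stand-in for warc's FilePart: holds payload bytes in `buf`."""

    def __init__(self, data):
        self.buf = data


def write_payload(gz, payload):
    """Write the raw bytes held by the payload, not the wrapper object itself."""
    # Assumption: wrapper objects expose their bytes as `.buf`;
    # plain bytes are passed through unchanged.
    data = getattr(payload, "buf", payload)
    gz.write(data)


def roundtrip(payload):
    """Compress the payload into an in-memory gzip stream and read it back."""
    raw = io.BytesIO()
    with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
        write_payload(gz, payload)
    return gzip.decompress(raw.getvalue())


def direct_write_fails(payload):
    """Return True if handing the wrapper object straight to gzip raises."""
    with gzip.GzipFile(fileobj=io.BytesIO(), mode="wb") as gz:
        try:
            gz.write(payload)
        except (TypeError, AttributeError):
            return True
    return False
```

Calling roundtrip(FakeFilePart(b"hello")) goes through the buf attribute and gives back the original bytes, while direct_write_fails(FakeFilePart(b"hello")) shows that gzip rejects the wrapper object because it is not bytes-like.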

wolfgangmeyers commented 7 years ago

This would be great to merge. Is the project abandoned?

jeffcasavant commented 7 years ago

@wolfgangmeyers I guess? This has seen no attention since I submitted it, getting on a year ago. Figured it would be a no-brainer :stuck_out_tongue: Who's the maintainer?