lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Loading ARC files produces record size error #199

Open jrwiebe opened 8 years ago

jrwiebe commented 8 years ago

I've been working with ARC files a fair bit recently, so I've become quite familiar with the error message that appears each time a file is loaded. It looks something like this:

ERROR ArcRecordUtils - Read 1222 bytes but expected 1298 bytes. Continuing...

I noticed that for the files I was working with, the size delta was always the same (76 bytes). I looked into this and realized this was the length of the second and third lines of the version block at the top of every file. E.g.:

1 1 InternetArchive
URL IP-address Archive-date Content-type Archive-length

These two lines are counted in the total record size, but they are skipped over when the record content is read, so copyStream always expects more bytes than it can read from such a record. The second line represents the file format version (e.g., 1.1) and the origin of the file, and the third line is currently hard-coded in org.archive.io.arc.ARCRecord.
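
The 76-byte figure can be checked by adding up the lengths of the two lines, each with its trailing newline. The following is only a sketch: the class name VersionBlockDelta is hypothetical and not part of warcbase, and the line contents are copied from the example above, assuming each line ends in a single "\n".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of warcbase.
class VersionBlockDelta {
    // Second and third lines of the version block, as in the example above.
    // Assumption: each line is terminated by a single newline character.
    static final String LINE2 = "1 1 InternetArchive\n";
    static final String LINE3 = "URL IP-address Archive-date Content-type Archive-length\n";

    // Number of bytes the two lines occupy in the file.
    static int delta() {
        return (LINE2 + LINE3).getBytes(StandardCharsets.US_ASCII).length;
    }

    public static void main(String[] args) {
        System.out.println(delta());     // 20 + 56 = 76
        System.out.println(1298 - 1222); // 76, the gap in the error message
    }
}
```

Both values come out to 76, which matches the gap between the expected and actual byte counts in the error message.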

In branch arc-tobytes I modified ArcRecordUtils to fix this, but the fix is incomplete because there is no simple way to get the origin of the archive file. For the time being I've hard-coded "InternetArchive", which works for all the ARC files we have. There is also no way, using the ARCRecord-related classes, to recover the third line, although this is probably even less likely to actually cause an issue. (Are there non-WAC tools producing ARCs out there?)

My code changes toBytes() and getContent(). The former (note the if (meta.getOffset() == 0) block):

public static byte[] toBytes(ARCRecord record) throws IOException {
    ARCRecordMetaData meta = record.getMetaData();

    // Rebuild the record's header line from its metadata.
    String metaline = meta.getUrl() + " " + meta.getIp() + " " + meta.getDate() + " "
        + meta.getMimetype() + " " + (int) meta.getLength() + "\n";
    String versionEtc = "";

    // The record at offset 0 is the version block; restore its second and
    // third lines, which are skipped when the record content is read.
    if (meta.getOffset() == 0) {
      versionEtc = meta.getVersion().replace(".", " ") +
              " InternetArchive\n" + // Should have meta.getOrigin()
              "URL IP-address Archive-date Content-type Archive-length\n";
      metaline += versionEtc;
    }
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dout = new DataOutputStream(baos);
    dout.write(metaline.getBytes());
    // Those lines are counted in the record length, so read that many fewer bytes.
    copyStream(record, (int) meta.getLength() - versionEtc.length(), true, dout);

    return baos.toByteArray();
}

I can't think of any cases where meta.getOffset() would be 0 other than at the beginning of a file, which should always start with the version block. I suppose we could also check that the URL begins with filedesc:// to be extra safe.
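
A minimal sketch of that stricter check, combining the offset test with the filedesc:// prefix. The class and method names here are hypothetical, and the helper takes plain values rather than an ARCRecordMetaData so that it stands on its own.

```java
// Hypothetical guard for illustration only; not part of warcbase or
// webarchive-commons.
class VersionBlockCheck {
    // True only for a record at the very start of the file whose URL uses
    // the filedesc:// scheme, i.e. the version block.
    static boolean isVersionBlock(long offset, String url) {
        return offset == 0 && url != null && url.startsWith("filedesc://");
    }
}
```

In toBytes(), the condition could then read isVersionBlock(meta.getOffset(), meta.getUrl()) instead of testing the offset alone.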

@lintool, what do you think about this? Should I make a request to the @iipc folks to store the info from lines 2 and 3 of the block in ARCRecord? I suppose we could always extend their classes. Or drop the error checking. I think it would be nice to get this fixed one way or another, so that users don't see error messages suggesting data loss.

jrwiebe commented 8 years ago

When the next version of webarchive-commons (1.1.7) is released it will fix this issue.

anjackson commented 8 years ago

BTW, 1.1.7 is out now.