jmacd / xdelta

open-source binary diff, delta/differential compression tools, VCDIFF/RFC 3284 delta compression
http://xdelta.org

Unable to use xd3_encode_input with winsize < input size #238

Open Jakz opened 6 years ago

Jakz commented 6 years ago

I'm trying to integrate xdelta3 into a program I'm writing. I tried to follow all the guidelines from the wiki and the comments inside the code, and to step through the code itself to understand how it works, but I'm not able to reliably produce good patches and I don't understand why.

The xdelta3 code is integrated into a pipe-like structure in which a process() function is called; that function reads available input from a memory buffer and produces its result in an output buffer (which is then managed outside that code).

I'm able to produce a good patch only when windowSize == inputSize, so that just one window is created. If I reduce the window size, the patch is still generated, but when applied to the source file it produces a file that differs from the input, and the differences are somewhat predictable, so there must be something wrong in how I'm understanding the algorithm.

My understanding is that you start with some input; xd3_encode_input keeps returning XD3_INPUT until a full window is ready (or XD3_FLUSH is set, as at EOF), then it returns XD3_WINSTART, then it fetches source data if available (by asking through XD3_GETSRCBLK as many times as it needs). Then it returns XD3_OUTPUT multiple times to let the caller save all the produced output, and finally XD3_WINFINISH. The loop goes this way until you reach EOF.
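
In loop form, my mental model is the following minimal sketch (read_input, write_output and load_source_block are hypothetical placeholders for the caller's I/O; stream and source are assumed to be already configured):

size_t read_input(uint8_t* buf, size_t size);              /* hypothetical */
void write_output(const uint8_t* data, size_t size);       /* hypothetical */
void load_source_block(xd3_source* source, xoff_t blkno);  /* hypothetical */

void encode_loop(xd3_stream& stream, xd3_source& source)
{
  uint8_t buf[16384]; /* one winsize worth of input */
  size_t n;

  do
  {
    /* feed the next chunk; on the last (short) read ask for a flush */
    n = read_input(buf, sizeof(buf));
    if (n < sizeof(buf))
      xd3_set_flags(&stream, stream.flags | XD3_FLUSH);
    xd3_avail_input(&stream, buf, n);

    for (;;)
    {
      int ret = xd3_encode_input(&stream);

      if (ret == XD3_INPUT)
        break; /* encoder wants the next chunk */

      switch (ret)
      {
        case XD3_OUTPUT: /* save the produced delta bytes */
          write_output(stream.next_out, stream.avail_out);
          xd3_consume_output(&stream);
          break;
        case XD3_GETSRCBLK: /* load the requested source block */
          load_source_block(&source, source.getblkno);
          break;
        case XD3_GOTHEADER:
        case XD3_WINSTART:
        case XD3_WINFINISH: /* informational only */
          break;
        default:
          return; /* error, see stream.msg */
      }
    }
  } while (n == sizeof(buf));
}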

Starting from this, my code is the following:

void xdelta3_filter::init()
{
  memset(&_stream, 0, sizeof(_stream));
  memset(&_config, 0, sizeof(_config));
  memset(&_xsource, 0, sizeof(_xsource));

  /* configuration */
  xd3_init_config(&_config, 0);

  _config.winsize = _windowSize;
  _config.sprevsz = _windowSize >> 2;
  _config.getblk = nullptr;

  _config.flags = XD3_SEC_DJW | XD3_COMPLEVEL_9;

  int r = xd3_config_stream(&_stream, &_config);
  assert(r == 0);

  /* setting source */
  _xsource.blksize = _sourceBlockSize;
  _xsource.max_winsize = _sourceBlockSize;

  _xsource.onblk = 0;
  _xsource.curblkno = 0;
  _xsource.curblk = nullptr;

  r = xd3_set_source(&_stream, &_xsource);
  assert(r == 0);

  _state = XD3_INPUT;
}
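
Note that since _config.getblk is left null, the encoder returns XD3_GETSRCBLK from xd3_encode_input and expects the caller to fill the requested block in, which is what process() below does. The alternative, if I read the headers correctly, would be to register a callback; a minimal sketch (kBlkSize and the FILE* stashed in stream->opaque are my own conventions, not part of the library):

/* assumes <cstdio>, <cstdint> and xdelta3.h */
static const usize_t kBlkSize = 1 << 18; /* must match _xsource.blksize */

static int my_getblk(xd3_stream* stream, xd3_source* source, xoff_t blkno)
{
  static uint8_t block[kBlkSize];               /* must stay valid between calls */
  FILE* f = static_cast<FILE*>(stream->opaque); /* source file handle */

  if (fseeko(f, (off_t)(blkno * (xoff_t)kBlkSize), SEEK_SET) != 0)
    return XD3_INTERNAL;

  source->curblkno = blkno;
  source->curblk = block;
  source->onblk = (usize_t)fread(block, 1, kBlkSize, f); /* bytes valid in block */
  return 0;
}

/* and in init(): _config.getblk = my_getblk; */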

My process() function is the following:

void xdelta3_filter::process()
{
  /* ask to flush input when all available input has been received and _in buffer is empty */
  if (_isEnded && _in.empty() && !(_stream.flags & XD3_FLUSH))
  {
    printf("%p: xdelta3::process flush request (setting XD3_FLUSH flag)\n", this);
    xd3_set_flags(&_stream, _stream.flags | XD3_FLUSH);
  }

  usize_t effective = xd3_min(_stream.winsize, _in.used());

  /* if xd3 was waiting for input (this is also the initial state set by init()) */
  if (_state == XD3_INPUT && effective > 0)
  {
    xd3_avail_input(&_stream, _in.head(), effective);

    static size_t ___total_in = 0;
    ___total_in += effective; /* track total bytes handed to the encoder */
    printf("%p: xdelta3::process consumed %lu bytes (XD3_INPUT) (total: %lu)\n", this, effective, ___total_in);

    _in.consume(effective);
  }

  _state = xd3_encode_input(&_stream);

  switch (_state)
  {
    case XD3_OUTPUT:
    {
      printf("%p: xdelta3::process produced bytes (avail_out: %lu, total_out: %lu) (XD3_OUTPUT)\n", this, _stream.avail_out, _stream.total_out);

      size_t produced = _stream.avail_out;

      while (produced > _out.available()) _out.resize(_out.capacity()*2);

      memcpy(_out.tail(), _stream.next_out, produced);
      _out.advance(produced);

      xd3_consume_output(&_stream);

      break;
    }

    case XD3_INPUT:
    {
      break;
    }

    case XD3_GOTHEADER:
    {
      printf("%p: xdelta3::process got %s\n", this, nameForXdeltaReturnValue(_state));
      break;
    }

    case XD3_WINSTART:
    {
      printf("%p: xdelta3::process started window (avail_in: %lu) (XD3_WINSTART)\n", this, _stream.avail_in);
      break;
    }

    case XD3_WINFINISH:
    {
      printf("%p: xdelta3::process finished window (XD3_WINFINISH)\n", this);

      if (_isEnded && _in.empty() && _stream.buf_avail == 0 && _stream.buf_leftover == 0)
      {
        printf("%p: xdelta::process input was finished so encoding is finished\n", this);
        _finished = true;
      }

      break;
    }

    case XD3_GETSRCBLK:
    {
      printf("%p: xdelta3::process block request %lu (XD3_GETSRCBLK)\n", this, _xsource.getblkno);

      assert(_sourceBuffer.capacity() >= _sourceBlockSize);

      const xoff_t blockNumber = _xsource.getblkno;
      const off_t offset = _sourceBlockSize * blockNumber;
      const usize_t size = std::min(_sourceBlockSize, (usize_t)(_source->size() - offset));

      _source->seek(offset);
      _source->read(_sourceBuffer.head(), size);

      _xsource.onblk = size;
      _xsource.curblkno = blockNumber;
      _xsource.curblk = _sourceBuffer.head();

      break;
    }

    default:
      assert(false);
  }
}
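
For context, the surrounding pipeline drives the filter roughly like this (a simplified, hypothetical driver; feed(), markEnded(), finished() and the _out accessors stand in for however the pipe framework actually exposes them):

xdelta3_filter filter;
filter.init();

while (!filter.finished())
{
  if (input.hasMore())
    filter.feed(input.next(), input.nextSize());  /* appends to _in */
  else
    filter.markEnded();                           /* sets _isEnded */

  filter.process();                               /* one encode step */

  sink.write(filter.outData(), filter.outSize()); /* drains _out */
}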

Now, the code is quite similar to xd3_process_stream, but it doesn't work when the input size is greater than the window size. For example, if I feed it a source.bin and an input.bin (which is source.bin with 1024 random bytes modified), both 32kb, with a window size of 16kb, the resulting patch generates a new file whose first 16kb differ from the original input and whose last 16kb are identical. So this is surely related to how I'm interfacing with the encoder, but I'm not able to understand what it could be.

Jakz commented 6 years ago

Ok, it looks like I found the issue by inspecting the resulting vcdiff with the printdelta subcommand. The problem is that the input must remain valid until it is consumed, and when (and if) this happens depends on how you feed input to the encoder.
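
In other words, the buffer handed to xd3_avail_input has to stay untouched until the encoder comes back asking for more input, which my ring buffer did not guarantee. A minimal sketch of the fix I have in mind (_staging is a hypothetical std::vector<uint8_t> member that outlives the call):

/* before: the ring buffer may recycle this memory while the encoder
 * still references it */
xd3_avail_input(&_stream, _in.head(), effective);
_in.consume(effective);

/* after: copy into a buffer that stays valid until the next XD3_INPUT */
_staging.assign(_in.head(), _in.head() + effective);
xd3_avail_input(&_stream, _staging.data(), effective);
_in.consume(effective); /* safe: the encoder now reads from _staging */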

So, according to the comments:

For the last point I'd like to work through an example to check whether I got it right: suppose the window size is 10 and you feed chunks of 3, 3, 3, 3. The input of the first three chunks is copied into an internal buffer (so the caller can release it); on the fourth chunk the total buffered input becomes 12, so a full window of 10 is moved to avail_in and processed, and we're left with a leftover of 2 which stays buffered.

Now, if you feed a chunk that is larger than the window size, will it be buffered and split into window-sized pieces in any case? So if you have a leftover and you don't flush it, you can no longer encode a chunk larger than the window in a single call; is this correct?
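
To double-check the arithmetic in my example above, here is a tiny standalone model of the buffering as I understand it (this simulates my reading of the behaviour, not xdelta3's actual code):

#include <cstdio>

int main()
{
  const int winsize = 10;
  const int chunks[] = { 3, 3, 3, 3 };
  int buffered = 0;

  for (int c : chunks)
  {
    buffered += c; /* chunk is copied into the internal buffer */

    if (buffered >= winsize)
    {
      /* a full window is moved to avail_in; the rest stays buffered */
      printf("process window of %d, leftover %d\n", winsize, buffered - winsize);
      buffered -= winsize;
    }
    else
      printf("buffer chunk of %d (total %d)\n", c, buffered);
  }

  /* output: three buffered chunks (totals 3, 6, 9), then one processed
   * window of 10 with a leftover of 2 */
  return 0;
}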