owncloud / files_antivirus

:space_invader: virus scanner for ownCloud
GNU Affero General Public License v3.0
48 stars 30 forks source link

Scanner architecture next steps #515

Open DeepDiver1975 opened 2 years ago

DeepDiver1975 commented 2 years ago
    For the scanner architecture, I'd aim for something like the following diagram:

Screenshot from 2022-11-17 10-55-44

Outside of the ICAP scanner, I'm not sure if we're officially supporting any scanner other than clamav, otherwise we can adjust the ClamavScanner class name to another one.


For the ClamavScanner there are a couple of options:

abstract class ClamavScanner implements IScanner {
  abstract public function initScanner();

  abstract protected function prepareToScan(IScannable $item);

  final public function scan(IScannable $item) {
    $this->prepareToScan($item);
    while (($chunk = $item->fread()) !== false) {
      $this->sendData($chunk);
    }
    return $this->finishScan($item);
  }

  abstract protected function sendData($data);

  abstract protected function finishScan(IScannable $item);

  abstract public function shutdownScanner();
}

or

abstract class ClamavScanner implements IScanner {
  private $scannerConfig;

  public function initScanner() {
    $this->scannerConfig = ConfigHelper::getConfig();
  }

  abstract protected function prepareScan(IScannable $item, $scannerConfig);

  final public function scan(IScannable $item) {
    $this->prepareScan($item, $this->scannerConfig);
    $stop = false;
    while (($chunk = $item->fread()) !== false || $stop) {
      $stop = $this->sendData($chunk, $this->scannerConfig);
    }
    return $this->finishScan($item, $this->scannerConfig);
  }

  abstract protected function sendData($data, $scannerConfig);

  abstract protected function finishScan(IScannable $item, $scannerConfig);

  public function shutdownScanner() {
  }
}

The main point is that the scan function will be final and it won't be modified. The abstract methods are expected to be implemented by the subclasses in different ways. For example, the prepareScan method might open a pipe to a file in case of the LocalScanner, but the DaemonScanner might open a socket instead. Buffering can also be implemented by the subclasses, if needed, in the sendData method in order to send a bigger chunk of data. The method signature can also be adjusted according to necessities. The abstract class can provide some configuration to each implementation, likely from the same place, instead of letting each implementation fetch the configuration from different places.


For the ICAPScanner, the idea is the same:

abstract class ICAPScanner implements IScanner {
  public function initScanner() {
  }

  abstract protected function prepareRequest(IScannable $item): ICAPRequestData;

  final public function scan(IScannable $item) {
    $requestData = $this->prepareRequest($item);

    $client = new ICAPClient($requestData->getHost(), $requestData->getPort());
    if ($requestData->getType() === 'reqmod') {
      $response = $client->reqmod($requestData->getService(), $requestData->getBody(), $requestData->getHeaders());
    } else {
      $response = $client->respmod($requestData->getService(), $requestData->getBody(), $requestData->getHeaders());
    }
    return $this->processResponse($response);
  }

  abstract protected function processResponse(array $data): Status;

  public function shutdownScanner() {
  }
}

In this case, the implementations are expected to provide whatever the ICAPClient needs to perform the request in the prepareRequest method and return it as an ICAPRequestData. Once the ICAPClient sends the request, the implementations have a chance to process the response.

There might be some adjustments to be made (ICAPClient might need to handle a file stream somehow) but it should fit the current code.

Responsibilities are quite clear in this case: the ICAPScanner will send the request, while the implementations will just prepare the data and process the result.

Also note that, if we need to provide a generic ICAP scanner, it should have its own implementation at the same level of McAfeeWeb and Fortinet. Trying to implement it in the ICAPScanner will be confusing because the responsibilities won't be clear any longer.


I think this should cover everything we have at the moment and it should allow some improvements without major code changes. So the question is how we can reach this proposed goal.

I'd propose the following steps:

  1. Get rid of the asyncScan and focus only on the scan method. We'd need to touch all scanners for this, but I don't think there is any other option.
  2. Refactor the ICAP scanners as proposed. I hope this is just moving code around with minor touches.
  3. Refactoring the rest of the scanners might be left for another PR as long as they fit the proposed solution and work.

In any case, I think it's critical to have code documentation explaining what are the expectations and who is responsible of what.

_Originally posted by @jvillafanez in https://github.com/owncloud/files_antivirus/issues/512#issuecomment-1318474846_

jvillafanez commented 2 years ago

For the current asyncScan which is performed while an upload happens, we'll have to extend the current plans. The main problem is that we have to scan (or send to scan) partial data that is being written into a stream. Right now, this is done through a stream wrapper.

Conceptually, we'll split the architecture between what the scanners are expected to provide to the outside, and how the scanners are expected to be implemented.

The IScanner will add the following methods to the ones that are planned

interface IScanner {
  public function startAsync($params);
  public function onAsyncData($data);
  public function finishAsync(): Status;
}

Note that we might need to define what exceptions can be thrown from each of those methods.

We'll likely have additional requirements to be implemented, such as:

The expected usage would be something like (just little touches here and there, but it should be the same as what is currently in place):

  $scanner->initScanner();
  $scanner->startAsync(['filename' => 'eicarvirus.test']);
  return CallBackWrapper::wrap(
    $stream,
    null,
    function ($data) use ($scanner) {
      $scanner->onAsyncData($data);
    },
    function () use ($scanner, $path) {
      $status = $scanner->finishAsync();
      if ($status === Status::INFECTED) {
        throw new Exception(....)
      }
    }
  );

Additional metadata might be needed for the async scan.


The implementations, however, won't require any change. The main trick is that both the ClamavScanner and the ICAPScanner abstract classes will implement those methods using the same abstract methods for the regular scan.

In order to do so, first will need to implement a FakeStreamScannable, mainly to hold the stream metadata while implementing the IScannable interface. We might need to make the IScannable interface a but more flexible to return more metadata, not just the filename.

class FakeStreamScannable implements IScannable {
  private $filename;

  public function __construct($filename) {
    $this->filename = $filename;
  }

  public function fread() {
    return false;
  }

  public function getFilename() {
    return $this->filename;
  }
}
abstract class ClamavScanner implements IScanner {
  final public function startAsync($params) {
    $this->runningAsync = true;
    $this->fakeAsyncScannable = new FakeStreamScannable($params['filename']);
    $this->prepareScan($fakeScannable);
  }

  final public function onAsyncData($data) {
    if ($this->runningAsync) {
      $this->sendData($data);
    } else {
      throw new \Exception(.....);
    }
  }

  final public function finishAsync(): Status {
    $this->runningAsync = false;
    return $this->finishScan($this->fakeScannable);
  }
}
abstract class ICAPScanner implements IScanner {
  final public function startAsync($params) {
    $this->runningAsync = true;
    $this->fakeAsyncScannable = new FakeStreamScannable($params['filename']);
    $this->preparedData = $this->prepareRequest($fakeScannable);
  }

  final public function onAsyncData($data) {
    if ($this->runningAsync) {
      $this->data .= $data;
    } else {
      throw new \Exception(.....);
    }
  }

  final public function finishAsync(): Status {
    $this->runningAsync = false;
    $requestData = $this->preparedData;

    $client = new ICAPClient($requestData->getHost(), $requestData->getPort());
    if ($requestData->getType() === 'reqmod') {
      $response = $client->reqmod($requestData->getService(), $requestData->getBody(), $requestData->getHeaders());
    } else {
      $response = $client->respmod($requestData->getService(), $requestData->getBody(), $requestData->getHeaders());
    }
    return $this->processResponse($response);
  }
}

(Code shown is based on what we currently have. It might be improved)

The important point is that subclasses will only have 2-3 methods to implement, and those will be reused for both the sync and async scans.


There might be some adjustments to be done with the overall plan. The ClamavScanner and ICAPScanner require subclasses to deal with IScannable objects. While this is fine, we have to create a FakeStreamScannable class to be used internally for the async scan. Taking into account that the abstract methods are unlikely to use the actual data from the IScannable object but just the metadata (currently only the filename), they could use an array or an hypothetical Metadata from an hypothetical $scannableObj->getMetadata(): Metadata method. This way we wouldn't need to create the FakeStreamScannable class.