philipmeadows / alfresco-webscript-manifold-connector

Alfresco Solr API Repository Connector for Apache ManifoldCF

Getting rid of JobIdStealer.java #13

Open maoo opened 10 years ago

maoo commented 10 years ago

As documented in the class itself:

 * This is a hack to get the jobId given a {@link org.apache.manifoldcf.crawler.system.SeedingActivity}.
 * TODO: If a way to get the job id from within a connector in ManifoldCF is found, delete this.

This class is used to access the Job ID from the Alfresco Manifold Connector. We use the Job ID as the identifier (primary key) of the entries that the connector logs (using CrawlLogger.java) to keep the state of the crawl (last transaction ID, last ACL changeset ID).

In order to solve this issue it's possible to:

maoo commented 10 years ago

Apache ManifoldCF is planning to add a Seeding Version String, which can replace the Job ID and therefore make JobIdStealer obsolete: https://issues.apache.org/jira/browse/CONNECTORS-971

maoo commented 10 years ago

I can see the following method being added to BaseRepositoryConnector.java:

  @Override
  public String addSeedDocumentsWithVersion(ISeedingActivity activities, Specification spec,
    String lastSeedVersion, long seedTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption

I can use lastSeedVersion to store lastTxnId and lastACLChangesetId, separated by the '|' character, which would make both the JobIdStealer and CrawlLogger classes unnecessary:

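      // Default to 0 (crawl from the beginning) when no usable seed version is available
      long lastTransactionId = 0;
      long lastAclChangesetId = 0;

      // lastSeedVersion may be null (for example on the first run of a job), so guard before tokenizing
      if (lastSeedVersion != null) {
         StringTokenizer tokenizer = new StringTokenizer(lastSeedVersion, "|");
         if (tokenizer.countTokens() == 2) {
            // Long.parseLong avoids the deprecated Long(String) constructor and needless boxing
            lastTransactionId = Long.parseLong(tokenizer.nextToken());
            lastAclChangesetId = Long.parseLong(tokenizer.nextToken());
         }
      }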

However, I don't know how to update the lastSeedVersion as soon as I collect the last docs processed; the old syntax is

crawlLogger.log(JobIdStealer.stealId(activities), lastTransactionId, lastAclChangesetId);

I was expecting to use a syntax like

super.setLastSeedVersion(lastTransactionId + "|" + lastAclChangesetId);

but maybe I'm misinterpreting this feature.

maoo commented 10 years ago

As confirmed by Manifold Committers, using

return lastTransactionId + "|" + lastAclChangesetId;

was enough to update the lastSeedVersion. Tests are passing; it now needs some integration testing.
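
For anyone following along, here is a minimal, self-contained sketch of the round trip described above: the value returned from addSeedDocumentsWithVersion becomes the lastSeedVersion of the next run, so encoding and parsing have to agree. The class name SeedVersionCodec, the nested State type and the fall-back-to-zero behaviour for null or malformed input are assumptions made for illustration; they are not part of the connector or of ManifoldCF.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the connector: encodes and decodes the
// "lastTransactionId|lastAclChangesetId" seed version string.
final class SeedVersionCodec {

    private static final String SEPARATOR = "|";

    // Simple immutable holder for the two counters.
    static final class State {
        final long lastTransactionId;
        final long lastAclChangesetId;

        State(long lastTransactionId, long lastAclChangesetId) {
            this.lastTransactionId = lastTransactionId;
            this.lastAclChangesetId = lastAclChangesetId;
        }
    }

    private SeedVersionCodec() {
    }

    // Builds the string that would be returned from addSeedDocumentsWithVersion.
    // A String separator is used so that '+' concatenates instead of adding numbers.
    static String encode(long lastTransactionId, long lastAclChangesetId) {
        return lastTransactionId + SEPARATOR + lastAclChangesetId;
    }

    // Parses the lastSeedVersion argument. Assumption: null or malformed input
    // means "no previous state", so both counters fall back to 0.
    static State parse(String lastSeedVersion) {
        if (lastSeedVersion == null) {
            return new State(0L, 0L);
        }
        // Pattern.quote is needed because '|' is a regex metacharacter.
        String[] parts = lastSeedVersion.split(Pattern.quote(SEPARATOR), -1);
        if (parts.length != 2) {
            return new State(0L, 0L);
        }
        try {
            return new State(Long.parseLong(parts[0].trim()),
                             Long.parseLong(parts[1].trim()));
        } catch (NumberFormatException e) {
            return new State(0L, 0L);
        }
    }
}
```

Inside the connector method this would amount to calling SeedVersionCodec.parse(lastSeedVersion) at the start and ending with return SeedVersionCodec.encode(lastTransactionId, lastAclChangesetId). Falling back to zero re-reads everything from the beginning, which is the safe choice here but may be expensive on a large repository.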

maoo commented 10 years ago

Many thanks to @OpenPj and Karl Wright for the support on this issue!