Incorrect Document Id while Seeding Documents

rafaharo commented 10 years ago

Hi guys,

While adapting the connector to the latest version of Manifold (1.7-SNAPSHOT), we ran into the following problem:

In the connector's addSeedDocuments method, you seed the Alfresco documents using the JSON returned by the Alfresco client's fetchNodes method as the document ID, instead of just the uuid. Apart from the uuid, this JSON contains more information that is used later in the processDocuments method.

However, the processDocuments method ingests documents using only the uuid as the document ID. This is no longer valid in this version of Manifold, because documents are now processed in a pipeline that checks the original document ID. So, probably, the JSON is stored in the database as the document ID when the documents are seeded, and when the documents are later ingested with a different ID, an exception is raised.

I understand that you seed with the whole JSON for performance reasons, to avoid calling Alfresco one more time per document in the processDocuments method. It would be nice if the seeding method allowed additional metadata to be included, but currently this is not possible.

So I'm afraid the workflow for this version needs to change: a first call to Alfresco fetches only the UUIDs, and the rest of the node information is fetched later (one call per document). This is, of course, less efficient, but it is cleaner than keeping node information in memory in the connector from seeding until the documents are ingested.

Because this is a major change, I would suggest opening a branch for version 1.7. I will be happy to contribute the changes :-)

maoo commented 10 years ago

Branch created - https://github.com/maoo/alfresco-webscript-manifold-connector/tree/manifold-1.7

I didn't really understand why Document IDs should replace UUIDs, but I'm confident that you have a very clear understanding of the Manifold mechanisms, so please proceed with the code and we can follow up this discussion on some code changes.

Thanks for the effort!

rafaharo commented 10 years ago

Thanks Maurizio!

Well, I probably didn't explain myself correctly. The problem is that currently the whole JSON from fetchNodes (with uuid, type, deleted, lastAclChangesetId, etc.) is being used as the document ID while seeding:

final AlfrescoResponse response = alfrescoClient.fetchNodes(lastTransactionId, lastAclChangesetId);
int count = 0;
for (Map<String, Object> doc : response.getDocuments()) {
  String json = gson.toJson(doc);
  activities.addSeedDocument(json);
  count++;
}

Later, in the processDocuments method, only the uuid is used to ingest those seeded documents:

public void processDocuments(String[] documentIdentifiers, String[] versions,
                               IProcessActivity activities, DocumentSpecification spec,
                               boolean[] scanOnly, int jobMode) throws ManifoldCFException,
          ServiceInterruption {
    for (String doc : documentIdentifiers) {
      Map<String, Object> map = gson.fromJson(doc, Map.class);
      RepositoryDocument rd = new RepositoryDocument();
      String uuid = map.get("uuid").toString();
      rd.setFileName(uuid);
      for (Entry<String, Object> e : map.entrySet()) {
        rd.addField(e.getKey(), e.getValue().toString());
      }

      if ((Boolean) map.get("deleted")) {
        activities.deleteDocument(uuid);
      } else {
        if (this.enableDocumentProcessing) {
          processMetaData(rd,uuid);
        }
        try {
          logger.info("Ingesting with id: {}, URI {} and rd {}", String.valueOf(uuid), uuid, rd.getFileName());
          activities.ingestDocumentWithException(String.valueOf(uuid), "", uuid, rd);
        } catch (IOException e) {
          throw new ManifoldCFException(
              "Error Ingesting Document with ID " + String.valueOf(uuid), e);
        }
      }
    }
  }

Apparently, this is no longer correct because the seeded document IDs must be the same as the ingested document IDs, probably because of the new pipeline workflow.
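
To make the proposed change concrete, here is a minimal sketch of the UUID-only workflow: seed with the uuid alone, then look up the node details again when processing, so the seeded ID and the ingested ID are the same string. This is only an illustration under assumptions, not the connector's code. NodeSource, Activities and the per-document fetchNode call are hypothetical stand-ins, not the real Alfresco client or ManifoldCF interfaces.

```java
import java.util.List;
import java.util.Map;

public class UuidSeedingSketch {

  /** Hypothetical stand-in for the Alfresco client (not the connector's real interface). */
  public interface NodeSource {
    // Bulk listing, as used for seeding today.
    List<Map<String, Object>> fetchNodes();
    // Assumed per-document lookup; returns null if the node is unknown.
    Map<String, Object> fetchNode(String uuid);
  }

  /** Hypothetical stand-in for the ManifoldCF activities objects. */
  public interface Activities {
    void addSeedDocument(String documentIdentifier);
    void ingestDocument(String documentIdentifier, Map<String, Object> fields);
    void deleteDocument(String documentIdentifier);
  }

  /** Seed with the uuid only, so the stored document ID is just the uuid. */
  public static void addSeedDocuments(NodeSource source, Activities activities) {
    for (Map<String, Object> node : source.fetchNodes()) {
      activities.addSeedDocument(node.get("uuid").toString());
    }
  }

  /** Each identifier is already the uuid; fetch the rest of the node info per document. */
  public static void processDocuments(String[] documentIdentifiers,
                                      NodeSource source, Activities activities) {
    for (String uuid : documentIdentifiers) {
      Map<String, Object> node = source.fetchNode(uuid);
      // Boolean.TRUE.equals also copes with a missing "deleted" entry.
      if (node == null || Boolean.TRUE.equals(node.get("deleted"))) {
        activities.deleteDocument(uuid);
      } else {
        // Ingest under the same ID that was seeded.
        activities.ingestDocument(uuid, node);
      }
    }
  }
}
```

The cost is the one discussed above: one extra call to Alfresco per document in exchange for identifiers that stay consistent between seeding and ingestion.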

philipmeadows / alfresco-webscript-manifold-connector

Incorrect Document Id while Seeding Documents #17