uvacw / inca


importer class incorrectly assumes that all documents are of one single doctype #319

Open damian0604 opened 6 years ago

damian0604 commented 6 years ago

It seems that the class `Importer(BaseImportExport)`, as defined in `core/import_export_classes`, assumes that the batch to be imported is of a single doctype (`doctype` is a mandatory argument of the `.run()` method).

This makes the importer incompatible with the exporter: if I use the exporter to export a batch of JSON documents that happen to have multiple doctypes, I cannot import them back using the importer.

It would be nice if this could be fixed, so that the JSON importers/exporters in the `importers_exporters/` folder can be used to transfer documents between ES instances and for backup purposes.
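To illustrate the mismatch: a mixed export contains documents with different `doctype` fields, while `.run()` takes one doctype for the whole batch. Until this is fixed, a possible workaround (a sketch, assuming the export is a JSON array of document dicts that each carry a `doctype` key; `group_by_doctype` is a hypothetical helper, not part of INCA) is to group the exported documents and call the importer once per doctype:

```python
import json
from itertools import groupby


def group_by_doctype(path):
    """Group exported documents by their 'doctype' field, so that each
    group can be fed to the current single-doctype importer separately.

    Assumes `path` points to a JSON array of document dicts.
    """
    with open(path) as f:
        docs = json.load(f)
    key = lambda d: d['doctype']
    # groupby only merges adjacent items, so sort by doctype first
    return {dt: list(grp) for dt, grp in groupby(sorted(docs, key=key), key=key)}
```

Each resulting group could then be written to its own file and imported with the existing `.run(path, doctype=dt)` call, at the cost of one pass per doctype.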

This needs to be resolved to solve https://github.com/uvacw/inca/issues/291

damian0604 commented 6 years ago

I just realized that of course we HAVE already fixed this once, namely in the LexisNexis importer. There we do it as follows:

```python
def run(self, path, *args, **kwargs):
    """Uses the documents from the load method in batches."""
    # This method is overwritten because, in contrast to
    # other importers, we do not have a single doctype:
    # each document can have a different one.
    for doc in self.load(path, *args, **kwargs):
        self._ingest(iterable=doc, doctype=doc['doctype'])
    self.processed += 1
```

Anyhow, we need to make sure this works for the JSON importer (and in principle also the CSV one) as well.
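A generalized version of that override might look like the sketch below. This is not INCA code: `MultiDoctypeImporter`, its `load`, and its `_ingest` are stand-ins for `BaseImportExport` and its methods of the same names, and the optional `doctype` fallback argument is a hypothetical addition. The idea is simply to read the doctype per document instead of per batch:

```python
class MultiDoctypeImporter:
    """Minimal sketch of an importer that reads the doctype per document.

    The real class would inherit from BaseImportExport; `load` and
    `_ingest` here are simplified stand-ins for the INCA methods.
    """

    def __init__(self):
        self.processed = 0
        self.ingested = []  # stand-in for the Elasticsearch index

    def load(self, docs):
        # In INCA, load() reads documents from a path;
        # here it simply yields the given dicts.
        yield from docs

    def _ingest(self, iterable, doctype):
        self.ingested.append((doctype, iterable))

    def run(self, docs, doctype=None, *args, **kwargs):
        """Ingest documents one by one, taking the doctype from each
        document and falling back to the `doctype` argument for
        documents that lack a 'doctype' field."""
        for doc in self.load(docs, *args, **kwargs):
            self._ingest(iterable=doc, doctype=doc.get('doctype', doctype))
            self.processed += 1
```

With this shape, a batch exported from several doctypes round-trips without the caller having to know a single doctype up front, while single-doctype batches still work via the fallback argument.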