marklogic-community / marklogic-spring-batch

Write batch processing applications in MarkLogic

add support to loading tables as triples #289

Closed divino closed 6 years ago

divino commented 6 years ago

Refer to https://github.com/divino/ml-migration-starter/tree/migrate-as-triplesv4 for an example use.

divino commented 6 years ago

Run the starter with "gradle setupH2 migrate".

divino commented 6 years ago

Dino: By the way, I have some design issues that prevent me from moving ahead quickly with this:

    1) My goal was to make the reader, writer and processor as decoupled as possible, but I also don’t want to lose the richness of the data that comes to the reader, which is the reason I removed the processor. Ideally, these components should be plug and play with other readers or writers.

    2) I think for maximum flexibility we might be better off just using JdbcCursorItemReader. The downside is it will lose the benefit of the current sample project that can just get all the tables and write them as triples.

    3) If I use JdbcCursorItemReader, I also have to figure out how to create dynamic jobs in Spring Batch. From what I read, it looks like that is not possible, or we use tasklets, but tasklets don’t use a reader, writer and processor (at least based on my research).

    4) In the future, I also want to add the ability to load configuration files that will handle many-to-many relationships, cases where we want to control which ID to use or where there is no primary key, and foreign keys dynamically.

Damon: Sounds like good work, and I’d like to understand this a little better. For Reader/Writer/Processor, the key will be the input/output interfaces. What does the Reader read, and what does the Writer accept as input are therefore key questions. (This will then determine the input/output for the Processor.)

Dino:
Reader - reads the row and saves it as a map together with the table metadata.
Writer - accepts the map and writes it as triples with the following format:
    Subject: /#
    Predicate: //has
    Object: with datatype equal to the datatype from the metadata
    Graph Name: /
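As a rough illustration of the writer contract described above, the following plain-Java sketch turns one row map into N-Triples lines. The class name RowToTriples, the "id" key column, and the URI layout (base/table#key for the subject, base/table/hasColumn for the predicate) are assumptions made for illustration; the placeholders in the format above did not survive in the comment, so the real project may lay out its URIs differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowToTriples {

    // Hypothetical helper: builds one N-Triples line per non-key column.
    // Assumed layout: subject = base/table#key, predicate = base/table/hasColumn.
    public static List<String> toTriples(String baseUri, String table, String pkColumn,
                                         Map<String, Object> row) {
        String subject = "<" + baseUri + "/" + table + "#" + row.get(pkColumn) + ">";
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, Object> column : row.entrySet()) {
            if (column.getKey().equals(pkColumn) || column.getValue() == null) {
                continue; // the key is already in the subject; NULLs produce no triple
            }
            String predicate = "<" + baseUri + "/" + table + "/has" + column.getKey() + ">";
            String object = "\"" + escape(column.getValue().toString()) + "\"";
            triples.add(subject + " " + predicate + " " + object + " .");
        }
        return triples;
    }

    // Escapes backslashes and double quotes for a string literal.
    public static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 100);
        row.put("Name", "Norma");
        toTriples("http://sample.org", "customer", "id", row).forEach(System.out::println);
    }
}
```

Keeping the conversion in a helper that only sees a Map is one way to keep the writer independent of whichever reader produced the row.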

Damon:
If those are generic concepts, I expect we will get good re-use from these components. A quick browse suggests that the Reader reads a Map of objects (representing rows), but I did not see table metadata awareness – will this write all triples in a format that does not know/understand what came from which table?

Dino: I am using the data type of the column value to set the datatype of the object of the triple. I am also using the primary key in the subject.
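To make the datatype idea concrete, here is a small sketch that picks an XSD datatype from the Java type of a column value and formats a typed literal. The class TypedLiteral and this particular type mapping are assumptions for illustration; the project itself takes the datatype from the table metadata and may map types differently.

```java
import java.math.BigDecimal;
import java.sql.Date;

public class TypedLiteral {
    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Assumed mapping from the Java type of a JDBC column value to an XSD datatype IRI.
    public static String xsdTypeFor(Object value) {
        if (value instanceof Integer) return XSD + "int";
        if (value instanceof Long) return XSD + "long";
        if (value instanceof BigDecimal) return XSD + "decimal";
        if (value instanceof Double) return XSD + "double";
        if (value instanceof Boolean) return XSD + "boolean";
        if (value instanceof Date) return XSD + "date";
        return XSD + "string"; // fallback for anything not listed above
    }

    // Formats the value as an N-Triples style typed literal: "lexical"^^<datatype>.
    public static String literal(Object value) {
        String lexical = (value instanceof BigDecimal)
                ? ((BigDecimal) value).toPlainString() // xsd:decimal does not allow exponent notation
                : value.toString();
        lexical = lexical.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + lexical + "\"^^<" + xsdTypeFor(value) + ">";
    }

    public static void main(String[] args) {
        System.out.println(literal(100));
        System.out.println(literal("Norma"));
        System.out.println(literal(new BigDecimal("1E+3")));
    }
}
```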

Damon: The writer, writing a bunch of triples, makes sense, though I wonder if there is a MarkLogic client API way to do that.

Dino: Ohhh … looks like there is, but Jena makes it a little more convenient.

This is for ML Client:

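import com.marklogic.client.io.StringHandle;
import com.marklogic.client.semantics.GraphManager;
import com.marklogic.client.semantics.RDFMimeTypes;

// client is an existing DatabaseClient
GraphManager graphMgr = client.newGraphManager();
StringHandle stringHandle = new StringHandle()
    .with("<http://example.org/subject2> " +
          "<http://example.org/predicate2> " +
          "<http://example.org/object2> .")
    .withMimetype(RDFMimeTypes.TURTLE);
// merge() adds the triples to the named graph
graphMgr.merge("myExample/graphUri", stringHandle);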

On Jena:

// client is an existing DatabaseClient
MarkLogicDatasetGraph dsg = MarkLogicDatasetGraphFactory
        .createDatasetGraph(client);
String baseUri = "http://sample.org/";
// the base URI doubles as the graph name
Node node = NodeFactory.createURI(baseUri);
Graph graph = GraphFactory.createDefaultGraph();
dsg.addGraph(node, graph);
dsg.getGraph(node).add(new Triple(
        NodeFactory.createURI(baseUri + "#100"),
        NodeFactory.createURI(baseUri + "/hasName"),
        NodeFactory.createLiteral("Norma")
));

Damon:
I do see a processor in there, so I'm not sure why you say below that there’s no processor.

Dino: I am using a PassThroughItemProcessor … right now it is just a dummy -- used only because a processor is required by the step.

Damon:
Can you describe all this in terms of the reader/writer/processor and data types used?

Dino:
Reader returns data as a Map.
Processor accepts a Map and returns a Map.
Writer accepts a Map.
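A minimal sketch of how those types line up, written in plain Java so it runs without Spring on the classpath. The interfaces RowReader, RowProcessor and RowWriter are hypothetical stand-ins that loosely mirror Spring Batch's ItemReader, ItemProcessor and ItemWriter; the chunk loop is a simplification of a chunk-oriented step, not the framework's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class MapPipelineSketch {
    // Hypothetical stand-ins for the reader, processor and writer contracts.
    public interface RowReader { Map<String, Object> read(); } // null means end of input
    public interface RowProcessor { Map<String, Object> process(Map<String, Object> row); }
    public interface RowWriter { void write(List<Map<String, Object>> rows); }

    // Reads rows until the reader returns null, writing them in chunks of chunkSize.
    // Returns the number of rows processed.
    public static int run(RowReader reader, RowProcessor processor, RowWriter writer, int chunkSize) {
        List<Map<String, Object>> chunk = new ArrayList<>();
        int count = 0;
        Map<String, Object> row;
        while ((row = reader.read()) != null) {
            chunk.add(processor.process(row));
            count++;
            if (chunk.size() == chunkSize) {
                writer.write(chunk);
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) {
            writer.write(chunk); // flush the final partial chunk
        }
        return count;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> table = new ArrayList<>();
        Map<String, Object> first = new HashMap<>();
        first.put("id", 100);
        first.put("Name", "Norma");
        table.add(first);
        Iterator<Map<String, Object>> rows = table.iterator();

        int processed = run(
                () -> rows.hasNext() ? rows.next() : null, // reader
                r -> r,                                    // pass-through processor
                chunk -> System.out.println(chunk),        // writer
                10);
        System.out.println(processed + " row(s) processed");
    }
}
```

Because every stage speaks Map, the pass-through processor can later be replaced by one that reshapes the row without touching the reader or writer.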

sastafford commented 6 years ago

Dino, can you close this pull request and re-open it to merge with the DEV branch? This is a request against master, which is intended for releases.