rdfjs / stream-spec

RDF/JS: Stream interfaces – A specification of a low level interface definition representing RDF data independent of a serialized format in a JavaScript environment.
https://rdf.js.org/stream-spec/

Parser or Transform interface #7

Closed bergos closed 5 years ago

bergos commented 8 years ago

The Source interface doesn't define how an input stream, such as a file stream, is fed to the Source instance. A common way to do this would be useful. Because the Source interface is not exclusive to parsers, we should avoid methods like .parse. We can define a new name like .transform, which describes the input/output behavior. It would also be possible to align it with the Sink interface and name it .import. This is what the interface should look like:

class Transform {
    // all instance specific options must be assigned in the constructor
    constructor (options) {
      this.options = options
    } 

    transform(inputStream) {
      // outputStream implements the Source interface
      return outputStream
    }
}
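
To make the shape of the proposal concrete, here is a minimal runnable sketch. It assumes a made-up input format of one whitespace-separated "subject predicate object" line per quad; the names LineTransform and toQuad and the graph option are hypothetical, and a Node.js object-mode Readable stands in for a stream implementing the Source interface.

```javascript
const { Readable } = require('stream')

// Hypothetical helper: split one "subject predicate object" line
// into a plain quad-like object. This is not a real N-Triples parser.
function toQuad (line, graph) {
  const [subject, predicate, object] = line.trim().split(/\s+/)
  return { subject, predicate, object, graph }
}

class LineTransform {
  // all instance-specific options are assigned in the constructor
  constructor (options = {}) {
    this.options = options
  }

  transform (inputStream) {
    const graph = this.options.graph || null

    async function * quads () {
      let buffered = ''
      for await (const chunk of inputStream) {
        // Assumption: chunks are strings (or Buffers of single-byte text)
        buffered += chunk
        const lines = buffered.split('\n')
        buffered = lines.pop() // keep the incomplete last line for the next chunk
        for (const line of lines) {
          if (line.trim() !== '') yield toQuad(line, graph)
        }
      }
      // Flush a final line that had no trailing newline
      if (buffered.trim() !== '') yield toQuad(buffered, graph)
    }

    // The output is an object-mode Readable, standing in for a Source stream
    return Readable.from(quads())
  }
}
```

The returned stream can be consumed with .on('data', ...) or with for await, and the input can be any Readable that yields text, for example one created by fs.createReadStream with an encoding set.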

RubenVerborgh commented 8 years ago

Transform is the interface used to go from one stream to another. For just consumers, I don't think it's the right name.

bergos commented 8 years ago

But that's what it does. It looks a little bit different, because we use only Readable streams. But we can use a different name, to make clear it's not the Node.js Duplex (Readable/Writable) Transform.

Proposals for other names are very welcome!

alexmilowski commented 8 years ago

I've always thought of RDFa processing as "harvesting" quads from a resource. Meanwhile, JSON-LD and Turtle are formats for triples and so we "parse" the format to read the triples. In a hybridized environment (i.e., JSON-LD or Turtle in script elements with possible RDFa), we're back in the "harvest" mode where we are reading information from the markup to produce triples.

In that sense, "harvesting" requires traversing a data format to find information from which quads can be constructed. In contrast, "parsing" assumes you have a specific quad-oriented format as a sequence of characters.

Green Turtle does both and makes this distinction, mostly due to the starting point of processing documents. The processor is "attached" to a document (i.e., harvesting) and then it has specific "parsers" for each triple data format supported (e.g., JSON-LD, Turtle, Microdata). When it constructs a string from the document in a particular format, it invokes the data format parser using a "parse" method.

Meanwhile, from a document/markup perspective, each data format defines its own way to process the attached document. It isn't a pure visitor pattern so each data format may traverse the document with its own algorithm. I would certainly optimize that differently in a refactoring.

I would like to see interfaces that presume more than streaming interfaces. If you start with a built document (e.g., a node), you should be able to treat that as a source. Similarly, if you start with some kind of hybridized input (e.g., markup that contains more than one representation of quads), you should be able to both harvest triples and parse various data formats contained to produce one stream of quads as output.

gkellogg commented 8 years ago

That's basically what the Ruby RDFa reader does: create a DOM from the input file, use it to parse RDFa. Subsequently, look for script elements and locate other readers from the @type attribute on the script element, passing them the text content of the script element. Also, if the document contains @itemscope, parse the DOM using a Microdata reader. If it contains an rdf:RDF element, read RDF/XML.

Note that script elements may contain some comment information that may need to be stripped out to be legal input.
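
The dispatch step described above could be sketched in JavaScript roughly as follows. This is only an illustration, not the Ruby reader's code: script elements are assumed to be already extracted into plain { type, text } objects, the registry exampleReaders holds placeholder functions rather than real parsers, and stripping an HTML comment or CDATA wrapper is an assumed reading of "comment information".

```javascript
// Hypothetical registry: media type -> function that turns text into results.
// The functions are placeholders, not real JSON-LD or Turtle parsers.
const exampleReaders = new Map([
  ['application/ld+json', (text) => [{ format: 'json-ld', value: JSON.parse(text) }]],
  ['text/turtle', (text) => [{ format: 'turtle', value: text.trim() }]]
])

// Remove a leading "<!--" or "<![CDATA[" and a trailing "-->" or "]]>"
// so the remaining text is legal input for the format's reader.
function stripWrapper (text) {
  return text
    .replace(/^\s*(<!--|<!\[CDATA\[)/, '')
    .replace(/(-->|\]\]>)\s*$/, '')
}

// Pass each script's text content to the reader registered for its type.
function readScripts (scripts, readers) {
  const results = []
  for (const { type, text } of scripts) {
    const read = readers.get(type)
    if (!read) continue // no reader registered for this type: skip it
    results.push(...read(stripWrapper(text)))
  }
  return results
}

const results = readScripts([
  { type: 'application/ld+json', text: '<!-- {"@id": "ex:a"} -->' },
  { type: 'text/javascript', text: 'console.log(1)' }
], exampleReaders)
```

Here only the first script produces output, because no reader is registered for text/javascript.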

bergos commented 8 years ago

Yes, it's a good idea to extend the interface to also support non-stream objects. We also need an interface for the serializer. I was thinking again about the name. What about Reader and Writer?

Reader Interface:

class Reader {
  constructor (options) {
    this.options = options
  }

  read (input) {
    return source
  }
}

Example:

readerInstance.read(readableFileStream).on('data', (quad) => {
  // handle quads
}).on('end', () => {
  // done
})
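
For illustration, a runnable sketch of such a Reader that accepts both stream and non-stream input might look like this. The class name SimpleReader and the line-based "subject predicate object" input format are assumptions made for the example, and the sketch buffers all input before emitting quads, so it is not truly streaming.

```javascript
const { Readable } = require('stream')

// Hypothetical reader: accepts a string or a stream of text chunks and
// returns an object-mode Readable of plain quad-like objects.
class SimpleReader {
  constructor (options = {}) {
    this.options = options
  }

  read (input) {
    // A plain string is treated as a single chunk
    const chunks = typeof input === 'string' ? [input] : input

    async function * quads () {
      let text = ''
      for await (const chunk of chunks) text += chunk // buffers everything first
      for (const line of text.split('\n')) {
        const parts = line.trim().split(/\s+/)
        if (parts.length === 3) { // ignore blank or malformed lines
          const [subject, predicate, object] = parts
          yield { subject, predicate, object }
        }
      }
    }

    return Readable.from(quads())
  }
}

const readerInstance = new SimpleReader()
const seen = []
const done = new Promise((resolve, reject) => {
  readerInstance.read('ex:a ex:knows ex:b\nex:b ex:knows ex:c\n')
    .on('data', (quad) => seen.push(quad)) // handle quads
    .on('end', resolve) // done
    .on('error', reject)
})
```

The same read call works when the string is replaced by a Readable that yields text chunks.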

Writer interface:

class Writer {
  constructor (options) {
    this.options = options
  }

  write (output) {
    return sink
  }
}

Example:

writerInstance.write(writeableFileStream).import(quadStream).on('end', () => {
  // done
})
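
A minimal sketch of a Writer with this shape, under stated assumptions: the class name SimpleWriter and the one-line-per-quad output format are hypothetical, the returned sink is a plain object with an import method rather than a full Sink implementation, and backpressure from the output stream is ignored for brevity.

```javascript
const { EventEmitter } = require('events')
const { Readable, Writable } = require('stream')

class SimpleWriter {
  constructor (options = {}) {
    this.options = options
  }

  write (output) {
    // The sink mirrors the shape of Sink: import(stream) returns an emitter
    return {
      import (quadStream) {
        const events = new EventEmitter()
        quadStream.on('data', (quad) => {
          // Note: the return value of write() is ignored, so no backpressure
          output.write(`${quad.subject} ${quad.predicate} ${quad.object} .\n`)
        })
        quadStream.on('end', () => output.end())
        quadStream.on('error', (err) => events.emit('error', err))
        // Report 'end' only after the output has flushed everything
        output.on('finish', () => events.emit('end'))
        return events
      }
    }
  }
}

// Usage with an in-memory Writable standing in for a file stream
const written = []
const memoryStream = new Writable({
  write (chunk, encoding, callback) {
    written.push(chunk.toString())
    callback()
  }
})
const quadStream = Readable.from([
  { subject: 'ex:a', predicate: 'ex:knows', object: 'ex:b' }
])
const finished = new Promise((resolve) => {
  new SimpleWriter().write(memoryStream).import(quadStream).on('end', resolve)
})
```

Emitting 'end' on the output's 'finish' event, rather than when the quad stream ends, is a design choice of this sketch so that callers can rely on the data having been written.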

bergos commented 8 years ago

I created a PR for the interfaces. Please add your comments there and vote +/-1 for it.

elf-pavlik commented 5 years ago

@bergos can we close this issue?