Closed bergos closed 5 years ago
Transform is the interface used to go from one stream to another. For just consumers, I don't think it's the right name.
But that's what it does. It looks a little bit different because we use only Readable streams. But we can use a different name to make it clear that it's not the Node.js Duplex (Readable/Writable) Transform.
Proposals for other names are very welcome!
I've always thought of RDFa processing as "harvesting" quads from a resource. Meanwhile, JSON-LD and Turtle are formats for triples and so we "parse" the format to read the triples. In a hybridized environment (i.e., JSON-LD or Turtle in script elements with possible RDFa), we're back in the "harvest" mode where we are reading information from the markup to produce triples.
In that sense, "harvesting" requires traversing a data format to find information from which quads can be constructed. In contrast, "parsing" assumes you have a specific quad-oriented format as a sequence of characters.
Green Turtle does both and makes this distinction, mostly due to the starting point of processing documents. The processor is "attached" to a document (i.e., harvesting) and then it has specific "parsers" for each triple data format supported (i.e., JSON-LD, Turtle, Microdata). When it constructs a string from the document in a particular format, it invokes the data format parser using a "parse" method.
Meanwhile, from a document/markup perspective, each data format defines its own way to process the attached document. It isn't a pure visitor pattern so each data format may traverse the document with its own algorithm. I would certainly optimize that differently in a refactoring.
I would like to see interfaces that presume more than streaming interfaces. If you start with a built document (e.g., a node), you should be able to treat that as a source. Similarly, if you start with some kind of hybridized input (e.g., markup that contains more than one representation of quads), you should be able to both harvest triples and parse the various data formats it contains to produce one stream of quads as output.
That's basically what the Ruby RDFa reader does: create a DOM from the input file, use it to parse RDFa. Subsequently, look for script elements and locate other readers from the @type attribute on the script element, passing them the text content of the script element. Also, if the document contains @itemscope, parse the DOM using a Microdata reader. If it contains an rdf:RDF element, read RDF/XML.
Note that script elements may contain some comment information that may need to be stripped out to be legal input.
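The dispatch described above (find script elements, pick a reader by @type, strip comment wrappers, merge everything into one sequence of quads) could be sketched roughly as follows. This is only an illustration under stated assumptions: `harvest`, `stripWrappers`, `parseToyTriples` and the `text/x-toy-triples` media type are hypothetical names, the script elements are plain `{ type, text }` objects standing in for DOM nodes, and the RDFa and Microdata passes over the DOM are left out.

```javascript
// Hypothetical sketch: none of these names come from an existing library.

// Strip an HTML comment or CDATA wrapper that may surround the data in a script element.
function stripWrappers (text) {
  return text
    .replace(/^\s*(?:<!--|<!\[CDATA\[)/, '')
    .replace(/(?:-->|\]\]>)\s*$/, '')
    .trim()
}

// Harvest quads from script elements, dispatching on the type attribute.
// scripts: iterable of { type, text } objects (stand-ins for DOM script elements).
// parsers: Map from media type to a function (text) => iterable of quads.
function * harvest (scripts, parsers) {
  for (const { type, text } of scripts) {
    const parse = parsers.get(type)
    if (!parse) continue // no reader registered for this media type
    yield * parse(stripWrappers(text))
  }
}

// Toy parser, for illustration only: one "subject predicate object" triple per line.
function * parseToyTriples (text) {
  for (const line of text.split('\n')) {
    const [subject, predicate, object] = line.trim().split(/\s+/)
    if (object !== undefined) yield { subject, predicate, object }
  }
}

const parsers = new Map([['text/x-toy-triples', parseToyTriples]])
const scripts = [
  { type: 'text/x-toy-triples', text: '<!--\n:a :b :c\n:d :e :f\n-->' },
  { type: 'text/javascript', text: 'console.log(1)' } // ignored: no matching reader
]

const quads = [...harvest(scripts, parsers)]
console.log(quads.length) // 2
```

A real implementation would register actual JSON-LD or Turtle parsers in the map and would likely produce a stream rather than a synchronous generator.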
Yes, it's a good idea to extend the interface to also support non-stream objects. We also need an interface for the serializer. I was thinking again about the name. What about Reader and Writer?
Reader Interface:
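class Reader {
  constructor (options) {
    this.options = options
  }
  // input: a stream (or other input object) in some serialization
  read (input) {
    // placeholder: a Source, i.e. a Readable stream that emits quads
    return source
  }
}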
Example:
readerInstance.read(readableFileStream).on('data', (quad) => {
  // handle quads
}).on('end', () => {
  // done
})
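To make the proposed Reader shape more concrete, here is a minimal sketch of one possible implementation. It is not taken from the PR: `ToyLineReader` and `parseLine` are hypothetical, the "format" is a toy one with a whitespace-separated `subject predicate object` triple per line, and the input is assumed to be an async iterable of text chunks (a Node.js Readable qualifies).

```javascript
const { Readable } = require('stream')

// Hypothetical toy format: one "subject predicate object" triple per line.
function * parseLine (line) {
  const [subject, predicate, object] = line.trim().split(/\s+/)
  if (object !== undefined) yield { subject, predicate, object }
}

class ToyLineReader {
  constructor (options = {}) {
    this.options = options
  }

  // input: async iterable of text chunks, e.g. a Node.js Readable.
  // Returns a Readable in object mode that emits quad-like objects.
  read (input) {
    async function * toQuads () {
      let buffered = ''
      for await (const chunk of input) {
        buffered += chunk
        const lines = buffered.split('\n')
        buffered = lines.pop() // keep the incomplete last line for the next chunk
        for (const line of lines) yield * parseLine(line)
      }
      yield * parseLine(buffered) // flush whatever is left at the end
    }
    return Readable.from(toQuads())
  }
}

// Usage, mirroring the example above; the second triple is split across two chunks.
const input = Readable.from([':a :b :c\n:d ', ':e :f\n'])
const collected = []
new ToyLineReader().read(input)
  .on('data', (quad) => collected.push(quad))
  .on('end', () => console.log(collected))
```

Because `read` accepts any async iterable, the same method could also be fed a non-stream source such as an array of strings, which is one way to cover the non-stream case discussed earlier.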
Writer interface:
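class Writer {
  constructor (options) {
    this.options = options
  }
  // output: a stream (or other output object) that receives the serialization
  write (output) {
    // placeholder: a Sink, i.e. an object with an import(quadStream) method
    return sink
  }
}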
Example:
writerInstance.write(writeableFileStream).import(quadStream).on('end', () => {
  // done
})
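A matching sketch for the Writer side, under the same caveats: `ToyLineWriter` is hypothetical, it serializes each quad as one `subject predicate object` line, and it ignores backpressure for brevity. The object returned by `write` plays the role of the sink, and `import` returns an event emitter that fires `end` once the output has been flushed.

```javascript
const { EventEmitter } = require('events')
const { Readable, Writable } = require('stream')

class ToyLineWriter {
  constructor (options = {}) {
    this.options = options
  }

  // output: a Node.js Writable. Returns a sink-like object with an import() method.
  write (output) {
    return {
      import (quadStream) {
        const events = new EventEmitter()
        quadStream.on('data', (quad) => {
          // Backpressure is ignored here to keep the sketch short.
          output.write(`${quad.subject} ${quad.predicate} ${quad.object}\n`)
        })
        quadStream.on('end', () => {
          // Signal 'end' only after the output has finished writing.
          output.end(() => events.emit('end'))
        })
        quadStream.on('error', (err) => events.emit('error', err))
        return events
      }
    }
  }
}

// Usage, mirroring the example above, with an in-memory Writable instead of a file.
const chunks = []
const output = new Writable({
  write (chunk, encoding, callback) {
    chunks.push(chunk.toString())
    callback()
  }
})
const quadStream = Readable.from([{ subject: ':a', predicate: ':b', object: ':c' }])

new ToyLineWriter().write(output).import(quadStream).on('end', () => {
  console.log(chunks.join(''))
})
```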
I created a PR for the interfaces. Please add your comments there and vote +/-1 for it.
@bergos can we close this issue?
The Source interface doesn't define how an input stream, such as a file stream, is fed to the Source instance. A common way to do this would be useful. Because the Source interface is not exclusive to parsers, we should avoid methods like .parse. We can define a new name like .transform, which describes the input/output behavior. It would also be possible to align it with the Sink interface and name it .import. This is how the interface should look: