netwerk-digitaal-erfgoed / ld-workbench

A CLI tool for transforming large RDF datasets using pure SPARQL.

Results are not streamed to file #26

Closed wouterbeek closed 10 months ago

wouterbeek commented 10 months ago

Observation

When I run LD Workbench for a longer time, the result file remains empty. Only upon terminating the run does the file get written to disk. I do not understand how this works, since the file can be of arbitrary size.

Expected

I expect the file to be continuously written to disk while LD Workbench is running. There may be a buffer size, but that should be very small (kilobyte-level or smaller).

philipperenzen commented 10 months ago

The generated quads are streamed to a buffer here (writer), but the file is only written when that stage emits an 'end' event (see code here). As a result, the file is only written once a stage has finished.
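To make the distinction under discussion concrete, here is a minimal sketch in TypeScript. It is not LD Workbench's code: `generateLines`, `writeAtEnd`, and `writeIncrementally` are hypothetical names, and the "stage" is simulated by a generator that yields a few serialized triples. It only contrasts writing the result once the producer has finished with appending each result as soon as it is produced, and records the on-disk file size after every produced line.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical stand-in for a stage that produces serialized quads one at a time.
function* generateLines(count: number): Generator<string> {
  for (let i = 0; i < count; i++) {
    yield `<http://example.org/s${i}> <http://example.org/p> "o${i}" .\n`;
  }
}

// Strategy A: collect all lines and write the file only after the producer is done.
// Returns the on-disk file size observed after each line was produced.
function writeAtEnd(file: string, lines: Iterable<string>): number[] {
  fs.writeFileSync(file, ""); // create the (empty) result file up front
  const sizes: number[] = [];
  const buffered: string[] = [];
  for (const line of lines) {
    buffered.push(line); // held in memory until the end
    sizes.push(fs.statSync(file).size);
  }
  fs.writeFileSync(file, buffered.join(""));
  return sizes;
}

// Strategy B: append each line to the file as soon as it is produced.
function writeIncrementally(file: string, lines: Iterable<string>): number[] {
  const fd = fs.openSync(file, "w");
  const sizes: number[] = [];
  try {
    for (const line of lines) {
      fs.writeSync(fd, line);
      sizes.push(fs.statSync(file).size);
    }
  } finally {
    fs.closeSync(fd);
  }
  return sizes;
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "stream-demo-"));
const atEndFile = path.join(dir, "at-end.nt");
const incrementalFile = path.join(dir, "incremental.nt");

const atEnd = writeAtEnd(atEndFile, generateLines(3));
const incremental = writeIncrementally(incrementalFile, generateLines(3));

console.log(atEnd); // file size stays 0 while the producer is running
console.log(incremental); // file size grows with every produced line
```

Both strategies end with identical file contents. The difference is only in what an observer sees while the run is in progress, and in how much has to be held back until the end.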

wouterbeek commented 10 months ago

@philipperenzen I do not really want to read TypeScript. I have tried to make my questions even more specific:

  1. What is the storage medium for the buffer? Is it disk, memory, or something else?
  2. If the storage medium is file-based, why is the buffer not identical to the file, i.e. what is the specific benefit of having two file-based buffers?
  3. If the storage is memory-based, what is the specific benefit of having the buffer in memory (given that memory is much more expensive than disk)?
  4. If the storage is memory-based, will a larger job not inherently result in more memory usage? And is this not a fundamental implementation flaw?

wouterbeek commented 10 months ago

Clarified by @philipperenzen and Laurens. Results are not written to memory, so there is no bottleneck there.