Open - donpellegrino opened this issue 2 years ago
Hi,
The input of rdf2hdt is a single RDF file, not multiple files.
As you say, depending on the serialization, different things might be possible. I don't think it is the job of this HDT library to design the best way this can be done; it is not an issue of HDT but of the respective RDF serialization.
What can help you, though, is the fact that you can concatenate two HDTs. This is the job of hdtCat. So you can compress two RDF files into HDT and then cat them together. (This functionality is only available in the hdt-java repo, though.)
WARNING: it is more efficient in time to
Hope it helps, D063520
Thanks for the clarification. As for first joining the RDF files and then compressing to HDT, I am trying to avoid creating the very large joined RDF file on disk. I tried piping the output of cat over the files, but it seems rdf2hdt won't accept stdin as an input, and passing "/dev/stdin" gives a resulting HDT file with 0 triples. If there were some way to get HDT to take a pipe as input, that would enable pipelines that join multiple files into one input without also filling the disk with a full duplicate of all those triples in a single file.
How big is your file and how big are your resources? You need more or less a 1:1 ratio between the size of the RDF file and the amount of RAM to perform the compression. HDT is very resource efficient, except at indexing time, where it is very memory hungry.
"How big is your file" - A use case I am exploring is putting all of PubChemRDF (https://pubchemdocs.ncbi.nlm.nih.gov/rdf) into an HDT file. The full collection is 14,378,495,402 triples as per https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/void.ttl. I am exploring a subset that is ~ 40 GB across ~ 700 N-Triples gzipped files. I don't know how big that would be as a single N-Triples uncompressed text file. The creation of that intermediate file on disk is what I am trying to avoid.
"How big are your resources" - I am exploring the techniques on Intel DevCloud (https://devcloud.intel.com/oneapi/home/). It is a heterogeneous cluster, but the 12 nodes having the most RAM have 384 GB each. I also have a Cray EX series with some nodes having 1 TB of RAM, but I would prefer to identify a technique that was not limited by the RAM on a single node.
Perhaps one approach would be switching to a distributed-memory model to scale beyond the memory limits of a single node. I see that Celerity (https://celerity.github.io/) could be a pathway to distributed memory, and it is advertised as compatible with the Intel Data Parallel C++ (DPC++) Compiler on the Intel DevCloud. If someone could point me to the right bit of code, I could explore the feasibility of that approach.
A distributed-memory approach may address the memory limitations but would still leave open the problems of slow I/O and the storage space needed to write out a single massive concatenated RDF file as an intermediate bulk-load artifact.
A bit of Bash scripting would seem to allow streaming triples from multiple compressed files in multiple formats, letting rdf2hdt simply consume N-Triples. I tried the following, but have not figured out why it doesn't work:
time find ~/data/pubchemrdf -name "*.ttl.gz" -exec gunzip --to-stdout --keep {} \; | rdf2hdt /dev/stdin pubchemrdf.hdt
My current suspicion is that rdf2hdt reads all of /dev/stdin, closes it, and then attempts to reopen it and finds it empty. But that is just a guess. I would love any suggestion on where to look to investigate what is happening there and see if a streaming approach could be made to work.
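If that guess is right, the failure is easy to reproduce with a quick shell experiment that has nothing to do with HDT itself; the point is only that a pipe cannot be read twice:
# A pipe can be consumed only once: the second read of /dev/stdin comes back empty.
printf '<s> <p> <o> .\n' | { wc -l /dev/stdin; wc -l /dev/stdin; }
# The first wc reports 1 line, the second reports 0, because the data is already drained.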
I am working in a sandbox with the following:
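Roughly, the sandbox is a plain hdt-cpp checkout built with the autotools steps from the README, along these lines (exact branch and paths may differ):
# Standard hdt-cpp checkout and build; the tool path below is an assumption.
git clone https://github.com/rdfhdt/hdt-cpp.git
cd hdt-cpp
./autogen.sh && ./configure && make -j4
./libhdt/tools/rdf2hdt   # running it with no arguments should print the usage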
Please let me know if that is the right branch and codebase to work from.
Hi,
nice task : )
We have compressed Wikidata (16 billion triples) on a single node with 120 GB of RAM using the following approach. We converted Wikidata to N-Triples and chunked it into pieces of roughly 100 GB so that we could compress each piece to HDT. Then we used hdtCat to cat them together.
You say that you don't want to decompress everything into one file. What you can do is decompress some chunks, convert them to HDT, and then cat those together.
A bzipped or gzipped chunk and the corresponding HDT file occupy more or less the same amount of space. So with your 384 GB of RAM, I would advise decompressing roughly 300 GB of N-Triples and compressing that to HDT1, then the next 300 GB chunk to HDT2, then catting the two together, and continuing like this. From your description this seems feasible.
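As a minimal sketch of that chunked workflow, assuming the data is already split into N-Triples chunks and that hdtCat.sh is the command-line wrapper shipped with hdt-java (file names are placeholders and the argument order is assumed to be input1 input2 output):
# Compress each N-Triples chunk to its own HDT file.
rdf2hdt chunk1.nt chunk1.hdt
rdf2hdt chunk2.nt chunk2.hdt
# Concatenate the two HDTs (hdt-java only).
hdtCat.sh chunk1.hdt chunk2.hdt merged_1_2.hdt
# Fold in the next chunk the same way, and so on.
rdf2hdt chunk3.nt chunk3.hdt
hdtCat.sh merged_1_2.hdt chunk3.hdt merged_1_3.hdt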
Does this make sense to you?
Salut D063520
> My current suspicion is that rdf2hdt reads all of /dev/stdin, closes it, and then attempts to reopen it and finds it empty. But that is just a guess. I would love any suggestion on where to look to investigate what is happening there and see if a streaming approach could be made to work.
Note that this requires creating the HDT in one pass, which is a feature hdt-java has but hdt-cpp doesn't (see https://github.com/rdfhdt/hdt-cpp/issues/47). Short summary: it needs a dictionary first in order to assign IDs to values when compressing the triples, and therefore hdt-cpp reads the input file twice. A while ago, I made this PR to include one-pass ingestion in hdt-cpp, but since I don't know C++, I was hoping somebody would step up ;)
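To make the "dictionary first" point concrete, here is a toy shell analogue of the two passes (this is not hdt-cpp's actual code, and it assumes plain N-Triples lines without spaces inside literals):
# Pass 1: collect the distinct terms of every triple into a sorted "dictionary".
awk '{ print $1; print $2; print $3 }' data.nt | sort -u > dict.txt
# Pass 2: re-read the same input and replace each term with its ID (line number) in dict.txt.
awk 'NR==FNR { id[$0]=FNR; next } { print id[$1], id[$2], id[$3] }' dict.txt data.nt > triples.ids
# The second pass re-opens data.nt, which is exactly what cannot work when data.nt is a pipe.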
> If there were some way to get HDT to take a pipe as input, that would enable pipelines that join multiple files into one input without also filling the disk with a full duplicate of all those triples in a single file.
On a POSIX system, you can actually create a named pipe and send the cat output through it twice for the two passes; it won't cost the disk space of a new file:
mkfifo mypipe.nt
# send the cat twice to the pipe
(cat myfile1.nt myfile2.nt > mypipe.nt ; cat myfile1.nt myfile2.nt > mypipe.nt) &
rdf2hdt mypipe.nt myhdt.hdt
# don't forget to remove it ;)
rm mypipe.nt
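The same trick should extend to the gzipped PubChemRDF files from the find command above. This is an untested sketch that simply writes the decompressed stream into the FIFO once per pass (it assumes find enumerates the files in the same order both times):
mkfifo pubchem.nt
# Feed the decompressed stream into the FIFO twice, once for each rdf2hdt pass.
(for pass in 1 2; do
  find ~/data/pubchemrdf -name "*.ttl.gz" -exec gunzip --to-stdout --keep {} \; > pubchem.nt
done) &
rdf2hdt pubchem.nt pubchemrdf.hdt
rm pubchem.nt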
Otherwise, the Java version comes with a one-pass parser and a directory parser, which can be better for parsing multiple RDF files.
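I think the directory-parser invocation is roughly the following, but please check the rdf2hdt.sh help output in hdt-java for the exact option name before relying on it:
# Rough sketch; the -rdftype option name and its dir value are unverified guesses.
./rdf2hdt.sh -rdftype dir ~/data/pubchemrdf pubchemrdf.hdt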
The current rdf2hdt command-line interface only accepts one RDF input file and overwrites the output file. Therefore, it is not possible to use the rdf2hdt CLI to build up a single HDT file from many RDF input files. A work-around is to concatenate all the RDF files into a single input and then pass that to rdf2hdt. However, that can be inefficient when working with many files, and it only works for RDF formats that can be concatenated; RDF/XML, for example, would require an extra step to convert to N-Triples.
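For illustration, that work-around looks roughly like this; rapper (from the Raptor toolkit) is just one possible converter for the RDF/XML case, and the file names are placeholders:
# Convert any RDF/XML inputs to N-Triples first (rapper is one possible converter).
rapper -i rdfxml -o ntriples some_graph.rdf > some_graph.nt
# Concatenate everything into one large intermediate file, then compress it.
cat *.nt > all.nt
rdf2hdt all.nt all.hdt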
It would be useful for rdf2hdt to accept multiple input files in one run.
It would also be useful if rdf2hdt could optionally append to an existing HDT file instead of replacing it.
If there is already another usage pattern for bulk loading multiple RDF inputs into one HDT file, please let me know.