rdfhdt / hdt-java

HDT Java library and tools.

System.out.println() output #189

Closed: KonradHoeffner closed this issue 1 year ago

KonradHoeffner commented 1 year ago

I am benchmarking several RDF libraries with a benchmark suite that creates CSV output, but HDT Java prints several lines of output that corrupt the CSV files. It seems that this cannot be disabled via logging, as the lines are written directly via System.out.println. Would it be possible to disable those statements or use a logging library instead?

Here is the output of HDTManager.loadIndexedHDT:

Predicate Bitmap in 114 ms 686 us
Count predicates in 493 ms 943 us
Count Objects in 176 ms 773 us Max was: 1414214
Bitmap in 25 ms 70 us
Object references in 992 ms 947 us
Sort object sublists in 375 ms 683 us
Count predicates in 116 ms 446 us
Index generated in 1 sec 687 ms 321 us
Index generated and saved in 2 sec 489 ms 103 us

There are several System.out.println statements in https://github.com/rdfhdt/hdt-java/blob/master/hdt-java-core/src/main/java/org/rdfhdt/hdt/hdt/impl/HDTImpl.java.
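Until the statements are changed in the library, one possible caller-side workaround is to point System.out at System.err while the noisy call runs, so that only the benchmark's own CSV line reaches stdout. The following is a minimal sketch, not part of hdt-java: the class name QuietStdout and the helper withStdoutOnStderr are hypothetical, and the lambda in main merely stands in for a call such as HDTManager.loadIndexedHDT.

```java
import java.io.PrintStream;
import java.util.concurrent.Callable;

public class QuietStdout {

    /**
     * Runs the task while System.out is redirected to System.err,
     * then restores the original stream even if the task throws.
     * System.setOut changes global state, so this is only suitable
     * for single-threaded benchmark programs.
     */
    public static <T> T withStdoutOnStderr(Callable<T> task) throws Exception {
        PrintStream original = System.out;
        System.setOut(System.err);
        try {
            return task.call();
        } finally {
            System.setOut(original);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a library call that prints timing lines.
        int triples = withStdoutOnStderr(() -> {
            System.out.println("Index generated in 1 sec 687 ms 321 us"); // lands on stderr
            return 42;
        });
        // Only this line is written to stdout.
        System.out.println("hdt-java,load," + triples);
    }
}
```

Callable is used rather than Supplier so that the wrapped call may throw checked exceptions such as IOException.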

D063520 commented 1 year ago

In this class there is already a logger, so we can just use it for these messages. The only thing I'm not sure about is whether the Java command-line tools would break this way. My best guess is that the direct output is used for that reason.

Can you explain your use case in more detail? I'm still a bit confused about why you use stdout to create your CSV. Can you not write to a file?
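The idea of routing these messages through a logger can be illustrated with a small sketch. It uses java.util.logging only so that it runs without dependencies; that choice is an assumption made for illustration and says nothing about which logging framework HDTImpl uses. The class IndexTimer and the method buildIndex are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class IndexTimer {

    private static final Logger LOG = Logger.getLogger(IndexTimer.class.getName());

    /** Times a placeholder task and reports the duration through the logger. */
    public static long buildIndex() {
        long start = System.nanoTime();
        // ... placeholder for the actual index construction ...
        long elapsedMicros = (System.nanoTime() - start) / 1_000;

        // Instead of: System.out.println("Index generated in " + elapsedMicros + " us");
        LOG.log(Level.INFO, "Index generated in {0} us", elapsedMicros);
        return elapsedMicros;
    }

    public static void main(String[] args) {
        buildIndex();
    }
}
```

With the default java.util.logging configuration the message goes to stderr via the console handler, and callers can silence it through the logging configuration or by raising the logger's level, which is not possible with a direct System.out.println.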

KonradHoeffner commented 1 year ago

Sure. My use case is extending an existing benchmark suite for RDF libraries, which does not yet cover HDT, with measurements of RDF libraries that use HDT. We are writing an HDT library in Rust and want to know how its performance compares to the existing libraries, both with and without HDT.

If you are interested, you can see the plots for the Jupyter Notebook here.

Because the libraries are written in many different programming languages, the benchmark is structured like this: a Python program recognizes different tools and tasks and, depending on the tool and task selected, runs that tool from its subfolder multiple times for each dataset size. On each run, the benchmarking subprogram for that tool prints one line of CSV output to stdout, and the Python program merges those lines into one CSV file per tool. Then you can start JupyterLab, generate the plots, and see the scores.

I could modify everything to use files instead, but it would require a large amount of refactoring and does not fit well, because each program is run many times and the benchmarking suite combines all the different results.

P.S.: Oh hi Dennis, it seems you are everywhere :-)
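For readers unfamiliar with this kind of setup, the one-line-per-run stdout protocol described in this comment could look roughly like the following on the side of a Java subprogram. This is a sketch under assumptions: the class BenchmarkRow, the method csvRow, and the column layout (tool, task, dataset size, seconds) are hypothetical and not taken from the actual benchmark suite.

```java
import java.util.Locale;

public class BenchmarkRow {

    /** Formats one CSV row; Locale.ROOT keeps the decimal point locale-independent. */
    public static String csvRow(String tool, String task, int size, double seconds) {
        return String.format(Locale.ROOT, "%s,%s,%d,%.6f", tool, task, size, seconds);
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // ... placeholder for the measured task ...
        double seconds = (System.nanoTime() - start) / 1e9;

        // The harness expects exactly one line on stdout per run, so any
        // additional output from a library would end up in the merged CSV.
        System.out.println(csvRow("hdt-java", "load", 1000, seconds));
    }
}
```

Under such a protocol, every extra line a library prints to stdout becomes a malformed row in the merged file, which is why the direct System.out.println calls are a problem here.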

D063520 commented 1 year ago

@ate47 thank you for this quick fix. @KonradHoeffner you can check out dev, compile it, and run your tests.

mielvds commented 1 year ago

an HDT implementation in Rust 😍 (so also python?)

KonradHoeffner commented 1 year ago

@mielvds: Yes! I was looking for an HDT library for Rust last year and was surprised that there wasn't any on crates.io. However @timplication had one on GitHub and he allowed me to continue it under an open license.

You can try it at https://github.com/konradhoeffner/hdt or find it at https://crates.io/crates/hdt. It's still under development though and doesn't have all the functions of the C++ and Java versions, i.e. it only supports the default triple order and the default HDT variant. But all the triple pattern querying and the indexes are there.

I haven't directly interfaced Rust with Python though; the Python script just executes the binary.