Allow "assume sorted" option

rdfio / rdf2smw

Convert RDF to Semantic MediaWiki facts in MediaWiki XML format, with a standalone commandline tool

MIT License

Allow "assume sorted" option #5

Open samuell opened 8 years ago

samuell commented 8 years ago

If we can assume the input is sorted, we could use a slightly different processing algorithm that requires far less memory and is probably faster. For N-Triples files, the sorting itself should be much more efficient with a pure text-sorting tool anyway.

thiviyanT commented 6 years ago

By default, does rdf2smw (pre-release 0.6 version) have this option set to true? If so, it would explain why my triples were not imported as I expected.

Before running the rdf2smw script, one could always sort the data using the Unix sort command. I did this with the command below, and the process was astonishingly fast: a matter of seconds on more than 95K triples.

cat triples.nt | sort -k2,2 -k1,1 > sorted.triples.nt

@samuell Would it be possible to incorporate a similar unix command into rdf2smw? Do you think doing so would dramatically impact the performance of the code?

samuell commented 6 years ago

Thanks for the interesting suggestion, @ThiviyanThanapalasingam!

I think including the unix sort command would make the software drastically more complex (because of interfacing between Go and C-code), and harder to maintain, though.

But since the sort command is so widely available, on Linux, Mac, and now even on Windows with the Windows Subsystem for Linux (WSL), one could enable a workflow where the user first sorts the file using sort and then runs rdf2smw. rdf2smw could then implement a potentially faster algorithm, or at least one that uses a lot less memory, because it can assume sorted input. For example, when aggregating triples per subject, which is done internally, the triples for each subject will already be grouped together in the input. rdf2smw can therefore finish each wiki page as soon as all triples for that subject have been processed, instead of keeping all the triples and pages in memory until the end.

thiviyanT commented 6 years ago

I see. Thanks for the explanation, @samuell. In that case, it would be a good idea to let the OS do the heavy lifting. The command (sort triples.nt -k2,2 -k1,1 > sorted.triples.nt) can be added to the README file, right under the Usage header text. Do you think doing this solves the issue once and for all?

If you are happy with the changes that I have proposed, I would like to contribute to this project by implementing them. Please let me know what the protocol is for contributing (i.e., do I work on the master branch and then send you a pull request?).

samuell commented 6 years ago

Thanks for the input, @ThiviyanThanapalasingam! I'll look at including that in the README shortly.

Regarding contributing: awesome, that is very welcome!

I think I should set up a develop branch, and keep only released code in master, going forward. So, if you start working, you could create a new develop branch in your repo, and I'll sort out the develop branch on my side shortly.

samuell commented 6 years ago

"and have released code in master, for the future"

This is recommended for Go packages, since Go lacks an official dependency manager, and most people just pull in the master branch of libraries :)