rdfhdt / hdt-cpp

HDT C++ Library and Tools

Add option to ignore error instead of throwing error #260

Open mhoangvslev opened 2 years ago

mhoangvslev commented 2 years ago

While working with dirty data, I realised that being able to skip bad rows when parsing RDF is very useful. This feature was suggested in issue #117 but was met with strong opposition. I would like to bring it up one more time, in the hope that attitudes have changed since.

The program should give the option to warn-instead-of-error for these reasons:

  1. I know that the errors are minor and am willing to drop those faulty triples.
  2. I want to go all the way through first, get the list of all lines with errors, and bulk-edit my huge RDF file (579GB) instead of fixing the errors one by one. When the faulty triples are at the end of the file, it's just painful and takes a lot of dev time.

mielvds commented 2 years ago

I think this is something for the SERD parser, rather than HDT, no?

mhoangvslev commented 2 years ago

From the user's pov, I don't see an option for it. Can you give me a hint?

drobilla commented 2 years ago

serd already has a lax parsing mode for roughly this purpose, although (as you might expect) things can go horribly wrong with syntactically invalid Turtle or TriG documents and drop a ton of data on the floor. It works fine for line-based formats like NTriples and NQuads though.
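To illustrate why skipping works well for line-based formats: each N-Triples statement sits on its own line, so a bad line can be dropped without affecting how the following lines are read. Below is a minimal Python sketch of that idea. It is not serd's implementation; the regular expression is a deliberately simplified approximation of the N-Triples grammar, and `filter_ntriples` is a hypothetical helper name.

```python
import re

# Simplified N-Triples terms. ASSUMPTION: this is an approximation for
# illustration only, not the full grammar and not what serd does internally.
IRI = r'<[^<>"{}|^`\\\s]*>'
BNODE = r'_:[A-Za-z0-9_][A-Za-z0-9_.-]*'
LITERAL = r'"(?:[^"\\\n]|\\.)*"(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?'
TRIPLE = re.compile(
    rf'^\s*(?:{IRI}|{BNODE})\s+{IRI}\s+(?:{IRI}|{BNODE}|{LITERAL})\s*\.\s*(?:#.*)?$'
)


def filter_ntriples(src, dst):
    """Copy well-formed lines from src to dst; return [(lineno, text)] for the rest.

    src is any iterable of lines (e.g. an open file), dst any object with write().
    Lines are streamed, so only the rejected lines are kept in memory.
    """
    bad = []
    for lineno, line in enumerate(src, start=1):
        text = line.rstrip("\n")
        stripped = text.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments carry no statement
        if TRIPLE.match(text):
            dst.write(text + "\n")
        else:
            bad.append((lineno, text))  # skip, but remember where it was
    return bad
```

Because a failure never carries over to the next line, the whole file is scanned in one pass and every rejected line is reported together. Turtle and TriG do not have this property: a statement can span many lines, so recovering after an error is much less reliable.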

mhoangvslev commented 2 years ago

Let's consider my second point. I am willing to fix the bugs, and I want the full list of bugs to fix up front instead of going through launch-fix-launch cycles.

drobilla commented 2 years ago

@mhoangvslev You could use serdi on the command line to strip the bad triples out yourself before loading the data. It uses the same parser, so it should encounter the same errors as hdt-cpp but be much quicker to use as a tool for this. With lax parsing (-l) it should print all the errors encountered in one run.

I usually do this from a text editor with a compilation mode that understands GCC warning syntax (vim, emacs, etc.) so you can jump immediately to each error.
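For a file too large to open in an editor, the same one-pass workflow could be scripted: capture the parser's diagnostics in a log, then extract every reported position for bulk editing. The Python sketch below assumes each diagnostic contains a GCC-style `path:line:column:` position somewhere in the line; the exact message layout is an assumption and should be checked against real output. `collect_error_lines` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# ASSUMPTION: diagnostics carry a GCC-style "path:line:col:" position.
# The pattern does not handle paths that contain spaces or colons.
POSITION = re.compile(r'(?P<file>[^\s:]+):(?P<line>\d+):(?P<col>\d+):')


def collect_error_lines(log_lines):
    """Map each file name to the sorted, de-duplicated line numbers reported for it."""
    found = defaultdict(set)
    for entry in log_lines:
        match = POSITION.search(entry)
        if match:  # lines without a position are ignored
            found[match.group("file")].add(int(match.group("line")))
    return {name: sorted(numbers) for name, numbers in found.items()}
```

The resulting line numbers can then drive a single bulk edit (or a single filtering pass) over the data, rather than one parser run per error.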