trungdong / prov

A Python library for W3C Provenance Data Model (PROV)
http://prov.readthedocs.io/
MIT License

PROV-N deserialization? #122

Open MarcelPa opened 6 years ago

MarcelPa commented 6 years ago

Hello, is a PROV-N deserializer anywhere on the implementation roadmap? If not, I would like to contribute one, if that is of interest, of course.

trungdong commented 6 years ago

Thanks, @MarcelPa. That'd be fab!

I've been thinking about doing this, but haven't found the time for it. @TomasKulhanek recently wrote an ANTLR grammar for PROV-N, which I believe can be used to build a PROV-N parser.

Are you interested in working on that? I'd be very happy to help with testing/integration when needed.

MarcelPa commented 6 years ago

Great, I would like to work on that then. The ANTLR grammar will surely be really helpful for that, thanks @trungdong (and @TomasKulhanek of course). I just forked the repo and will start to look into ANTLR; I hope to start pushing to the forked repo to the development branch starting next week. Will keep you posted :-)

trungdong commented 6 years ago

Excellent! Thanks a lot, @MarcelPa.

MarcelPa commented 6 years ago

Quick question regarding testing: is there a pattern for creating test files like those found under tests/rdf? My approach would be to copy the RDF documents and translate them into PROV-N documents step by step.

FYI: ANTLR works fine so far, I got rid of raising NotImplementedError to successfully run all test cases in my virtual python environment.

trungdong commented 6 years ago

There is an extensive suite of round-trip conversion tests that you can use right away. See test_json.py for an example. The following test code will do:

class RoundTripPROVNTests(RoundTripTestCase, AllTestsBase):
    FORMAT = 'provn'

BTW, could you develop from the dev branch, please? I've been reorganising the directory structure there and will update it in the next release. Cheers!

MarcelPa commented 5 years ago

It has been quite some time since I last pushed to my forked repo, so here is an update via this issue: right now, the ANTLR grammar seems to be erroneous in some cases, such as langtags. Unfortunately, I am not an expert in grammars, but I hope to get my head around them soon. I think these errors can be solved by reordering the lexer rules of the grammar; I will test soon whether this helps.
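Why reordering lexer rules can matter: an ANTLR lexer takes the longest match, and when two rules match the same text, the rule defined first wins. The toy lexer below is only a sketch of that tie-breaking behaviour in plain Python. The rule names and patterns are hypothetical and are not taken from the actual PROV-N grammar.

```python
import re


def tokenize(text, rules):
    """Toy maximal-munch lexer: longest match wins, ties go to the earlier rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' means an earlier rule keeps a tie on match length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Hypothetical rules: both patterns can match a run of digits.
NAME_FIRST = [("NAME", r"[A-Za-z0-9]+"), ("INT_LITERAL", r"[0-9]+")]
INT_FIRST = [("INT_LITERAL", r"[0-9]+"), ("NAME", r"[A-Za-z0-9]+")]

print(tokenize("10", NAME_FIRST))  # the digits are labelled NAME
print(tokenize("10", INT_FIRST))   # the same digits are labelled INT_LITERAL
```

With the more general rule listed first, the specific rule can never win a tie, which is one way a grammar can fail on input that looks valid.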

trungdong commented 5 years ago

Thanks for the update, @MarcelPa. Unfortunately, I won't be of much help on ANTLR.

The PROV-N spec does define grammar rules, which you might find useful.

MarcelPa commented 5 years ago

Hey, after quite a while I finally had some spare time to spend on this. I modified the grammar a little (basically just reordered some rules), and now it seems to work properly :-) I am down to 13 test cases that fail or error. My next step will be to rebase onto the newest commit of the dev branch and keep developing. For now, the failing test cases seem to be caused by incorrectly parsed float values from typed literals. What would you expect to be inserted into the attributes of an expression? Something like

Literal(somevalue, datatype="xsd:float", langtag=someLangtag)

or parsed native value, like

float(somevalue)

I think I have used those a bit inconsistently so far. I will need to refactor this one way or another ;-)

trungdong commented 5 years ago

Thank you for the update and the work, @MarcelPa!

float(value) should work, I think. Do you have an example of a problematic case?

BTW, a Python float value is mapped to xsd:double by the package though.
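One possible way to stay consistent here is to convert to a native value only when the datatype survives a round trip, and to keep the declared datatype otherwise. The sketch below assumes exactly that policy. TypedLiteral, NATIVE_PARSERS and to_attribute_value are hypothetical names for illustration, not part of the prov package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypedLiteral:
    """Hypothetical stand-in for a literal that keeps its declared datatype."""
    value: str
    datatype: Optional[str] = None
    langtag: Optional[str] = None


# Assumption: only datatypes that map back to themselves are parsed natively.
# xsd:float is left out because a Python float is written back as xsd:double.
NATIVE_PARSERS = {"xsd:double": float, "xsd:string": str}


def to_attribute_value(value, datatype=None, langtag=None):
    """Return a native Python value when that is lossless, else a TypedLiteral."""
    if langtag is None and datatype in NATIVE_PARSERS:
        return NATIVE_PARSERS[datatype](value)
    if langtag is None and datatype is None:
        return value  # plain string literal, nothing to preserve
    return TypedLiteral(value, datatype, langtag)


print(to_attribute_value("1.0", "xsd:double"))
print(to_attribute_value("1.0", "xsd:float"))
print(to_attribute_value("un lieu", langtag="fr"))
```

The trade-off is that xsd:float values stay as strings with a datatype attached, so the lexical form "1.0" is kept as written.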

MarcelPa commented 5 years ago

I do: running test_entity_with_multiple_attribute fails. The two outputs are almost identical. Here is the parsed data from a debug print:

document
  prefix ex <http://example.org/>
  prefix ex_1 <http://example4.org/>

  entity(ex:emov, [ex:v_0="un lieu", ex:v_1="un lieu"@fr, ex:v_2="a place"@en, ex:v_3=1, ex:v_4=1, ex:v_5="1" %% xsd:short, ex:v_6="2" %% xsd:float, ex:v_7="1" %% xsd:float, ex:v_8="10" %% xsd:decimal, ex:v_9="1" %% xsd:boolean, ex:v_10="0" %% xsd:boolean, ex:v_11="10" %% xsd:byte, ex:v_12="10" %% xsd:unsignedInt, ex:v_13="10" %% xsd:unsignedLong, ex:v_14="10" %% xsd:integer, ex:v_15="10" %% xsd:unsignedShort, ex:v_16="10" %% xsd:nonNegativeInteger, ex:v_17="-10" %% xsd:nonPositiveInteger, ex:v_18="10" %% xsd:positiveInteger, ex:v_19="10" %% xsd:unsignedByte, ex:v_20="http://example.org" %% xsd:anyURI, ex:v_21="http://example.org" %% xsd:anyURI, ex:v_22='ex:abc', ex:v_23='ex:abcd', ex:v_24='ex_1:zabc', ex:v_25='ex_1:zabcd', ex:v_26="2019-03-27T12:52:02.266484" %% xsd:dateTime, ex:v_27="2019-03-27T12:52:02.266486" %% xsd:dateTime])
endDocument

versus the testcase data:

document
  prefix ex <http://example.org/>
  prefix ex_1 <http://example4.org/>

  entity(ex:emov, [ex:v_0="un lieu", ex:v_1="un lieu"@fr, ex:v_2="a place"@en, ex:v_3=1, ex:v_4=1, ex:v_5="1" %% xsd:short, ex:v_6="2" %% xsd:float, ex:v_7="1.0" %% xsd:float, ex:v_8="10" %% xsd:decimal, ex:v_9="1" %% xsd:boolean, ex:v_10="0" %% xsd:boolean, ex:v_11="10" %% xsd:byte, ex:v_12="10" %% xsd:unsignedInt, ex:v_13="10" %% xsd:unsignedLong, ex:v_14="10" %% xsd:integer, ex:v_15="10" %% xsd:unsignedShort, ex:v_16="10" %% xsd:nonNegativeInteger, ex:v_17="-10" %% xsd:nonPositiveInteger, ex:v_18="10" %% xsd:positiveInteger, ex:v_19="10" %% xsd:unsignedByte, ex:v_20="http://example.org" %% xsd:anyURI, ex:v_21="http://example.org" %% xsd:anyURI, ex:v_22='ex:abc', ex:v_23='ex:abcd', ex:v_24='ex_1:zabc', ex:v_25='ex_1:zabcd', ex:v_26="2019-03-27T12:52:02.266484" %% xsd:dateTime, ex:v_27="2019-03-27T12:52:02.266486" %% xsd:dateTime])
endDocument

The difference is at ex:v_7: the test case has ex:v_7="1.0" %% xsd:float, which is parsed as a float but serialized back as ex:v_7="1" %% xsd:float.

So far, I did not notice any changes happening from float to double.

pohutukawa commented 4 years ago

@MarcelPa Just a quick question: what is the status of the PROV-N deserialiser? It's been a good year, and it looked like things weren't far off.

MarcelPa commented 4 years ago

Oh my, I completely lost track of this issue, thanks for the reminder @pohutukawa! I will rebase later today and give a status update; if I recall correctly, I was "stuck" editing the ANTLR PROV-N grammar. Will keep you posted :-)

MarcelPa commented 4 years ago

I am back at finding out how antlr4 works (any help is appreciated!). For reasons I do not yet understand, langtags and some int_literals fail to parse, which gives me 57 failures out of 185 unit tests. Once I figure out how to fix that, PROV-N deserialization should be close to completion.

pohutukawa commented 4 years ago

That looks promising! Even if there are some "glitches" as in the comment above (where a float is parsed to ex:v_7="1" %% xsd:float), I'd be fully happy, as the value and its type are still preserved, and only the formatting (to 1.0) is lost.
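To make "value preserved, formatting lost" concrete, a comparison could treat numeric literals as equal when their parsed values match, rather than comparing lexical forms. This is only a sketch under that assumption. NUMERIC_PARSERS and literals_equivalent are hypothetical names and are not how the prov test suite compares documents.

```python
from decimal import Decimal

# Assumption: these datatypes are compared by parsed value, all others by text.
NUMERIC_PARSERS = {
    "xsd:float": float,
    "xsd:double": float,
    "xsd:decimal": Decimal,
}


def literals_equivalent(a, b):
    """Compare two (lexical_form, datatype) pairs by value where possible."""
    (lex_a, type_a), (lex_b, type_b) = a, b
    if type_a != type_b:
        return False  # a changed datatype is a real difference
    parse = NUMERIC_PARSERS.get(type_a)
    if parse is None:
        return lex_a == lex_b
    return parse(lex_a) == parse(lex_b)


print(literals_equivalent(("1.0", "xsd:float"), ("1", "xsd:float")))
print(literals_equivalent(("1", "xsd:float"), ("1", "xsd:double")))
```

Under this rule the ex:v_7 case above counts as a match, while a switch from xsd:float to xsd:double would still be reported.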

ChrisJMacdonald commented 3 years ago

Hi @MarcelPa, I'm wondering if you've made any progress on this deserializer? I'd like to work more with PROV-N, but I'm quite limited without the ability to store and extract from PROV-N strings. I've been looking at the code and the tools to see if I could help, but it's a little beyond me at this stage. Thanks!

ChrisJMacdonald commented 3 years ago

I also found a mildly hacky way to convert in and out of PROV-N using the Java ProvToolbox and provconvert: save the file as .provn, use provconvert to convert it to .json, and then use the Python deserialiser to get it back as a ProvDocument. Luc Moreau has also put up some information about the ANTLR3 grammar for PROV (here)
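The workaround might be scripted roughly as follows. This is a sketch that assumes provconvert is on the PATH and accepts -infile and -outfile, and that the prov package is installed. The helper names build_provconvert_command and load_provn_via_provconvert are hypothetical.

```python
import subprocess
from pathlib import Path


def build_provconvert_command(provn_path, json_path, executable="provconvert"):
    """Build the argument list for a PROV-N to PROV-JSON conversion."""
    return [executable, "-infile", str(provn_path), "-outfile", str(json_path)]


def load_provn_via_provconvert(provn_path, executable="provconvert"):
    """Convert a .provn file to .json next to it, then load it with prov."""
    # Imported here so the command builder works without prov installed.
    from prov.model import ProvDocument

    json_path = Path(provn_path).with_suffix(".json")
    command = build_provconvert_command(provn_path, json_path, executable)
    subprocess.run(command, check=True)  # raises if provconvert fails
    with open(json_path) as stream:
        return ProvDocument.deserialize(stream, format="json")


print(build_provconvert_command("example.provn", "example.json"))
```

Calling load_provn_via_provconvert("example.provn") would then return a ProvDocument, at the cost of a Java dependency and an intermediate file on disk.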

pohutukawa commented 3 years ago

We've been trying out @MarcelPa's feature branch, which can parse PROV-N, with decent results so far. However, it's not yet based on the current 2.0 version, which is a bit of a pity.

If the ANTLR3 grammar by Luc is more complete, would that be an option to move forward on? (Even though it may be more "sexy" to use a current ANTLR4 grammar.) After all, there is an antlr3_python_runtime Python module as well.

I'm just searching for ways to not create any inconsistencies between individual approaches, for the case that the ANTLR4 grammar may differ from the ANTLR3 one ...