sapa / performing-arts-ch-templates

Regular DUMP #482

Closed: b2d3c closed this issue 5 months ago

b2d3c commented 9 months ago

We would like to offer the public access to a regular dump of the data from our platform: https://github.com/sapa/performing-arts-ch-dump

What frequency would be possible? Once per day / per week?

Could we offer it through GitHub?

In which format:

SAPAwiki: https://wiki.sapa.swiss/index.php?title=SPAP/DUMP

fkraeutli commented 8 months ago

We have a service that can make regular exports of specific (or all) graphs. We use it to make backups of data created within the Metaphactory platform. The code is here: https://github.com/swiss-art-research-net/sari-graph-backup.

It creates individual .trig files and commits them to a Git repo: https://github.com/swiss-art-research-net/bso-graph-backup

The advantage of this approach is that it can create incremental backups at record level. The disadvantage is that it creates a lot of files. Maybe a single dump is preferred; in that case I would also use a format that retains the named graphs, e.g. TriG or N-Quads.
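
For illustration (a minimal RDFLib sketch, not the sari-graph-backup code itself; the graph IRI is hypothetical), this is what "retaining the named graphs" means in practice:

    # Minimal RDFLib sketch: TriG and N-Quads keep the named graph of each
    # statement; a plain Turtle serialization would lose it.
    from rdflib import Dataset, Literal, URIRef

    ds = Dataset()
    g = ds.graph(URIRef("http://example.org/graph/record-1"))  # hypothetical graph IRI
    g.add((URIRef("http://example.org/s"), URIRef("http://example.org/p"), Literal("o")))

    print(ds.serialize(format="trig"))    # statements grouped under the graph IRI
    print(ds.serialize(format="nquads"))  # graph IRI as the fourth term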

What do you think?

b2d3c commented 8 months ago

Dear @tfrancart, could you check Florian's proposal and say whether it would work for you?

tfrancart commented 8 months ago

  1. Yes, prefer single files. Could it be just a matter of merging the individual files as a post-process? (See the sketch after this list.)
  2. If possible, provide both a non-graph format (Turtle) and a graph format (e.g. TriG or N-Quads); otherwise, just the graph format is fine.
  3. Provide links to the downloadable files on the documentation page of the portal.
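
As an illustration of point 1, the merge could be done with RDFLib roughly like this (a hedged sketch with hypothetical paths, not the actual backup tooling; it assumes all files fit in memory together):

    # Merge per-record TriG files into a single TriG dump (post-processing step).
    import glob
    from rdflib import Dataset

    merged = Dataset()
    for path in sorted(glob.glob("backup/**/*.trig", recursive=True)):  # hypothetical layout
        merged.parse(path, format="trig")  # quads keep their named graphs

    merged.serialize(destination="dump.trig", format="trig")
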
fkraeutli commented 7 months ago

I changed the backup mechanism to produce a TTL and an N-Quads dump (and zip them). I can probably make it available through the platform. Or is there a suitable place where it could be uploaded automatically?

b2d3c commented 7 months ago

Great! For us it doesn't matter. What are the best practices? GitHub?

Do it whichever way is easiest for you. We can change it in the future if necessary.

fkraeutli commented 7 months ago

The dump is currently running and should appear at

tfrancart commented 7 months ago

I am getting a 404 error when I click on both links. Is it because it is on the dev server? In terms of access, I would suggest creating one additional editorial page, "Data download", in the menu, with one paragraph of explanation and the link(s) to download the data.

fkraeutli commented 7 months ago

It seems the copying step didn't execute. I'll look into it. I've copied it manually for now, and the links should be working.

b2d3c commented 7 months ago

Dear @tfrancart ,

I have created an issue for the "Data Download" page. See https://github.com/sapa/performing-arts-ch-templates/issues/548

Florian will make the changes.

It seems the copying step didn't execute. I'll look into it. I've copied it manually for now, and the links should be working.

Could you test it again?

tfrancart commented 7 months ago

Could you test it again?

I still get a 404 error when I click on both links provided above.

fkraeutli commented 6 months ago

I'm still not quite sure what's causing them to disappear. Are these available?

tfrancart commented 6 months ago

I'm still not quite sure what's causing them to disappear. Are these available?

Yes.

Is it possible for you to add some prefixes to the .ttl file so that it is smaller and can be handled more easily? I would suggest adding the following prefixes:

Also I see that the dump contains the "container" data, like this:

<http://data.performing-arts.ch/a/04052cd6-1ba2-48f3-8a51-95fefb798c7c/container> <http://www.w3.org/ns/prov#generatedAtTime> "2021-06-07T12:39:05.434Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://data.performing-arts.ch/a/04052cd6-1ba2-48f3-8a51-95fefb798c7c/container> <http://www.w3.org/ns/prov#wasAttributedTo> <http://www.metaphacts.com/resource/user/migration> .
<http://data.performing-arts.ch/a/04052cd6-1ba2-48f3-8a51-95fefb798c7c/container> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/ldp#Resource> .
<http://data.performing-arts.ch/a/04052cd6-1ba2-48f3-8a51-95fefb798c7c/container> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> .

This is for traceability, but we don't really need it in the dump. Is there a way to exclude them?

fkraeutli commented 6 months ago

Do you know of a tool that can convert the TTL to prefixed form? The way the dump is produced, it only outputs full URIs.

I would keep the container data since it's crucial for the functioning of the platform, and if we're doing regular dumps I wouldn't exclude any information

tfrancart commented 6 months ago

Do you know of a tool that can convert the TTL to prefixed form? The way the dump is produced, it only outputs full URIs.

Any RDF library could do that (RDFLib if you are working with Python). On the command line, I think you could use the Jena command-line tools: https://jena.apache.org/documentation/tools/index.html. A possibility could be:
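
As an illustration of the RDFLib route (hypothetical file names; the prefixes shown are examples only, and an in-memory parse can struggle with dumps this size, as the next comment notes):

    # Re-serialize a full-URI Turtle dump with prefix declarations (RDFLib).
    # Caveat: this parses the whole file into memory.
    from rdflib import Graph

    g = Graph()
    g.parse("dump.ttl", format="turtle")
    g.bind("crm", "http://www.cidoc-crm.org/cidoc-crm/")
    g.bind("olo", "http://purl.org/ontology/olo/core#")
    g.serialize(destination="dump-prefixed.ttl", format="turtle")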

I would keep the container data since it's crucial for the functioning of the platform, and if we're doing regular dumps I wouldn't exclude any information

OK!

fkraeutli commented 6 months ago

Thanks! Yes, RDFLib can do that, but with dumps of that size things tend to break. Just checked: Raptor might be able to do it too.

fkraeutli commented 6 months ago

I added prefixes:

b2d3c commented 6 months ago

FYI @tfrancart

Thanks! Yes, RDFLib can do that, but with dumps of that size things tend to break. Just checked: Raptor might be able to do it too.

I added prefixes:

* https://www.dev.performing-arts.ch/assets/no_auth/dump.ttl.gz

* https://www.dev.performing-arts.ch/assets/no_auth/dump.nq.gz

tfrancart commented 6 months ago

Well, the links are broken again... @fkraeutli ?

fkraeutli commented 6 months ago

Switching to uploading to Git

fkraeutli commented 6 months ago

Data gets uploaded here: https://github.com/sapa/performing-arts-ch-dump

tfrancart commented 6 months ago

The dump files are super small; I think they got corrupted or were not properly committed/pushed:

They cannot be opened

fkraeutli commented 6 months ago

Ah, the part that adds the prefixes was still pointing to the old path, so it didn't catch the files. I changed it and am re-running the entire pipeline.

fkraeutli commented 6 months ago

Prefixing step didn't complete, but upload seems correct now

tfrancart commented 6 months ago

No, files are still tiny and cannot be opened.

fkraeutli commented 6 months ago

Looks alright at my end. They are stored with Git LFS. Might that be the issue?

[screenshot dated 2024-04-05]

tfrancart commented 6 months ago

Thanks. Somehow a git pull does not retrieve the file. I downloaded dump.ttl.gz directly and opened it, and it seems it still does not have prefixes included?

fkraeutli commented 6 months ago

Indeed. There was an error in the script. The prefixes are now present

tfrancart commented 6 months ago

Thanks. After looking at the file, a number of prefixes are incorrect, most of them because they are missing the final "/" in the prefix declaration:

Please also include these prefixes:

Thanks

fkraeutli commented 6 months ago

Thanks for the feedback! I added the prefixes and am re-running the pipeline

tfrancart commented 6 months ago

Looks good. We may benefit from more prefixes:

This will be the last iteration on this :-) thanks

b2d3c commented 6 months ago

FYI @tfrancart: Florian will try to generate the dump without the user information (http://www.metaphacts.com/resource/user/).
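
For illustration, such a filter could run as a streaming pass over the N-Quads dump (a hedged sketch, not necessarily how the actual pipeline does it):

    # Drop every statement mentioning the Metaphactory user namespace,
    # streaming line by line so memory use stays constant.
    import sys

    USER_NS = "<http://www.metaphacts.com/resource/user/"

    def filter_dump(in_path, out_path):
        kept = dropped = 0
        with open(in_path, encoding="utf-8") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                if USER_NS in line:
                    dropped += 1
                else:
                    dst.write(line)
                    kept += 1
        return kept, dropped

    if __name__ == "__main__":
        print("kept %d, dropped %d" % filter_dump(sys.argv[1], sys.argv[2]))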

tfrancart commented 6 months ago

I am getting a character-encoding issue when parsing the dump.ttl file with Jena:

08:25:16.804 [main] ERROR org.apache.jena.riot -%PARSER_ERROR[kvp]- [line: 9399173, col: 9 ] Bad character encoding
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 9399173, col: 9 ] Bad character encoding
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:153)
    at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
    at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
    at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:317)
    at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:178)
    at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
    at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
    at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
    at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
    at org.apache.jena.riot.RDFParser.read(RDFParser.java:353)
    at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:343)
    at org.apache.jena.riot.RDFParser.parse(RDFParser.java:292)
    at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:540)

The line number does not correspond to any problematic character I can see. Could you confirm:

  1. on which OS the files are generated (Linux, Windows, Mac)?
  2. whether you have any idea of the specific encoding used to write the file?

Note: I am running Linux here, and as far as I understand, the file is parsed using UTF-8 character encoding.
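
One way to cross-check the error independently of Jena is to scan the raw bytes for the first invalid UTF-8 sequence (a hedged sketch; its line count is based on newline bytes, so it may differ slightly from the parser's):

    # Report the first line whose bytes are not valid UTF-8.
    # Splitting on b"\n" is safe: a UTF-8 multi-byte sequence never contains 0x0A.
    import sys

    def find_bad_line(path):
        with open(path, "rb") as f:
            for lineno, raw in enumerate(f, start=1):
                try:
                    raw.decode("utf-8")
                except UnicodeDecodeError as e:
                    context = raw[max(0, e.start - 20):e.start + 20]
                    return lineno, e.start, context
        return None

    if __name__ == "__main__":
        hit = find_bad_line(sys.argv[1])
        print("no invalid UTF-8 found" if hit is None
              else "line %d, byte %d: %r" % hit)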

fkraeutli commented 6 months ago

I cannot detect any encoding problems at that line either. Does it work if you remove it?

The file has been generated using Raptor RDF under Linux.

tfrancart commented 5 months ago

Hello @fkraeutli, I will wait until new versions of the dumps are generated without user information, and then I will make further tests; a reduced file size could make things a little easier.

fkraeutli commented 5 months ago

Hi @tfrancart, the latest dump should not include user information

tfrancart commented 5 months ago

Hi @tfrancart, the latest dump should not include user information

Thank you. On https://github.com/sapa/performing-arts-ch-dump the dump files date from 3 weeks ago. I suppose I should see a more recent update date?

fkraeutli commented 5 months ago

If you're happy with the format I can deploy it to prod and run it weekly via a cronjob. Currently this is the Dev data

tfrancart commented 5 months ago

If you're happy with the format I can deploy it to prod and run it weekly via a cronjob. Currently this is the Dev data

But I don't know where to test this format so I can tell you whether I am happy or not :-) The only dumps I can test are the ones on https://github.com/sapa/performing-arts-ch-dump

tfrancart commented 5 months ago

Dear @fkraeutli

Just to clarify: Baptiste and I expect the following:

  1. Add the few new prefixes mentioned in the comment above
  2. Remove the user information from the dump
  3. Run the dump script on the prod database
  4. Publish the results of the dumps on https://github.com/sapa/performing-arts-ch-dump
  5. Then, I will do the following:
    1. Test that the dump imports successfully into GraphDB
    2. Test that the dump can be validated with our SHACL specification

Thanks !

fkraeutli commented 5 months ago

Sounds good, I will move the dump scripts to production

fkraeutli commented 5 months ago

@tfrancart The dump has now been produced from the prod database.

tfrancart commented 5 months ago

Dear @fkraeutli, the file dump.ttl looks corrupted and cannot be parsed:

thomas@georges-courteline:~/sparna/00-Clients/SAPA/01-Modelisation/23-dumps$ tail dump.ttl
x:a40c5f36-bbd0-4b20-9d1e-262f3f8bba9b
    crm:P82a_begin_of_the_begin "1973-11-22"^^xsd:date ;
    crm:P82b_end_of_the_end "1973-11-22"^^xsd:date ;
    a crm:E52_Time-Span ;
    rdfs:label "22.11.1973" .

x:a40c64b0-de16-4b63-9e55-82b441bbbf33
    <http://purl.org/ontology/olo/core#index> 0 ;
    crm:P14_carried_out_by a:e68a6f18-5877-4bec-835a-0261dcf07f71 ;
    crm:P2_has_typethomas@georges-courteline:~/sparna/00-Clients/SAPA/01-Modelisation/23-dumps$ 

As you can see, the file ends abruptly after crm:P2_has_type, with no object or statement terminator. Parsing fails:

08:58:51.166 [main] ERROR org.apache.jena.riot -%PARSER_ERROR[kvp]- [line: 9544050, col: 17] Bad character encoding
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 9544050, col: 17] Bad character encoding

It seems the .nq file has the same issue:

thomas@georges-courteline:~/sparna/00-Clients/SAPA/01-Modelisation/23-dumps$ tail dump.nq
<http://data.performing-arts.ch/x/c96cfc8c-4184-4ad4-bee2-80ade72e1609> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E7_Activity> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/c9caeaaa-3db3-490b-8d92-ab559e2bb666> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E42_Identifier> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/c9caeaaa-3db3-490b-8d92-ab559e2bb666> <http://www.w3.org/1999/02/22-rdf-syntax-ns#value> "600011994091601" <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd0884e0-43c3-489c-8bd3-e9a95101344f> <http://www.cidoc-crm.org/cidoc-crm/P14_carried_out_by> <http://data.performing-arts.ch/a/3aa6aa64-61b1-406c-a254-a72607bc8148> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd0884e0-43c3-489c-8bd3-e9a95101344f> <http://www.cidoc-crm.org/cidoc-crm/P2_has_type> <http://vocab.performing-arts.ch/mulga> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd0884e0-43c3-489c-8bd3-e9a95101344f> <http://purl.org/ontology/olo/core#index> "0"^^<http://www.w3.org/2001/XMLSchema#integer> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd0884e0-43c3-489c-8bd3-e9a95101344f> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E7_Activity> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd9c59bd-a91a-4b56-bd76-56e2b48f2adf> <http://www.cidoc-crm.org/cidoc-crm/P14_carried_out_by> <http://data.performing-arts.ch/u/f8072a4a-a997-4276-904f-7fa773fb6738> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd9c59bd-a91a-4b56-bd76-56e2b48f2adf> <http://www.cidoc-crm.org/cidoc-crm/P2_has_type> <http://vocab.performing-arts.ch/muecs> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd9c59bd-a91a-4b56-bd76-56e2b48f2adf> <http://thomas@georges-courteline:~/sparna/00-Clients/SAPA/01-Modelisation/23-dumps$ 

As you can see at the end, it stops abruptly at <http://. Parsing fails with this error:

09:12:41.209 [main] ERROR org.apache.jena.riot -%PARSER_ERROR[kvp]- [line: 7732834, col: 81] Broken IRI (End of file)
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 7732834, col: 81] Broken IRI (End of file)

I tried to re-download and re-unzip both files, without luck.
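
For what it's worth, a quick tail check can catch this kind of truncation before publishing (a hedged sketch; it inspects only the final 64 KiB and assumes a complete dump ends with a "."-terminated statement, as N-Quads and this style of Turtle do):

    # Sanity check: a complete N-Quads/N-Triples dump ends with a "." statement.
    import sys

    def last_nonempty_line(path):
        with open(path, "rb") as f:
            f.seek(0, 2)                     # jump to end of file
            size = f.tell()
            f.seek(max(0, size - 65536))     # read only the final 64 KiB
            tail = f.read().decode("utf-8", errors="replace")
        lines = [l for l in tail.splitlines() if l.strip()]
        return lines[-1] if lines else ""

    if __name__ == "__main__":
        line = last_nonempty_line(sys.argv[1])
        print("last line:", line[:120])
        print("looks complete" if line.rstrip().endswith(".") else "looks truncated")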

fkraeutli commented 5 months ago

Thanks, I'll re-run the pipeline and check if the issue persists

fkraeutli commented 5 months ago

Issue persists. I'll investigate

fkraeutli commented 5 months ago

The issue occurs in the step that reformats the Turtle file to include prefixes. It crashes because the file to process is too large. We'll have to omit the prefixing step.

b2d3c commented 5 months ago

Thanks @fkraeutli for the investigation.

Dear @tfrancart, we will have a size problem importing into GraphDB, and we will have to see how the SHACL analysis goes. Perhaps you can find other tools to add the prefixes after the dump is created?

tfrancart commented 5 months ago

The issue occurs in the step that reformats the Turtle file to include prefixes. It crashes because the file to process is too large. We'll have to omit the prefixing step.

This is unfortunate. If we cannot deal with this issue ourselves, then every potential reuser of the data will face the same issue. Could we try splitting the export into smaller files? E.g. create batches of 100,000 triples per file (or any other meaningful number that yields, say, 7 to 10 files).

I suggest you:

  1. Publish valid dump files without prefixes, if the prefixing is what causes the issue. Then I can make further tests with them, and maybe suggest ways to add prefixes to them.
  2. Investigate how to give more memory to Raptor, or split the dump files into chunks. (A streaming alternative for the prefixes is sketched below.)
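
To sketch the streaming alternative mentioned in point 2: because Turtle is a superset of N-Triples, prefixes can be added in constant memory by rewriting the triples dump line by line (a hedged sketch with an illustrative prefix list; it assumes plain local names and that no literal happens to contain a matching <namespace...> substring):

    # Streaming prefix compaction: replace full IRIs with prefixed names,
    # one line at a time, so file size is no longer a constraint.
    import re
    import sys

    PREFIXES = {  # illustrative subset; extend with the agreed prefixes
        "crm": "http://www.cidoc-crm.org/cidoc-crm/",
        "olo": "http://purl.org/ontology/olo/core#",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
    }

    RULES = [(re.compile(r"<%s([^>]*)>" % re.escape(ns)), prefix + r":\1")
             for prefix, ns in PREFIXES.items()]

    def compact(in_path, out_path):
        with open(in_path, encoding="utf-8") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            for prefix, ns in PREFIXES.items():
                dst.write("@prefix %s: <%s> .\n" % (prefix, ns))
            for line in src:
                for rx, repl in RULES:
                    line = rx.sub(repl, line)
                dst.write(line)

    if __name__ == "__main__":
        compact(sys.argv[1], sys.argv[2])
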
fkraeutli commented 5 months ago

I added a new volume with more space and am executing the pipeline there

fkraeutli commented 5 months ago

@tfrancart Can you check the latest dump?

tfrancart commented 5 months ago

Hello, the dump looks fine and loads successfully into GraphDB. It is not easy to do because of the size of the file, but it works. I now need to test SHACL validation on the same data. These files can be advertised by adding a link from the About page, or from a "developer's corner" page.
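
For the SHACL step, a hedged pySHACL sketch (hypothetical file names; memory may again be the constraint at this size):

    # Validate the dump against a SHACL shapes file with pySHACL.
    from pyshacl import validate
    from rdflib import Graph

    data = Graph().parse("dump.ttl", format="turtle")
    shapes = Graph().parse("sapa-shapes.ttl", format="turtle")  # hypothetical shapes file

    conforms, _report_graph, report_text = validate(data, shacl_graph=shapes)
    print("conforms:", conforms)
    print(report_text)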