We have a service that can make regular exports of specific (or all) graphs. We use it to make backups of data created within the Metaphactory platform. The code is here: https://github.com/swiss-art-research-net/sari-graph-backup.
It creates individual .trig files and commits them to a Git repo: https://github.com/swiss-art-research-net/bso-graph-backup
The advantage of this approach is that it can create incremental backups at record level. The disadvantage is that it creates a lot of files. Maybe a single dump is preferred; in that case I would also use a format that retains the named graphs, e.g. TriG or N-Quads.
What do you think?
Dear @tfrancart, could you check Florian's proposal and say whether it will be OK for you?
I changed the backup mechanism to produce a Turtle and an N-Quads dump (and zip them). I can probably make them available through the platform. Or is there a suitable place where they could be uploaded automatically?
Great! It doesn't matter to us. What are the best practices? GitHub?
Do it whichever way is easiest for you. We can change it in the future if necessary.
Dump is currently running and should appear at
I am getting a 404 error when I click on both links. Is it because it is on the dev server? In terms of access, I would suggest creating one additional editorial page, "Data download", in the menu, with one paragraph of explanation and the link(s) to download the data.
Seems like the copying step didn't execute. I'll look into it. I copied it manually now and the links should be working
Dear @tfrancart ,
I have created an issue for the "Data Download" page. See https://github.com/sapa/performing-arts-ch-templates/issues/548
Florian will make the changes.
Could you test it again?
I still have a 404 error when I click on both the links provided above.
I'm still not quite sure what's causing them to disappear. Are these available?
> I'm still not quite sure what's causing them to disappear. Are these available?
Yes.
Is it possible for you to add some prefixes in the TTL file so that it is smaller and can be handled more easily? I would suggest adding the following prefixes:
Also I see that the dump contains the "container" data, like this:
<http://data.performing-arts.ch/a/04052cd6-1ba2-48f3-8a51-95fefb798c7c/container> <http://www.w3.org/ns/prov#generatedAtTime> "2021-06-07T12:39:05.434Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://data.performing-arts.ch/a/04052cd6-1ba2-48f3-8a51-95fefb798c7c/container> <http://www.w3.org/ns/prov#wasAttributedTo> <http://www.metaphacts.com/resource/user/migration> .
<http://data.performing-arts.ch/a/04052cd6-1ba2-48f3-8a51-95fefb798c7c/container> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/ldp#Resource> .
<http://data.performing-arts.ch/a/04052cd6-1ba2-48f3-8a51-95fefb798c7c/container> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> .
This is for traceability, but we don't really need it in the dump. Is there a way to exclude it?
Do you know if there is a tool that can convert the TTL to prefixed? The way the dump is produced it only outputs full URIs.
I would keep the container data since it's crucial for the functioning of the platform, and if we're doing regular dumps I wouldn't exclude any information
> Do you know if there is a tool that can convert the TTL to prefixed? The way the dump is produced it only outputs full URIs.
Any RDF library could do that (RDFlib if you are working with Python). On the command line I think you could use the Jena command-line tools: https://jena.apache.org/documentation/tools/index.html. A possibility could be the riot command: https://jena.apache.org/documentation/io/#command-line-tools
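The conversion can also be sketched without a full RDF library. A minimal stdlib-only sketch, assuming the dump is line-based (N-Triples/N-Quads style full IRIs) and that local names are simple identifiers with nothing that would need escaping in Turtle prefixed names; the prefix map and function name here are illustrative:

```python
import re

# Illustrative prefix map; extend with the namespaces actually used in the dump.
PREFIXES = {
    "crm": "http://www.cidoc-crm.org/cidoc-crm/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def compact(line: str, prefixes: dict = PREFIXES) -> str:
    """Rewrite <namespace + local> IRIs as prefix:local.

    Naive sketch: assumes local names match [A-Za-z_][\\w-]* and therefore
    need no Turtle escaping; unknown namespaces are left untouched.
    """
    for pfx, ns in prefixes.items():
        line = re.sub(r"<" + re.escape(ns) + r"([A-Za-z_][\w\-]*)>",
                      pfx + r":\1", line)
    return line
```

The corresponding `@prefix` declarations would still have to be written at the top of the output file for the result to be valid Turtle.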
OK !
Thanks! Yes, RDFLib can do that, but with dumps of that size things tend to break. Just checked: Raptor might be able to do it too.
FYI @tfrancart
I added prefixes:
* https://www.dev.performing-arts.ch/assets/no_auth/dump.ttl.gz
* https://www.dev.performing-arts.ch/assets/no_auth/dump.nq.gz
Well, the links are broken again... @fkraeutli ?
Switching to uploading to Git
Data gets uploaded here: https://github.com/sapa/performing-arts-ch-dump
Dump files are very small; I think they got corrupted or were not properly committed/pushed:
They cannot be opened.
Ah, the part that adds the prefixes was still pointing to the old path, so it didn't catch the files. I changed it and am re-running the entire pipeline.
Prefixing step didn't complete, but upload seems correct now
No, files are still tiny and cannot be opened.
Looks alright at my end. They are stored with Git LFS. Might that be the issue?
Thanks. Somehow a git pull does not retrieve the file. I downloaded dump.ttl.gz directly and opened it, and it seems it still does not have prefixes included?
Indeed. There was an error in the script. The prefixes are now present
Thanks. After looking at the file, a number of prefixes are incorrect, most of them because they are missing the final "/" in the prefix declaration:
Please also include these prefixes:
Thanks
Thanks for the feedback! I added the prefixes and am re-running the pipeline
Looks good. We may benefit from more prefixes:
This will be the last iteration on this :-) thanks
FYI @tfrancart, Florian will try to generate the dump without the user information (http://www.metaphacts.com/resource/user/).
I am having a character-encoding issue when parsing the dump.ttl file with Jena:
08:25:16.804 [main] ERROR org.apache.jena.riot -%PARSER_ERROR[kvp]- [line: 9399173, col: 9 ] Bad character encoding
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 9399173, col: 9 ] Bad character encoding
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:153)
at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:317)
at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:178)
at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
at org.apache.jena.riot.RDFParser.read(RDFParser.java:353)
at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:343)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:292)
at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:540)
The line number does not correspond to any problematic character I can see. Could you confirm?
Note : I am running on Linux here, and as far as I understand the file parsing is using UTF-8 as character encoding.
I cannot detect any encoding problems at that line either. Does it work if you remove it?
The file has been generated using Raptor RDF under Linux.
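To pin down the offending byte independently of Jena, a small stdlib-only sketch can scan the file and report the first position that fails UTF-8 decoding. The function name is mine; line-by-line decoding is safe here because UTF-8 continuation bytes never collide with the newline byte:

```python
def find_bad_utf8(path: str):
    """Return (line, column, byte_value) of the first byte that fails UTF-8
    decoding, or None if the whole file decodes cleanly (1-based positions)."""
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                return lineno, err.start + 1, raw[err.start]
    return None
```

Comparing the reported position with Jena's `[line: …, col: …]` should show whether the problem is a genuinely invalid byte or a disagreement about the encoding.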
Hello @fkraeutli, I will wait until new versions of the dumps are generated without user information and then make further tests; a reduced file size could make things a little easier.
Hi @tfrancart, the latest dump should not include user information
Thank you. On https://github.com/sapa/performing-arts-ch-dump the dump files date back 3 weeks. I suppose I should see a more recent update date?
If you're happy with the format I can deploy it to prod and run it weekly via a cronjob. Currently this is the Dev data
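If the weekly production run ends up going through cron, a crontab entry along these lines would do; the script path and log location are hypothetical:

```shell
# Hypothetical crontab entry: run the dump pipeline every Sunday at 03:00
# and append stdout/stderr to a log file for troubleshooting.
0 3 * * 0 /opt/sapa/run-dump-pipeline.sh >> /var/log/sapa-dump.log 2>&1
```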
But I don't know where to test this format to tell you whether I am happy with it :-) The only dumps I can test are the ones at https://github.com/sapa/performing-arts-ch-dump
Dear @fkraeutli
Just to clarify: Baptiste and I expect the following:
Thanks !
Sounds good, I will move the dump scripts to production
@tfrancart The dump has now been produced from the prod database.
Dear @fkraeutli the file dump.ttl looks corrupted and cannot be parsed:
thomas@georges-courteline:~/sparna/00-Clients/SAPA/01-Modelisation/23-dumps$ tail dump.ttl
x:a40c5f36-bbd0-4b20-9d1e-262f3f8bba9b
crm:P82a_begin_of_the_begin "1973-11-22"^^xsd:date ;
crm:P82b_end_of_the_end "1973-11-22"^^xsd:date ;
a crm:E52_Time-Span ;
rdfs:label "22.11.1973" .
x:a40c64b0-de16-4b63-9e55-82b441bbbf33
<http://purl.org/ontology/olo/core#index> 0 ;
crm:P14_carried_out_by a:e68a6f18-5877-4bec-835a-0261dcf07f71 ;
crm:P2_has_typethomas@georges-courteline:~/sparna/00-Clients/SAPA/01-Modelisation/23-dumps$
As you can see, the file ends abruptly at crm:P2_has_type, with no object and no statement terminator. Parsing fails:
08:58:51.166 [main] ERROR org.apache.jena.riot -%PARSER_ERROR[kvp]- [line: 9544050, col: 17] Bad character encoding
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 9544050, col: 17] Bad character encoding
It seems the .nq file has the same issue:
thomas@georges-courteline:~/sparna/00-Clients/SAPA/01-Modelisation/23-dumps$ tail dump.nq
<http://data.performing-arts.ch/x/c96cfc8c-4184-4ad4-bee2-80ade72e1609> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E7_Activity> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/c9caeaaa-3db3-490b-8d92-ab559e2bb666> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E42_Identifier> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/c9caeaaa-3db3-490b-8d92-ab559e2bb666> <http://www.w3.org/1999/02/22-rdf-syntax-ns#value> "600011994091601" <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd0884e0-43c3-489c-8bd3-e9a95101344f> <http://www.cidoc-crm.org/cidoc-crm/P14_carried_out_by> <http://data.performing-arts.ch/a/3aa6aa64-61b1-406c-a254-a72607bc8148> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd0884e0-43c3-489c-8bd3-e9a95101344f> <http://www.cidoc-crm.org/cidoc-crm/P2_has_type> <http://vocab.performing-arts.ch/mulga> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd0884e0-43c3-489c-8bd3-e9a95101344f> <http://purl.org/ontology/olo/core#index> "0"^^<http://www.w3.org/2001/XMLSchema#integer> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd0884e0-43c3-489c-8bd3-e9a95101344f> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E7_Activity> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd9c59bd-a91a-4b56-bd76-56e2b48f2adf> <http://www.cidoc-crm.org/cidoc-crm/P14_carried_out_by> <http://data.performing-arts.ch/u/f8072a4a-a997-4276-904f-7fa773fb6738> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd9c59bd-a91a-4b56-bd76-56e2b48f2adf> <http://www.cidoc-crm.org/cidoc-crm/P2_has_type> <http://vocab.performing-arts.ch/muecs> <http://data.performing-arts.ch/w/2b74f349-3269-466a-aea2-5ea81766b430/container/temp-context> .
<http://data.performing-arts.ch/x/cd9c59bd-a91a-4b56-bd76-56e2b48f2adf> <http://thomas@georges-courteline:~/sparna/00-Clients/SAPA/01-Modelisation/23-dumps$
As you can see at the end, it stops abruptly at <http://. Parsing fails with this error:
09:12:41.209 [main] ERROR org.apache.jena.riot -%PARSER_ERROR[kvp]- [line: 7732834, col: 81] Broken IRI (End of file)
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 7732834, col: 81] Broken IRI (End of file)
I tried to re-download and re-unzip both files, without luck.
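One quick way to tell a truncated download or upload from a bad export is to stream-decompress the .gz file to the end: Python's gzip module raises an error if the archive stops before the end-of-stream marker. A minimal sketch (helper name is mine):

```python
import gzip

def gzip_is_complete(path: str) -> bool:
    """Stream-decompress a .gz file to the end; a truncated file makes the
    gzip module raise EOFError (or BadGzipFile, an OSError subclass)."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):  # read in 1 MiB chunks, discard the data
                pass
        return True
    except (EOFError, OSError):
        return False
```

If this returns True on both files, the truncation happened before compression, i.e. in the export step itself rather than in transfer.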
Thanks, I'll re-run the pipeline and check if the issue persists
Issue persists. I'll investigate
The issue occurs in the step that reformats the Turtle file to include prefixes. It crashes because the file to process is too large. We'll have to omit the prefixing step.
Thanks @fkraeutli for the investigation.
Dear @tfrancart, we will have a size problem importing into GraphDB, and we have to see how the SHACL analysis goes. Perhaps you can find other tools to add the prefixes after the dump is created?
> The issue occurs in the step that reformats the Turtle file to include prefixes. It crashes because the file to process is too large. We'll have to omit the prefixing step.
This is unfortunate. If we cannot deal with this issue ourselves, then every potential reuser of the data will face the same issue. Could we try splitting the export into smaller files? E.g. create batches of 100,000 triples per file (or any other meaningful number to obtain, say, 7 to 10 files in the end)?
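Splitting is straightforward for the line-based serializations (N-Triples/N-Quads), since each line is a complete statement; Turtle would need a real parser because a statement can span several lines. A stdlib-only sketch, with names of my choosing:

```python
from itertools import islice

def batched_lines(lines, batch_size=100_000):
    """Yield lists of at most batch_size lines. For N-Triples/N-Quads every
    line is a complete statement, so each batch is independently parseable."""
    it = iter(lines)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Each batch can then be written to its own numbered file (dump-001.nq, dump-002.nq, ...) and compressed separately.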
I suggest you:
I added a new volume with more space and am executing the pipeline there
@tfrancart Can you check the latest dump?
Hello, the dump looks fine and can be successfully loaded into GraphDB. It is not easy to do so because of the size of the file, but it works. I would now need to test a SHACL validation on the same data. These files could be advertised by adding a link from the About page, or a "developer's corner" page.
We would like to offer the public access to regular dumps of the data from our platform: https://github.com/sapa/performing-arts-ch-dump
Which frequencies are possible? Once per day / per week?
Could we offer it through GitHub?
In which format:
SAPAwiki : https://wiki.sapa.swiss/index.php?title=SPAP/DUMP