Hi @Stiksels,
Please correct me if I am wrong: as far as I can see, you are accessing Google BigQuery, storing intermediate results as CSV, and then generating the RDF with morph-kgc.
Is there any reason for that page size? I think it is too small; you can probably increase it to 100k or more.
Also, have you tried writing the triples to a file instead of Fuseki? Just to confirm that the problem is not the interaction with Fuseki.
I also see that you are using BigQuery. We could create a connector to it with python-bigquery-sqlalchemy.
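To make the file option concrete, here is a minimal sketch (the mapping file, CSV path, and config options are placeholders based on my reading of the morph-kgc docs, not your actual setup):

```python
# Sketch only: materialize one batch with morph-kgc and write the result
# to disk instead of loading it into Fuseki. Paths and option names are
# placeholders; adjust them to your mappings and data source.
import morph_kgc

CONFIG = """
[DataSourceCSV]
mappings=mapping.rml.ttl
file_path=batch.csv
"""

graph = morph_kgc.materialize(CONFIG)                 # returns an rdflib Graph
graph.serialize(destination="batch.nt", format="nt")  # N-Triples on disk
```

If the job completes when writing to disk, that would suggest the memory issue is in the interaction with Fuseki rather than in the materialization itself.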
Hi @arenas-guerrero-julian,
Thank you for the fast reply. I experimented with the pageSize: with a batch size of 1,000 rows, the transformer job would run out of memory after about an hour. With an increased batch size of 5,000 rows, the job ran out of memory faster (after roughly the same number of triples had been generated). I will try setting the pageSize to 100k and see the impact.
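For context, the paging loop works roughly along these lines (a simplified sketch, not the actual job code; the query and file names are placeholders):

```python
# Simplified sketch of the BigQuery paging step: fetch the results in pages
# of page_size rows and dump each page to a CSV that morph-kgc consumes.
import csv
from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query("SELECT * FROM `project.dataset.table`")  # placeholder query

row_iterator = query_job.result(page_size=100_000)  # page_size is the knob being tuned

for i, page in enumerate(row_iterator.pages):
    with open(f"batch_{i}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([field.name for field in row_iterator.schema])
        for row in page:
            writer.writerow(row.values())
```

Each page is written to its own CSV so that only one page of rows is held in memory at a time.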
A BigQuery connector could also be very interesting for our implementation.
Kind regards, Stan
If possible, please also try writing the triples to disk rather than to Fuseki.
With the increased batch size, the transformer job immediately ends up in the OOMKilled state when trying to write to Fuseki...
INFO:root:23 mapping rules retrieved.
INFO:root:Mapping partition with 23 groups generated.
INFO:root:Maximum number of rules within mapping group: 1.
INFO:root:Mappings processed in 1.814 seconds.
INFO:root:Number of triples generated in total: 2060739.
Our platform expects all transformer jobs to write to Fuseki.
I am not sure whether it is Fuseki or morph-kgc causing the OOM; I would need to take a closer look. Write to me at julian.arenas.guerrero@upm.es if I can help.
Thanks for the input, @arenas-guerrero-julian. We'll continue experimenting with memory allocation, batch size, and Fuseki optimization.
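As a sketch of the kind of thing we'll experiment with on the Fuseki side: pushing each batch over the SPARQL Graph Store Protocol and dropping the batch graph afterwards (the endpoint URL and config handling are placeholders, not our actual setup):

```python
# Sketch only: materialize one batch and POST it to Fuseki via the
# SPARQL Graph Store Protocol, so only one batch graph lives in memory.
import morph_kgc
import requests

FUSEKI_GSP_ENDPOINT = "http://fuseki:3030/dataset/data?default"  # placeholder URL

def push_batch(config: str) -> None:
    graph = morph_kgc.materialize(config)   # rdflib Graph for this batch only
    payload = graph.serialize(format="nt")
    if isinstance(payload, str):            # rdflib >= 6 returns str, older versions bytes
        payload = payload.encode("utf-8")
    response = requests.post(
        FUSEKI_GSP_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/n-triples"},
    )
    response.raise_for_status()
    del graph, payload                      # drop references before the next batch
```

Posting plain N-Triples per batch avoids building one large graph for the whole dataset before the upload step.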
Hi,
We're seeing an issue when running our transformation code, in which morph_kgc is involved. In short: we're iterating through a dataset of approximately 1 million results, processing relatively small batches (1,000 rows). Every iteration involves generating a graph_store from the batch of results, and while running the transformation code we see a memory increment of around 20 MiB at this step. After processing roughly 500k results, the transformation job reaches its memory limit and fails with an out-of-memory error.
I'm trying to optimize the code so that garbage collection runs and memory is freed after every processed batch, but it doesn't seem to help.
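The per-batch cleanup I'm attempting looks roughly like this (a simplified sketch; process_batch, the config handling, and the write step are placeholders):

```python
# Simplified sketch of the per-batch cleanup being attempted: drop the
# batch graph explicitly and force a garbage collection pass afterwards.
import gc
import morph_kgc

def process_batch(config: str) -> int:
    graph = morph_kgc.materialize(config)  # graph for this batch only
    triple_count = len(graph)
    # ... write the batch graph to Fuseki / disk here ...
    del graph                              # drop the only reference to the graph
    gc.collect()                           # force collection before the next batch
    return triple_count
```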
Memory profiler logs:
[Screenshot: morph_kgc memory increment example]