proycon / codemetapy

A Python package for generating and working with codemeta
https://codemeta.github.io/
GNU General Public License v3.0

graph creation of many entries fails with recursion depth error #35

Closed broeder-j closed 1 year ago

broeder-j commented 1 year ago

I am not sure whether this is related to the number of JSON-LD files or to their content. The number of JSON-LD files is >1200. Generating subgraphs of subsets of this data works, so I assume the quantity is the issue, also because it is a recursion error. However, I have not yet tried whether I can create a subgraph for really ALL of these files.

codemetapy --graph codemeta_results/git_*/*/*/codemeta_*.json > graph.json
...
  File "/home//work/git/codemetapy/codemeta/serializers/jsonld.py", line 182, in embed_items
    return embed_items(itemmap[data[idkey]], itemmap, copy(history))
  File "/usr/lib/python3.8/copy.py", line 72, in copy
    cls = type(x)
RecursionError: maximum recursion depth exceeded while calling a Python object

So the current graph serializer does not scale. I have seen this with different JSON-LD file sets, where it fails after about 2000 files, and the error occurs at a different file each time.

The default Python recursion limit is 1000.
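For reference, CPython's recursion limit can be inspected with `sys.getrecursionlimit()`; the small demonstration below (not part of codemetapy) shows how quickly plain recursion hits it. Raising the limit only postpones the failure.

```python
import sys

# CPython ships with a recursion limit of 1000 frames by default.
print(sys.getrecursionlimit())

def depth(n=0):
    """Recurse until the interpreter raises RecursionError,
    then report how deep we got."""
    try:
        return depth(n + 1)
    except RecursionError:
        return n

# The reachable depth is a bit below the limit because the
# surrounding calls already occupy some stack frames.
print(depth())
```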

broeder-j commented 1 year ago

I have not looked into the code itself, but either the recursion has to be removed, or one could first serialize batches and combine them at the end, provided that combining JSON graph files scales.
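The batching idea could be sketched roughly as follows. This is a hypothetical helper (not part of codemetapy; the function name and the last-writer-wins merge policy are invented) that combines several JSON-LD documents carrying an `@graph` array, de-duplicating nodes by `@id`:

```python
import json

def merge_jsonld_graphs(paths):
    """Merge several JSON-LD documents that each carry an @graph
    array into one document, de-duplicating nodes by @id."""
    nodes = {}     # @id -> node
    context = None
    for path in paths:
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
        context = context or doc.get("@context")
        for i, node in enumerate(doc.get("@graph", [])):
            # Last writer wins on @id collisions; a real merge
            # might deep-merge the properties instead.
            nodes[node.get("@id", (path, i))] = node
    return {"@context": context, "@graph": list(nodes.values())}
```

Whether this scales depends on how much the batches overlap; the `nodes` dict keeps only one copy per `@id`, so memory grows with the number of distinct nodes rather than the number of files.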

proycon commented 1 year ago

This indeed seems to be a bug, but it should not be related to the number of files. It fails when expanding the JSON-LD representation because of some cycle in the graph (even though the code protects against that; that is where I guess something is going wrong). I'd be interested in seeing exactly the file where it fails, and I wonder whether it can be pinpointed to a single file.
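The pattern behind the failing `embed_items` can be illustrated with a minimal sketch (simplified and not the actual codemetapy code): references are expanded recursively, and a history of ids seen on the current path is carried along so a cycle leaves the bare reference instead of recursing forever.

```python
def embed_items(data, itemmap, history=None):
    """Recursively replace bare {"@id": ...} references with the
    full item from itemmap, skipping ids already seen on this
    path so a cycle in the graph cannot recurse indefinitely."""
    history = history or set()
    if isinstance(data, dict):
        ref = data.get("@id")
        if ref is not None and set(data) == {"@id"} and ref in itemmap:
            if ref in history:
                return data  # cycle detected: keep the bare reference
            return embed_items(itemmap[ref], itemmap, history | {ref})
        return {k: embed_items(v, itemmap, history) for k, v in data.items()}
    if isinstance(data, list):
        return [embed_items(v, itemmap, history) for v in data]
    return data
```

If the history is ever dropped or copied incorrectly along one branch, a cycle turns into unbounded recursion, which would produce exactly the `RecursionError` in the traceback above.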

There are some left-over debug statements in the code which you could enable to see where it fails: https://github.com/proycon/codemetapy/blob/master/codemeta/serializers/jsonld.py#L178 . If you send me the input files, I can try to reproduce it.

> The default python recursion depth is around 1000.

and I don't intend to get anywhere near that ;) That would be bad design.

> one could first serialize batches and combine these in the end if combination of json graph files scales.

That could work, yes.

broeder-j commented 1 year ago

The files from the last printout,

Adding json-ld file from filex/codemeta_harvested.json to graph
    Found main resource with URI xx/snapshot

do not fail if serialized to a graph alone or together with a few files, so it really depends on the accumulated history. I will investigate this and let you know.

proycon commented 1 year ago

I have implemented some fixes (to be released in 2.4.0) that should hopefully prevent this bug, although it may still result in big serialisations, as codemetapy expands things quite eagerly.