omeka-s-modules / DspaceConnector

Connect to / import from a Dspace repo into Omeka S
GNU General Public License v3.0

Support for DSpace subcommunities #43

Closed mjlassila closed 7 years ago

mjlassila commented 7 years ago

It is quite common for DSpace instances to have a deeply hierarchical community/collection structure, but currently the connector returns only the top-level collections and communities in the DSpace installation; subcommunities are not retrieved. The DSpace REST API supports getting subcommunity information, so it would be nice if the connector could also retrieve subcommunities. Preserving the community/collection hierarchy is probably not important.

patrickmj commented 7 years ago

Looks like the test instance I was working from doesn't have subcommunities. Thanks for the heads-up.

patrickmj commented 7 years ago

@mjlassila I'm having a hard time nailing down the details on this. In the new tests I've been doing, the calls to get all the communities and their collections seem to work, though I haven't looked at the entire set of data available, just checked through a bunch of examples. From my results, all the communities and collections get pulled in via the /collections and /communities endpoints.

Could you point me to an example of where this isn't working as desired?

mjlassila commented 7 years ago

Thanks for investigating this issue. I'll try to provide a few more details here.

For example, in our repository we have a top-level community, Historical Maps: https://jyx.jyu.fi/dspace/handle/123456789/6533

Inside of this community, there is one subcommunity and three collections.

The majority of items reside in the subcommunity https://jyx.jyu.fi/dspace/handle/123456789/24994

Inside of this subcommunity, there are four subcommunities and ten collections.

Each of these subcommunities might have collections inside. For example, the City maps subcommunity https://jyx.jyu.fi/dspace/handle/123456789/20329 has two collections.

Currently, the call to the REST endpoint in IndexController.php retrieves only the top level of the community-collection structure. In many repositories, such as ours, the community-collection structure is deeply nested. Here is our repository's community-collection structure in full: https://jyx.jyu.fi/dspace/community-list.

If one modifies the call in IndexController.php to also include subcommunities (expand=collections,subCommunities), the call returns the topmost subcommunities and collections, but not the underlying hierarchy.

To get to the hierarchy, one must make expand=collections,subCommunities calls to every subcommunity individually. It is not sufficient to call expand=all at top community level, as it only returns the topmost subcommunities.
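The per-subcommunity walk described above could be sketched roughly as follows. This is an illustration in Python, not the module's PHP code. The `fetch` callable, the `walk_communities` name, the `/communities/{id}` path shape, and the `collections` / `subcommunities` / `uuid` / `id` JSON keys are all assumptions that should be checked against an actual response.

```python
from typing import Callable, List


def walk_communities(fetch: Callable[[str], dict], community_id: str) -> List[dict]:
    """Collect every collection below a community, descending into subcommunities.

    `fetch` is a hypothetical helper that takes a REST path and returns the
    decoded JSON object for it.
    """
    collections: List[dict] = []
    pending = [community_id]   # communities still to be expanded
    seen = set()               # guards against visiting a community twice

    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)

        # One expand call per community, since a single call only
        # returns the level directly below it.
        data = fetch(f"/communities/{current}?expand=collections,subCommunities")
        collections.extend(data.get("collections") or [])

        for sub in data.get("subcommunities") or []:
            # Assumed identifier keys: "uuid" or, failing that, "id".
            pending.append(str(sub.get("uuid") or sub.get("id")))

    return collections
```

In real use, `fetch` would wrap an HTTP GET against the repository's REST base URL. The cost is one request per community, which is why this gets slow on deep trees.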

I put some example data available at https://www.dropbox.com/s/auy2elpb3t75edk/example-dspace-rest-data.zip?dl=1

patrickmj commented 7 years ago

Thanks for the detailed info, and the data to dig through.

Unfortunately, this still leaves me confused about what's going on. According to the DSpace API documentation, communities should return all the communities, and top-communities would return only the top level ones. That's also what I've been seeing in my latest looks at our local DSpace instance. But I'll keep exploring.

On the hierarchies, do you want to preserve the hierarchies of communities? It sounded earlier like you didn't, but just want to make sure what the desired outcome is.

It might also help to know what DSpace version you are using.

mjlassila commented 7 years ago

It seems that the culprit might be in our infrastructure. The DSpace instance (DSpace 6.2) I have been running my tests against returns the data in the form I described in my previous comment. This instance has the same data as our production instance -- but on our other test instance (DSpace 6.0), which has toy data with deep hierarchies, communities does indeed return all the communities!

I'll investigate whether there is a bug in DSpace 6.2 or in our data which causes this problem and report back.

mjlassila commented 7 years ago

The DSpace 6 REST documentation doesn't mention the limit parameter, which controls the number of items per response, but as in DSpace 5, this parameter is also in effect in DSpace 6. Adding the limit parameter with a high value to the collections call resolved the problem. It might be better to solve this with a low initial limit plus an offset, as is done in the importCollection function, but this quick and dirty solution was good enough in our case :)
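For illustration, the limit/offset alternative might look something like the sketch below. It is written in Python rather than the module's PHP, and `fetch_page` and `fetch_all` are hypothetical names; `fetch_page(limit, offset)` is assumed to request something like `/collections?limit=...&offset=...` and return the decoded list.

```python
from typing import Callable, List


def fetch_all(fetch_page: Callable[[int, int], list], limit: int = 50) -> List[dict]:
    """Page through a listing endpoint until a short (or empty) page arrives."""
    results: List[dict] = []
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        results.extend(page)
        # A page shorter than `limit` means there is nothing left to fetch.
        if len(page) < limit:
            break
        offset += limit
    return results
```

A smaller `limit` keeps each response quick at the cost of more round trips. When the total is an exact multiple of `limit`, one extra request returns an empty page and ends the loop.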

mjlassila commented 7 years ago

It occurred to me that if backward compatibility with the DSpace 5 REST API is not important, DSpace 6 has a hierarchy API endpoint that returns a simplified representation of the whole community/collection structure. Compared to the communities?expand=collections call, it is much faster.
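As a rough sketch of how such a nested response could be flattened into a plain list of collections (Python, for illustration only): the `community`, `collection`, and `name` keys below are assumptions about the response shape, and `flatten_hierarchy` is a hypothetical helper, so verify both against a real response first.

```python
from typing import List, Tuple


def flatten_hierarchy(node: dict, path: Tuple[str, ...] = ()) -> List[Tuple[Tuple[str, ...], str]]:
    """Return (community path, collection name) pairs for a nested hierarchy.

    Assumes each node may carry a "collection" list and a "community" list
    of child nodes; missing or null lists are treated as empty.
    """
    rows = []
    for coll in node.get("collection") or []:
        rows.append((path, coll["name"]))
    for comm in node.get("community") or []:
        # Recurse with the child community's name appended to the path.
        rows.extend(flatten_hierarchy(comm, path + (comm["name"],)))
    return rows
```

The path tuple can simply be ignored if preserving the hierarchy is not needed.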

patrickmj commented 7 years ago

Thanks much for your digging around on this. I like your idea of using limit and offset, and will also try out the hierarchy approach to compare. That does sound faster. I'll just want to poke around in the results a bit.

patrickmj commented 7 years ago

@mjlassila Thanks again. I went with the offset/limit approach, and made it configurable. It makes it slower, but it sounds like it might be more generally helpful.

mjlassila commented 6 years ago

Thanks! I noticed that the limit setting was missing from the import form and therefore Omeka gave an error message:

Notice: Undefined index: limit in /var/www/html/modules/DspaceConnector/src/Controller/IndexController.php on line 32

The changes needed are in https://github.com/mjlassila/DspaceConnector/commit/c9eb91cd4cf7b36e0a83517750c82ff68e8698b2

I also increased the timeout for communities?expand=collections because the default 10-second timeout was too short even when the limit was set below 100 items. mjlassila/DspaceConnector/IndexController/L80

patrickmj commented 6 years ago

Those changes look good to me. Could you make them a pull request to help with the automatic checking and other management?