ilri / dspace-statistics-api

A simple REST API to expose Solr view and download statistics for items in a DSpace repository.
GNU General Public License v3.0

KeyError: 'stats' #16

Open eulereadgbe opened 2 years ago

eulereadgbe commented 2 years ago

@alanorth, when I tried running python -m dspace_statistics_api.indexer, I received this error:

  File "C:\Users\euler\.pyenv\pyenv-win\versions\3.7.9\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\euler\.pyenv\pyenv-win\versions\3.7.9\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\dspace-statistics-api\dspace_statistics_api\indexer.py", line 223, in <module>
    index_views("items", "id")
  File "D:\dspace-statistics-api\dspace_statistics_api\indexer.py", line 56, in index_views
    results_totalNumFacets = res.json()["stats"]["stats_fields"][facetField][
KeyError: 'stats'

I tried this in a repository with no shards, and in another with sharded statistics. Both repositories use DSpace 6.3 on Windows Server 2019, and I tested with Python 3.7.9, 3.9.1, and 3.9.10. What could I be missing?

alanorth commented 2 years ago

@eulereadgbe seems there is something wrong with current 1.4.4-dev. I just noticed the same bug in my test environment, but v1.4.3 works.

eulereadgbe commented 2 years ago

I downloaded the v1.4.3 tag, but I still get the same error as on the master and v6_x branches. I also tested the 1.2.0, 1.4.2, and 1.4.3 releases.

(venv) E:\dspace-statistics-api-1.4.3>python -m dspace_statistics_api.indexer
Traceback (most recent call last):
  File "C:\Users\Administrator\.pyenv\pyenv-win\versions\3.7.9\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Administrator\.pyenv\pyenv-win\versions\3.7.9\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "E:\dspace-statistics-api-1.4.3\dspace_statistics_api\indexer.py", line 223, in <module>
    index_views("items", "id")
  File "E:\dspace-statistics-api-1.4.3\dspace_statistics_api\indexer.py", line 56, in index_views
    results_totalNumFacets = res.json()["stats"]["stats_fields"][facetField][
KeyError: 'stats'

I'm not sure if my issue is Windows OS related only. Sorry I don't have a non-Windows instance where I can test this.

alanorth commented 2 years ago

Yes actually I was mistaken, v1.4.4-dev is working here also (and looking at the few git commits since v1.4.3 I haven't changed anything other than updating dependencies).

So back to your problem. Are you using the built-in Solr 4.10.x that comes with DSpace 6.x, or a standalone Solr? This is the HTTP request that the indexer makes to Solr:

http://localhost:8080/solr/statistics/select?q=type%3A2+AND+id%3A%2F.%7B36%7D%2F&fq=-isBot%3Atrue+AND+statistics_type%3Aview&fl=id&facet=true&facet.field=id&facet.mincount=1&facet.limit=1&facet.offset=0&stats=true&stats.field=id&stats.calcdistinct=true&shards=&rows=0&wt=json

Do you get a result if you paste this URL into your browser? Note that this assumes Solr is at http://localhost:8080/solr.
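For reference, the query string above can be rebuilt from its individual parameters, which makes it easier to see what is being asked of Solr. This is only a minimal sketch using the Python standard library: the helper name build_stats_url is hypothetical and is not taken from the indexer, and the parameter values are simply decoded from the URL above.

```python
from urllib.parse import urlencode


def build_stats_url(solr_server: str) -> str:
    """Rebuild the Solr select URL quoted above (hypothetical helper)."""
    params = {
        "q": "type:2 AND id:/.{36}/",  # type 2 documents whose id matches a 36-character regex
        "fq": "-isBot:true AND statistics_type:view",  # exclude bots, keep view events
        "fl": "id",
        "facet": "true",
        "facet.field": "id",
        "facet.mincount": 1,
        "facet.limit": 1,
        "facet.offset": 0,
        "stats": "true",  # ask for the "stats" section that the indexer reads
        "stats.field": "id",
        "stats.calcdistinct": "true",
        "shards": "",  # empty here, as in the URL above
        "rows": 0,
        "wt": "json",
    }
    # urlencode() percent-encodes the values and turns spaces into "+"
    return f"{solr_server}/statistics/select?{urlencode(params)}"


url = build_stats_url("http://localhost:8080/solr")
print(url)
```

If Solr listens somewhere else, only the solr_server argument needs to change.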

eulereadgbe commented 2 years ago

I checked the Solr version I'm using and it is version 4.10.4 and this Solr comes with the DSpace 6.3 installation. Below is the result of the Solr query:

{"responseHeader":{"status":500,"QTime":135,"params":{"stats.calcdistinct":"true","facet.field":"id","fl":"id","fq":"-isBot:true AND statistics_type:view","rows":"0","q":"type:2 AND id:/.{36}/","facet.limit":"1","shards":"","stats":"true","facet.mincount":"1","facet":"true","wt":"json","facet.offset":"0","stats.field":"id"}},"response":{"numFound":146786,"start":0,"docs":[]},"facet_counts":{"facet_queries":{},"facet_fields":{"id":["34d62239-a4bf-4f19-b662-64b1820b0adc",1529]},"facet_dates":{},"facet_ranges":{},"facet_intervals":{}},"error":{"msg":"Invalid shift value in prefixCoded bytes (is encoded value really an INT?)","trace":"java.lang.NumberFormatException: Invalid shift value in prefixCoded bytes (is encoded value really an INT?)
    at org.apache.lucene.util.NumericUtils.getPrefixCodedIntShift(NumericUtils.java:209)
    at org.apache.lucene.util.NumericUtils$2.accept(NumericUtils.java:497)
    at org.apache.lucene.index.FilteredTermsEnum.next(FilteredTermsEnum.java:244)
    at org.apache.lucene.search.FieldCacheImpl$Uninvert.uninvert(FieldCacheImpl.java:309)
    at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:712)
    at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:213)
    at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:600)
    at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:665)
    at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:213)
    at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:600)
    at org.apache.lucene.queries.function.valuesource.IntFieldSource.getValues(IntFieldSource.java:57)
    at org.apache.solr.handler.component.AbstractStatsValues.setNextReader(StatsValuesFactory.java:220)
    at org.apache.solr.handler.component.SimpleStats.getFieldCacheStats(StatsComponent.java:368)
    at org.apache.solr.handler.component.SimpleStats.getStatsFields(StatsComponent.java:326)
    at org.apache.solr.handler.component.SimpleStats.getStatsCounts(StatsComponent.java:290)
    at org.apache.solr.handler.component.StatsComponent.process(StatsComponent.java:79)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1976)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:541)
    at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:348)
    at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:53)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
    at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:235)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:690)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:373)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1590)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:745)
","code":500}}

It seems to be returning a 500 error. Is there something wrong with my Solr instance? I also tried this with the other 6.3 repositories I maintain and the results were the same. However, when I tried it on an instance running 6.4-SNAPSHOT, the results were OK.

Results from an instance running 6.4-SNAPSHOT:

{
  "responseHeader": {
    "status": 0,
    "QTime": 15,
    "params": {
      "stats.calcdistinct": "true",
      "facet.field": "id",
      "fl": "id",
      "fq": "-isBot:true AND statistics_type:view",
      "rows": "0",
      "q": "type:2 AND id:/.{36}/",
      "facet.limit": "1",
      "shards": "",
      "stats": "true",
      "facet.mincount": "1",
      "facet": "true",
      "wt": "json",
      "facet.offset": "0",
      "stats.field": "id"
    }
  },
  "response": {
    "numFound": 523,
    "start": 0,
    "docs": []
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "id": [
        "650e8f6d-7b7a-48f8-9a2e-cb4a176a0a2d",
        58
      ]
    },
    "facet_dates": {},
    "facet_ranges": {},
    "facet_intervals": {}
  },
  "stats": {
    "stats_fields": {
      "id": {
        "min": "0b860c68-4bfe-4462-900b-283b9114f449",
        "max": "fd17ff2e-b892-460f-89ee-ad0b09aea0ac",
        "count": 523,
        "missing": 0,
        "distinctValues": [
          "0b860c68-4bfe-4462-900b-283b9114f449",
          "3d2528d2-adf7-43ac-bb21-844df1b6cd38",
          "41b18e82-f7b1-48d1-8a90-add8b778b064",
          "428a60e1-6af0-41ad-89d9-65b7116b00a2",
          "54ff34d8-61ba-4211-9d49-261f9d4458dc",
          "58ac556a-9b69-4546-a612-b3f80c442a17",
          "5add0527-1b67-461a-94de-18cb2bb9eb82",
          "650e8f6d-7b7a-48f8-9a2e-cb4a176a0a2d",
          "66e28449-8c0e-466b-a008-8b3f83b41139",
          "6aedd0b2-1e67-41df-9f94-307f8f2147f9",
          "8420fe5b-5adf-4cd7-99d3-5f498547b8ba",
          "85b53292-ff49-4e9d-8cf7-591cf615dc8e",
          "89b64227-fb99-4477-ae49-1f767eaa3093",
          "8c7bd6ad-ce10-4402-a88c-7182383b55c2",
          "92ac4ab2-5387-4452-859a-4d375246ed3c",
          "95ef3da4-4f90-4792-842a-7a368787b37b",
          "9ffdf464-a5ce-48bb-9985-c3111e3cc613",
          "b1630edf-3ce7-4856-b0f1-e5dd926da20f",
          "ee4b9cf1-1a98-43b3-8988-a2199ec9f33a",
          "fd17ff2e-b892-460f-89ee-ad0b09aea0ac"
        ],
        "countDistinct": 20,
        "facets": {}
      }
    }
  }
}

I'll try dspace-statistics-api in this repository where the Solr query you sent is working.
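Comparing the two responses explains the original KeyError: the failing response carries an "error" object and has no "stats" key at all, so indexing into ["stats"] raises before the real Solr message is ever shown. A minimal sketch of a check that surfaces the Solr message instead is below. The names SolrResponseError and get_field_stats are hypothetical and are not part of dspace-statistics-api, and the two payloads are abbreviated from the responses above.

```python
class SolrResponseError(RuntimeError):
    """Raised when Solr reports an error instead of the expected stats."""


def get_field_stats(payload: dict, facet_field: str) -> dict:
    """Return payload["stats"]["stats_fields"][facet_field] or raise a readable error."""
    header = payload.get("responseHeader", {})
    error = payload.get("error")
    if error is not None or header.get("status", 0) != 0:
        msg = (error or {}).get("msg", "unknown error")
        raise SolrResponseError(f"Solr returned status {header.get('status')}: {msg}")
    try:
        return payload["stats"]["stats_fields"][facet_field]
    except KeyError as exc:
        raise SolrResponseError(f"Solr response has no stats for {facet_field!r}") from exc


# Abbreviated from the DSpace 6.3 response above
failing = {
    "responseHeader": {"status": 500},
    "error": {
        "msg": "Invalid shift value in prefixCoded bytes (is encoded value really an INT?)",
        "code": 500,
    },
}

# Abbreviated from the 6.4-SNAPSHOT response above
working = {
    "responseHeader": {"status": 0},
    "stats": {"stats_fields": {"id": {"count": 523, "countDistinct": 20}}},
}

print(get_field_stats(working, "id"))
try:
    get_field_stats(failing, "id")
except SolrResponseError as exc:
    print(exc)
```

With a check like this, the first run would have reported the NumberFormatException message from Solr rather than a bare KeyError: 'stats'.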

alanorth commented 2 years ago

That's really strange. Seems to be something with Solr... I don't know, but this is weird:

java.lang.NumberFormatException: Invalid shift value in prefixCoded bytes (is encoded value really an INT?)

I see some results on Google for that, related to Elasticsearch and Solr, both of which are based on Lucene. Unfortunately I am not an expert on Solr so this is beyond me. I manage two DSpace 6.3 installations and test this locally in my dev environment as well and it works on all...

eulereadgbe commented 2 years ago

@alanorth, so I just tested this in a repository where your Solr query did not return an error. I just upgraded the Solr of this repository to use UUIDs, and I made sure that all INT IDs were migrated to UUIDs:

Connecting to http://localhost:8080/solr/statistics

=================================================================
        *** Statistics Records with Legacy Id ***

                   0    Bistream View
                   0    Item View
                   0    Collection View
                   0    Community View
                   0    Collection Search
                   0    Community Search
        --------------------------------------
                   0    TOTAL
=================================================================

                   0      TOTAL... (     1 sec; 0:00:01; DB cache:      0/       0; Docs:      0)

However, when I ran python -m dspace_statistics_api.indexer, the console returned this message:

(venv) C:\Users\Administrator\Documents\dspace-statistics-api>python -m dspace_statistics_api.indexer
items: indexing views (page 1 of 11)
items: indexing views (page 2 of 11)
items: indexing views (page 3 of 11)
items: indexing views (page 4 of 11)
items: indexing views (page 5 of 11)
items: indexing views (page 6 of 11)
items: indexing views (page 7 of 11)
items: indexing views (page 8 of 11)
items: indexing views (page 9 of 11)
items: indexing views (page 10 of 11)
items: indexing views (page 11 of 11)
communities: indexing views (page 1 of 2)
Traceback (most recent call last):
  File "C:\Users\Administrator\.pyenv\pyenv-win\versions\3.7.9\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Administrator\.pyenv\pyenv-win\versions\3.7.9\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Administrator\Documents\dspace-statistics-api\dspace_statistics_api\indexer.py", line 224, in <module>
    index_views("communities", "owningComm")
  File "C:\Users\Administrator\Documents\dspace-statistics-api\dspace_statistics_api\indexer.py", line 105, in index_views
    psycopg2.extras.execute_values(cursor, sql, data, template="(%s, %s)")
  File "C:\Users\Administrator\Documents\dspace-statistics-api\venv\lib\site-packages\psycopg2\extras.py", line 1270, in execute_values
    cur.execute(b''.join(parts))
  File "C:\Users\Administrator\Documents\dspace-statistics-api\venv\lib\site-packages\psycopg2\extras.py", line 146, in execute
    return super().execute(query, vars)
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for uuid: "90-unmigrated"
LINE 1: ...9),('dbd382f1-962d-410f-bb7c-89ac687599d8', 451),('90-unmigr...
                                                             ^

I don't understand why it is complaining about an unmigrated community ID when all the IDs were migrated to UUIDs. I also tried reindexing the Solr statistics but the result is still the same.

I tested this in Python versions 3.7.9 and 3.9.6.

Anyway, I am just curious and interested in trying this API, although I can't get past this indexing step. I also found out that gunicorn doesn't run on Windows.

alanorth commented 2 years ago

Oh yes, I dealt with this issue of unmigrated IDs a few years ago when we upgraded to DSpace 6. It's a known issue according to the DSpace 6 docs:

If a UUID value cannot be found for a legacy id, the legacy id will be converted to the form "xxxx-unmigrated" where xxxx is the legacy id.

I purged them all like this, for each statistics core if it is sharded:

$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'

That uses a regular expression to match unmigrated IDs: /.*unmigrated.*/
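A complementary option, shown here only as a hedged sketch and not as something the indexer does, is to filter out values that are not valid UUIDs before handing rows to PostgreSQL. Note that the failing value in the traceback above came from the owningComm facet. The helper names is_uuid and split_rows are hypothetical, and the count for "90-unmigrated" is made up because the traceback truncates it.

```python
import uuid


def is_uuid(value) -> bool:
    """Return True if value parses as a UUID, False for things like '90-unmigrated'."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


def split_rows(rows):
    """Split (id, count) rows into those with a valid UUID id and the rest."""
    valid, skipped = [], []
    for row in rows:
        (valid if is_uuid(row[0]) else skipped).append(row)
    return valid, skipped


rows = [
    ("dbd382f1-962d-410f-bb7c-89ac687599d8", 451),
    ("90-unmigrated", 3),  # hypothetical count, the real one is cut off in the traceback
]
valid, skipped = split_rows(rows)
print("to insert:", valid)
print("skipped:", skipped)
```

Filtering like this only hides the unmigrated records from the indexer; purging them from Solr as shown above removes them at the source.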

> I also found out that gunicorn doesn't run on Windows.

Oh! :open_mouth: I haven't used Windows in twenty years so I have no idea. You will have to search for a WSGI server that runs on Windows. Sorry...