pkiraly / qa-catalogue

QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
GNU General Public License v3.0

task validation-sqlite should require locally installed solr #271

Open nichtich opened 1 year ago

nichtich commented 1 year ago

In my error logs I found that task validation-sqlite also uses Solr. Until now I thought the processing steps were clearly separated:

  1. analyses: MARC/PICA => CSV/SQLite
  2. solr: CSV/SQLite => Solr index

A clear separation between these two steps would be welcome to better control updates.
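To make the requested separation concrete, here is a minimal stdlib sketch of the two steps as described above; the record data, error ID and the dict standing in for Solr are all hypothetical, not qa-catalogue's actual implementation:

```python
import csv
import io

# Step 1 (analyses): validate raw records and emit CSV -- no Solr involved.
records = [
    {"id": "rec1", "leader": "00000nam"},
    {"id": "rec2", "leader": ""},  # missing leader -> validation error
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["recordId", "errorId"])
for rec in records:
    if not rec["leader"]:
        writer.writerow([rec["id"], "missing-leader"])

# Step 2 (solr): a separate process reads the CSV and pushes it to the
# index; here a plain dict stands in for the Solr index.
buf.seek(0)
index = {}
for row in csv.DictReader(buf):
    index.setdefault(row["recordId"], []).append(row["errorId"])

print(index)  # {'rec2': ['missing-leader']}
```

With this separation, step 2 is the only place that touches the index, so analyses can be re-run without a Solr server.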

pkiraly commented 1 year ago

Sorry for the inconvenience. You are right, this step involves Solr, but not the main index. The tool creates a special index (a Solr core) containing the record ID, group IDs and error IDs. Group IDs are the IDs of the individual libraries, and error IDs are the IDs of individual errors. The errors have 4 levels: category, type, error and location.

For the issues tab I had to find an aggregation method to calculate the numbers belonging to these levels and to the groups. It turned out that a simple SQL-based solution works well up to a few million records, but it does not work for 71 million. I tried to improve the SQL, I tried whether MySQL performs better than SQLite, I also tried to solve it in R, and finally it was Solr that worked.

The name of the sqlite step is now misleading; it should be renamed to "aggregation", "persisting" or "postprocessing". This step also helps a lot in searching for errors and in downloading the record IDs belonging to a particular error level.
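For illustration, this is roughly the kind of SQL-based aggregation that, per the comment above, works at small scale but broke down at 71 million records; the table and column names are invented for the sketch, not qa-catalogue's real schema:

```python
import sqlite3

# Hypothetical schema: one row per observed (record, group, error) triple.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE issue_details (
    record_id TEXT, group_id TEXT,
    category TEXT, type TEXT, error_id TEXT)""")
con.executemany(
    "INSERT INTO issue_details VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", "libA", "datafield", "undefined", "e1"),
        ("r2", "libA", "datafield", "undefined", "e1"),
        ("r2", "libB", "control", "invalid", "e2"),
    ],
)

# The aggregation: record and instance counts per group and error level.
rows = con.execute("""
    SELECT group_id, category, type, error_id,
           COUNT(DISTINCT record_id) AS records,
           COUNT(*) AS instances
    FROM issue_details
    GROUP BY group_id, category, type, error_id
""").fetchall()
for row in rows:
    print(row)
```

The `COUNT(DISTINCT record_id)` over a huge detail table is the expensive part; Solr's faceting does the equivalent counting against an inverted index, which is why it scaled where plain SQL did not.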

nichtich commented 1 year ago

The analysis task should rather create an (in-memory or temporary file) Lucene index, so that no additional service needs to be running to do the analysis.
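The point of the suggestion is that the index would live and die inside the analysis process. A toy stdlib sketch of that idea, with a dict-based inverted index standing in for an embedded Lucene index (the data is invented):

```python
from collections import defaultdict

# Temporary in-process inverted index: error ID -> set of record IDs.
# It is built, queried and discarded within the analysis run,
# with no server to install or keep running.
postings = defaultdict(set)

observations = [
    ("rec1", "e1"), ("rec2", "e1"), ("rec2", "e2"),
]
for record_id, error_id in observations:
    postings[error_id].add(record_id)

# "Facet" query: how many records per error, and which ones.
counts = {error: len(ids) for error, ids in postings.items()}
print(counts)
print(sorted(postings["e1"]))
```

An embedded Lucene index would play the same role but spill to a temporary directory when the data does not fit into memory.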

pkiraly commented 1 year ago

The problem is that the data does not necessarily fit into memory. Lucene would be an option, but I prefer Solr, because the real Solr indexing later reuses the validation Solr index: the values stored there are merged into the main index. What drawback do you see with the current solution?
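As I read it, the reuse described above amounts to copying per-record fields from the validation index into the corresponding main-index documents. A rough sketch, where dicts stand in for Solr documents and the field names (`groupId`, `errorId`) are assumptions for illustration:

```python
# Main index documents keyed by record ID.
main_index = {
    "rec1": {"id": "rec1", "title": "Example title"},
    "rec2": {"id": "rec2", "title": "Another title"},
}

# Validation index built during the validation step.
validation_index = {
    "rec2": {"groupId": ["libA"], "errorId": ["e1", "e2"]},
}

# The later, full indexing step merges the validation fields into the
# main documents instead of recomputing the validation results.
for record_id, fields in validation_index.items():
    if record_id in main_index:
        main_index[record_id].update(fields)

print(main_index["rec2"])
```

A throwaway in-process Lucene index would make this reuse impossible, which seems to be the trade-off at stake in the thread.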

nichtich commented 1 year ago

Sorry, I don't fully understand: does qa-catalogue write anything temporarily into Solr, or is grouping with Solr based on data that is written to Solr anyway? This seems to be related to #210 and #249: which task/command requires which input files and/or Solr index/records, and which files and/or Solr records does it write?

I created https://github.com/pkiraly/qa-catalogue/wiki/Data-flow to start documenting this.