hbz / lobid-organisations

Transformation, web frontend, and API for lobid-organisations
http://lobid.org/organisations
Eclipse Public License 2.0

Implement regular DBS updates #94

Closed acka47 closed 6 years ago

acka47 commented 8 years ago

To be discussed with DBS colleagues. Questions (amongst others):

acka47 commented 7 years ago

I will speak to Ira soon about this. There are different options – short-term and mid-/long-term.

Short-term: Notify DBS regarding changes

Mid- to long-term: move the management of the DBS-Stammdaten (master data) to another system

acka47 commented 7 years ago

I discussed with Ira that @dr0i will get access to the DBS database. If Gerhard is back in September he will instruct @dr0i.

acka47 commented 7 years ago

We will have a meeting with DBS in October where we will probably get access to the database so that we can set up regular exports of DBS data.

acka47 commented 7 years ago

Renamed the issue to set up regular DBS updates. We now have access to the database. @dr0i had some questions regarding the next steps.

The current version of the DBS export that we use for the transformation can be found at http://quaoar1.hbz-nrw.de:7001/assets/data/dbs.zip (see https://github.com/hbz/lobid-organisations/blob/master/README.textile#deployment).

I think we made sure in the past that we want the DBS export to be as close to the original data as possible. Thus, the column names in the CSV should be the same as those in the database. In the past we exported the following information (though we don't use all of it): inr, iso, bib_typ, nam, plz, ort, str, stk_2007, sbi, isil, tvw, tel, fax, ema, url, opa, sta, oef, typ_text, utr_text, gro_text, leitung
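
For illustration only, the header line of such an export would simply list those column names (this is just the list above joined with commas, not an actual dump):

inr,iso,bib_typ,nam,plz,ort,str,stk_2007,sbi,isil,tvw,tel,fax,ema,url,opa,sta,oef,typ_text,utr_text,gro_text,leitung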

fsteeg commented 7 years ago

Note that the file at http://quaoar1.hbz-nrw.de:7001/assets/data/dbs.zip contains manual changes, which ideally should be merged with a new database export.

acka47 commented 7 years ago

Note that the file at http://quaoar1.hbz-nrw.de:7001/assets/data/dbs.zip contains manual changes, which ideally should be merged with a new database export.

I communicated about those and especially other changes with the DBS colleagues. So the DBS source should be much better than what we currently have in dbs.zip. No need to merge.

dr0i commented 6 years ago

@acka47 for your initial test use sol@weywot2. That's our staging environment.

acka47 commented 6 years ago

We get nearly all relevant data with SELECT * FROM dbs.adresse;. I put the column names and two example rows at https://gist.github.com/acka47/67199ec585f1f6adeea3176b0d69dd76.

Filter for active libraries with status_2_id = 1: We only want to import the active libraries into lobid, which means only getting rows with value 1 in status_2_id.
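
In isolation (before the join with dbs.utr_von discussed below), that filter is just:

SELECT * FROM dbs.adresse WHERE status_2_id = 1;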

We will have to adjust the transformation a bit as the column names differ from those in the current dbs.csv file. Here is a comparison old -> new:

Missing field "Unterhaltsträger"

What is missing in dbs.adresse is the information on the Unterhaltsträger (funding body). This is in another table, dbs.utr_von, where a DBS ID is associated with an Unterhaltsträger ID (1-15, see this gist). So @dr0i was right that we have to join different tables, at least in this case.

Additional fields

acka47 commented 6 years ago

I tried to join the tables without any previous SQL experience, and this finally did work:

SELECT dbs.adresse.*, dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr;

acka47 commented 6 years ago

Here is the query with filtering out non-active libraries:

SELECT dbs.adresse.*, dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;

acka47 commented 6 years ago

As we have to remove the newlines in some fields (oeffnungszeiten, name), I adjusted the MySQL query like this:

SELECT 
dbs.adresse.inr, 
dbs.adresse.iso, 
dbs.adresse.typ_id,
REPLACE(REPLACE(dbs.adresse.name, '\r', ' '), '\n', ' ') AS nam,
dbs.adresse.plz,
dbs.adresse.ort,
dbs.adresse.strasse,
dbs.adresse.gvz_id,
dbs.adresse.isigel,
dbs.adresse.tel,
dbs.adresse.fax,
dbs.adresse.email,
dbs.adresse.url,
dbs.adresse.opac,
REPLACE(REPLACE(dbs.adresse.oeffnungszeiten, '\r', ' '), '\n', ' ') AS oeffnungszeiten,
dbs.adresse.bestandsgroessenklasse_id,
dbs.adresse.sortiername,
dbs.adresse.postfach,
dbs.adresse.fachstelle_inr,
dbs.adresse.dbv_id,
dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;

With this, we also only get the fields we need.

acka47 commented 6 years ago

Here is another small update to the query. As we had problems with the column being named name, we just call it nam again:

SELECT 
dbs.adresse.inr, 
dbs.adresse.iso, 
dbs.adresse.typ_id,
REPLACE(REPLACE(dbs.adresse.name, '\r', ' '), '\n', ' ') AS nam,
dbs.adresse.plz,
dbs.adresse.ort,
dbs.adresse.strasse,
dbs.adresse.gvz_id,
dbs.adresse.isigel,
dbs.adresse.tel,
dbs.adresse.fax,
dbs.adresse.email,
dbs.adresse.url,
dbs.adresse.opac,
REPLACE(REPLACE(dbs.adresse.oeffnungszeiten, '\r', ' '), '\n', ' ') AS oeffnungszeiten,
dbs.adresse.bestandsgroessenklasse_id,
dbs.adresse.sortiername,
dbs.adresse.postfach,
dbs.adresse.fachstelle_inr,
dbs.adresse.dbv_id,
dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;

acka47 commented 6 years ago

Here is another small adjustment to the SQL query in response to the problem with the added "email" described in https://github.com/hbz/lobid-organisations/pull/393#issuecomment-346291861:

SELECT 
dbs.adresse.inr, 
dbs.adresse.iso, 
dbs.adresse.typ_id,
REPLACE(REPLACE(dbs.adresse.name, '\r', ' '), '\n', ' ') AS nam,
dbs.adresse.plz,
dbs.adresse.ort,
dbs.adresse.strasse,
dbs.adresse.gvz_id,
dbs.adresse.isigel,
dbs.adresse.fax,
dbs.adresse.url,
dbs.adresse.opac,
REPLACE(REPLACE(dbs.adresse.oeffnungszeiten, '\r', ' '), '\n', ' ') AS oeffnungszeiten,
dbs.adresse.bestandsgroessenklasse_id,
dbs.adresse.sortiername,
dbs.adresse.postfach,
dbs.adresse.fachstelle_inr,
dbs.adresse.dbv_id,
dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;

acka47 commented 6 years ago

Another update to the query was needed, see https://github.com/hbz/lobid-organisations/pull/393#issuecomment-346624698:

SELECT 
dbs.adresse.inr, 
dbs.adresse.iso, 
dbs.adresse.typ_id,
REPLACE(REPLACE(dbs.adresse.name, '\r', ' '), '\n', ' ') AS nam,
dbs.adresse.plz,
dbs.adresse.ort,
dbs.adresse.strasse,
dbs.adresse.gvz_id,
dbs.adresse.isigel AS isil,
dbs.adresse.fax,
dbs.adresse.url,
dbs.adresse.opac,
REPLACE(REPLACE(dbs.adresse.oeffnungszeiten, '\r', ' '), '\n', ' ') AS oeffnungszeiten,
dbs.adresse.bestandsgroessenklasse_id,
dbs.adresse.sortiername,
dbs.adresse.postfach,
dbs.adresse.fachstelle_inr,
dbs.adresse.dbv_id,
dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;

dr0i commented 6 years ago

With the SQL statement you provided it's easy to get the data in any case. @acka47, just give me the passwords etc. offline and I will set up a crontab for regular updates.

acka47 commented 6 years ago

Don't we run regular updates already? Regarding username & password, you already received them in an email from G.S. with timestamp Tue, 17 Oct 2017 09:06:10 +0200.

dr0i commented 6 years ago

No Sir, no regular updates yet. Postponing this until May, until it's clear whether the data conforms to the DSGVO.

dr0i commented 6 years ago

The dbs.csv is exported via SQL with my user on the MySQL server. It is then copied to the proper location on quaoar1 and weywot2. See the crontab on weywot1, as usual. @fsteeg I guess the whole web app must be restarted to use the new data, right? @acka47 How often is this necessary?

fsteeg commented 6 years ago

@fsteeg I guess the whole web app must be restarted to use the new data, right?

Restarting would work, but transformation and indexing can be triggered by a POST too, see: https://github.com/hbz/lobid-organisations#transform

But that should already be set up and working, see: https://github.com/hbz/lobid-organisations/blob/master/cron.sh
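
For orientation, a rough sketch of what such a cron entry might look like; the schedule, host, and endpoint path here are placeholders, not the real values (those live in cron.sh and the README):

# hypothetical crontab line: trigger transformation + indexing via POST each night
30 2 * * * curl -X POST "http://<host>:<port>/<transform-endpoint-from-README>"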

dr0i commented 6 years ago

Ok. I've added a quick note "complete indexing" to the crontab. Is there a way for @acka47 to check if the updated data is indeed indexed into lobid-organisations?

acka47 commented 6 years ago

Is there a way for @acka47 to check if the updated data is indeed indexed into lobid-organisations?

I would have to be aware of recent changes to check this, which I am not.

dr0i commented 6 years ago

The new DBS export was created and copied correctly. It's slightly smaller than the previous one. It no longer uses commas (',') as the separator but TAB. I doubt that the ETL was done correctly because of the changed separator, but I couldn't get any useful information out of the logs. Looking at e.g. http://lobid.org/organisations/DE-14-00 I do see a DBS ID field, though. Assigning @fsteeg for further investigation.

dr0i commented 6 years ago

Also, http://lobid.org/organisations/DBS-BA433, which is one of the new ones, doesn't exist.

fsteeg commented 6 years ago

It no longer uses commas (',') as the separator but TAB.

Why? If we don't switch the format I'd expect the updates to keep working.

dr0i commented 6 years ago

I don't know why it's now TAB. I just executed the SQL from @acka47. Why it has a different structure than the previous dbs.csv I don't know; I didn't produce the old one.

dr0i commented 6 years ago

Ah @fsteeg, I guess I can teach the SQL statement to use a comma as the separator. (edit @dr0i: did this by piping the SQL result through sed).
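
A sketch of the kind of pipeline meant here; host, user, and output path are assumptions, and the SELECT stands in for the full join query above. mysql's non-interactive batch output is TAB-separated by default, which is where the TABs come from, and the sed step simply turns each TAB into a comma:

mysql -h <host> -u <user> -p dbs \
  -e "SELECT * FROM dbs.adresse WHERE status_2_id = 1" \
  | sed 's/\t/,/g' > dbs.csv

As the following comments show, this plain substitution breaks down as soon as a value itself contains a comma.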

dr0i commented 6 years ago

@fsteeg I updated the new CSV to be comma-separated as well and triggered the reindexing via cron. However, I don't see the resource mentioned in https://github.com/hbz/lobid-organisations/issues/94#issuecomment-387035451 and the logs don't give me any hint about what's happening (opened #400).

fsteeg commented 6 years ago

I took a look into this, but it seems something went wrong when creating the file; it is empty:

sol@weywot2:[~/git/lobid-organisations/app/transformation/input] master
$ ls -alh dbs.csv 
-rw-rw-r-- 1 sol sol 0 May 20 01:30 dbs.csv

dr0i commented 6 years ago

Oh, the MySQL login is no longer valid. Informed GSCH.

dr0i commented 6 years ago

Because of the DSGVO it was decided not to allow any updates. Closing.

acka47 commented 6 years ago

Reopening as the status of DBS data access is still unclear.

acka47 commented 6 years ago

It looks like there currently is no DBS data at all in lobid-organisations: https://lobid.org/organisations/search?q=_exists_%3AdbsID+AND+NOT+_exists_%3Aisil

dr0i commented 6 years ago

Transformation fails with (see application.log):

Transformation failed java.lang.IllegalArgumentException: wrong number of columns (expected 18, was 19) in input line:

Comparing the old CSV and the new one: in the old CSV, commas are sometimes part of a value, e.g.

... ,"Mo-Fr.: 8-24, Sa: 9-24 Uhr", ...

and such fields are then enclosed in quotes ("CSV with enclosed fields"). In the new DBS dump these quotes are missing. So my sed command to substitute tabs with commas is not enough; it should be something like sed 's#^#"#g;s#\t#","#g;s#$#"#g'. I don't know if this will work correctly, though, because many fields in the old CSV are without quotes (in the new one every field would then be quoted). Tabs as delimiter (in general: any delimiter that is not used in the values) are so much easier. However, I will write to Therese N. to update the postprocessing regex and see if the transformation then works correctly.
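
To illustrate what that regex does (the input line is made up): each line gets a leading and a trailing quote, and every TAB becomes ",", so a value with an embedded comma stays a single field:

$ printf '123\tStadtbibliothek\tMo-Fr.: 8-24, Sa: 9-24 Uhr\n' | sed 's#^#"#g;s#\t#","#g;s#$#"#g'
"123","Stadtbibliothek","Mo-Fr.: 8-24, Sa: 9-24 Uhr"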

acka47 commented 6 years ago

I don't know whether you interpreted the problem correctly. We actually removed one column (fax) from the SQL query, so it is correct that there is one less. So we must adjust the code there to expect 18 columns.

dr0i commented 6 years ago

Yeah, I also saw that at first as the most plausible source of the error. I admit I had trouble finding where this number is specified, but I believe the first line (the description of the CSV columns) is analyzed. "fax" is now correctly missing there, and, counting the commas, that makes exactly 18 defined columns. My "opening hours" example above shows that the data in fact has 19 columns because the quotes are missing. I also made a quick test by manually correcting a line and indexing it, and that went through without errors.

dr0i commented 6 years ago

Sent Therese this regex: sed 's#^#"#g;s#\t#","#g;s#$#"#g'.

dr0i commented 6 years ago

Update finished. Looks good to me! Assigning @acka47 for review.

acka47 commented 6 years ago

+1 Did you already implement an automatic update procedure for DBS data?

acka47 commented 6 years ago

As discussed offline, @dr0i already implemented automatic updates. We will have to check whether this works next week.

dr0i commented 6 years ago

Seems to work, at least:

Closing.