Closed. acka47 closed this issue 6 years ago.
I will speak to Ira soon about this. There are different options – short-term and mid-/long-term.
I discussed with Ira that @dr0i will get access to the DBS database. If Gerhard is back in September he will instruct @dr0i.
We will have a meeting with DBS in October where we will probably get access to the database so that we can set up regular exports of DBS data.
Renamed the issue to set up regular DBS updates. We now have access to the database. @dr0i had some questions regarding the next steps.
The current version of DBS export we use for transformation can be found at http://quaoar1.hbz-nrw.de:7001/assets/data/dbs.zip (see https://github.com/hbz/lobid-organisations/blob/master/README.textile#deployment).
I think we made sure in the past that we want the DBS export to be as near to the original data as possible. Thus, the column names in the csv should be the same as those in the database. In the past we exported the following information (though we don't use all of it): inr, iso, bib_typ, nam, plz, ort, str, stk_2007, sbi, isil, tvw, tel, fax, ema, url, opa, sta, oef, typ_text, utr_text, gro_text, leitung
Note that the file at http://quaoar1.hbz-nrw.de:7001/assets/data/dbs.zip contains manual changes, which ideally should be merged with a new database export.
I discussed these and especially other changes with the DBS colleagues. So the DBS source should be much better than what we currently have in dbs.zip. No need to merge.
@acka47 for your initial test use sol@weywot2. That's our staging environment.
We get nearly all relevant data with SELECT * FROM dbs.adresse;. I put the column names and two example rows at https://gist.github.com/acka47/67199ec585f1f6adeea3176b0d69dd76.
Filter for active libraries with status_2_id = 1: we only want to import the active libraries to lobid, which means only taking rows with value 1 in status_2_id.
We will have to adjust the transformation a bit as column names differ from the current dbs.csv file. Here is a comparison old -> new:
inr -> inr
iso -> iso
bib_typ -> fb
nam -> name
plz -> plz
ort -> ort
str -> strasse
stk_2007 -> gvz_id
sbi -> sigel
isil -> isigel
tvw -> vorwahl (with different syntax, e.g. "(0221)" instead of "-221")
tel -> tel
fax -> fax
ema -> email
url -> url
opa -> opac
sta -> ?? (Not sure, but I think we don't use this information anyway.)
oef -> oeffnungszeiten
typ_text -> the string is in another table, but I think we use typ_id (~ bib_typ/fb) anyway
gro_text -> bestandsgroessenklasse_id (We will have to adjust the mapping here as we will get IDs for the Bestandsgrößenklasse ("1" to "11") instead of textual values like "100.001 - 300.000", see the csv at https://gist.github.com/acka47/f6a248a46249367becc681774d9d0b14)
leitung -> leitung
Missing field "Unterhaltsträger"
What is missing in dbs.adresse is information on the Unterhaltsträger. This is in another table, dbs.utr_von, where a DBS ID is associated with an Unterhaltsträger ID (1-15, see this gist). So, @dr0i was right that we have to join different tables, at least in this case.
Additional fields
sortiername as alternateName
postfach
fachstelle_inr to add a link to the responsible Fachstelle
dbv_id
I tried to join the tables without any previous SQL experience, and finally this worked:
SELECT dbs.adresse.*, dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr;
Here is the query with filtering out non-active libraries:
SELECT dbs.adresse.*, dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;
As we have to remove the newlines within fields (oeffnungszeiten, name), I adjusted the MySQL query like this:
SELECT
dbs.adresse.inr,
dbs.adresse.iso,
dbs.adresse.typ_id,
REPLACE(REPLACE(dbs.adresse.name, '\r', ' '), '\n', ' ') AS nam,
dbs.adresse.plz,
dbs.adresse.ort,
dbs.adresse.strasse,
dbs.adresse.gvz_id,
dbs.adresse.isigel,
dbs.adresse.tel,
dbs.adresse.fax,
dbs.adresse.email,
dbs.adresse.url,
dbs.adresse.opac,
REPLACE(REPLACE(dbs.adresse.oeffnungszeiten, '\r', ' '), '\n', ' ') AS oeffnungszeiten,
dbs.adresse.bestandsgroessenklasse_id,
dbs.adresse.sortiername,
dbs.adresse.postfach,
dbs.adresse.fachstelle_inr,
dbs.adresse.dbv_id,
dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;
With this, we also only get the fields we need.
Here is another small update to the query. As we had problems with the column being named name, we just call it nam again:
SELECT
dbs.adresse.inr,
dbs.adresse.iso,
dbs.adresse.typ_id,
REPLACE(REPLACE(dbs.adresse.name, '\r', ' '), '\n', ' ') AS nam,
dbs.adresse.plz,
dbs.adresse.ort,
dbs.adresse.strasse,
dbs.adresse.gvz_id,
dbs.adresse.isigel,
dbs.adresse.tel,
dbs.adresse.fax,
dbs.adresse.email,
dbs.adresse.url,
dbs.adresse.opac,
REPLACE(REPLACE(dbs.adresse.oeffnungszeiten, '\r', ' '), '\n', ' ') AS oeffnungszeiten,
dbs.adresse.bestandsgroessenklasse_id,
dbs.adresse.sortiername,
dbs.adresse.postfach,
dbs.adresse.fachstelle_inr,
dbs.adresse.dbv_id,
dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;
Here is another small adjustment to the SQL query in response to the problem with the added "email" described in https://github.com/hbz/lobid-organisations/pull/393#issuecomment-346291861:
SELECT
dbs.adresse.inr,
dbs.adresse.iso,
dbs.adresse.typ_id,
REPLACE(REPLACE(dbs.adresse.name, '\r', ' '), '\n', ' ') AS nam,
dbs.adresse.plz,
dbs.adresse.ort,
dbs.adresse.strasse,
dbs.adresse.gvz_id,
dbs.adresse.isigel,
dbs.adresse.fax,
dbs.adresse.url,
dbs.adresse.opac,
REPLACE(REPLACE(dbs.adresse.oeffnungszeiten, '\r', ' '), '\n', ' ') AS oeffnungszeiten,
dbs.adresse.bestandsgroessenklasse_id,
dbs.adresse.sortiername,
dbs.adresse.postfach,
dbs.adresse.fachstelle_inr,
dbs.adresse.dbv_id,
dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;
Another update to the query was needed, see https://github.com/hbz/lobid-organisations/pull/393#issuecomment-346624698:
SELECT
dbs.adresse.inr,
dbs.adresse.iso,
dbs.adresse.typ_id,
REPLACE(REPLACE(dbs.adresse.name, '\r', ' '), '\n', ' ') AS nam,
dbs.adresse.plz,
dbs.adresse.ort,
dbs.adresse.strasse,
dbs.adresse.gvz_id,
dbs.adresse.isigel AS isil,
dbs.adresse.fax,
dbs.adresse.url,
dbs.adresse.opac,
REPLACE(REPLACE(dbs.adresse.oeffnungszeiten, '\r', ' '), '\n', ' ') AS oeffnungszeiten,
dbs.adresse.bestandsgroessenklasse_id,
dbs.adresse.sortiername,
dbs.adresse.postfach,
dbs.adresse.fachstelle_inr,
dbs.adresse.dbv_id,
dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;
With the SQL statement you provided it's easy to get the data in any case. @acka47, just give me the passwords etc. offline and I will set up a regular update via crontab.
Don't we run regular updates already? Regarding user name & password, you already received them in an email from G.S. with time stamp Tue, 17 Oct 2017 09:06:10 +0200
No sir, no regular updates yet. Postponing this until May, until it's clear whether the data conforms to the DSGVO.
The dbs.csv is exported via SQL with my user on the MySQL server. It is then copied to the proper locations on quaoar1 and weywot2. See the crontab on weywot1, as usual.
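For reference, a hypothetical sketch of such a crontab entry; the schedule, host name, user, query file and target paths are all placeholders, not the actual setup on weywot1:

```shell
# Hypothetical crontab sketch; schedule, credentials, query file and target
# paths are placeholders, not the actual configuration.
0 2 1 * * mysql -h mysql-server -u dbs_user -B < "$HOME/dbs-export.sql" > "$HOME/dbs.csv" && scp "$HOME/dbs.csv" quaoar1:/path/to/assets/data/ && scp "$HOME/dbs.csv" weywot2:/path/to/assets/data/
```

The -B (batch) flag makes the mysql client print its result rows tab-separated to stdout.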
@fsteeg I guess the whole web app must be restarted to use the new data, right? @acka47 How often is this necessary?
@fsteeg I guess the whole web app must be restarted to use the new data, right?
Restarting would work, but transformation and indexing can be triggered by a POST too, see: https://github.com/hbz/lobid-organisations#transform
But that should already be set up and working, see: https://github.com/hbz/lobid-organisations/blob/master/cron.sh
Ok. I've added a quick note "complete indexing" to the crontab. Is there a way for @acka47 to check if the updated data is indeed indexed into lobid-organisations?
Is there a way for @acka47 to check if the updated data is indeed indexed into lobid-organisations?
I would have to be aware of recent changes to check this, which I am not.
The new DBS export was created and copied correctly. It's slightly smaller than the one before. It no longer uses commas (',') as separator but tabs. I doubt that the ETL was done correctly because of the changed separator, but I couldn't get any useful information out of the logs. Looking at e.g. http://lobid.org/organisations/DE-14-00 I do see a DBS ID field, though. Assigning @fsteeg for further investigation.
Also, http://lobid.org/organisations/DBS-BA433, which is one of the new ones, doesn't exist.
It no longer uses commas (',') as separator but tabs.
Why? If we don't switch the format I'd expect the updates to keep working.
Don't know why it's now tab-separated; I just executed the SQL from @acka47. Why it has a different structure than the previous dbs.csv I don't know; I haven't produced the old one.
Ah @fsteeg, I guess I can teach the SQL statement to use a comma as separator.
(edit @dr0i: did this by piping the SQL result through sed.)
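For illustration, the tab-to-comma post-processing can be sketched like this; the real pipeline feeds the tab-separated output of the mysql client through sed, and the sample line here is made up:

```shell
# Sample tab-separated line standing in for one row of mysql batch output;
# the sed expression replaces every tab with a comma.
printf '1\tStadtbibliothek Koeln\t50667\n' | sed 's#\t#,#g'
```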
@fsteeg I updated the new csv to also be comma-separated and triggered the reindexing via cron. However, I don't see the resource mentioned in https://github.com/hbz/lobid-organisations/issues/94#issuecomment-387035451 and the logs don't give me any hint as to what's happening (opened #400).
I took a look into this, but it seems something goes wrong when creating the file; it is empty:
sol@weywot2:[~/git/lobid-organisations/app/transformation/input] master
$ ls -alh dbs.csv
-rw-rw-r-- 1 sol sol 0 May 20 01:30 dbs.csv
Oh, the MySQL login is no longer valid. Informed GSCH.
Because of DSGVO it was decided to not allow any updates. Closing.
Reopening as the status of DBS data access is still unclear.
It looks like there currently is no DBS data at all in lobid-organisations: https://lobid.org/organisations/search?q=_exists_%3AdbsID+AND+NOT+_exists_%3Aisil
Transformation fails with (see application.log):
Transformation failed java.lang.IllegalArgumentException: wrong number of columns (expected 18, was 19) in input line:
Comparing the old csv and the new one: in the old csv, commas are sometimes part of a value, e.g.
... ,"Mo-Fr.: 8-24, Sa: 9-24 Uhr", ...
and such fields are then enclosed in quotes ("csv with enclosed fields"). In the new DBS dump these quotes are missing.
So, my sed command for substituting tabs with commas is not enough; it should be something like
sed 's#^#"#g;s#\t#","#g;s#$#"#g'
Don't know if this will work correctly, though, because many fields in the old csv have no quotes (in the new one, every field would then be quoted).
Tabs as delimiter (in general: a delimiter that's not used in the values) are so much easier. However, I will write to Therese N. to update the postprocessing regex and see if the transformation works correctly.
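To illustrate the quoting variant on a made-up line containing a comma inside a value:

```shell
# Wrap every field in double quotes so that commas inside values (like the
# opening hours) no longer shift columns; the input line is a made-up example.
printf 'Mo-Fr.: 8-24, Sa: 9-24 Uhr\t12345\n' | sed 's#^#"#g;s#\t#","#g;s#$#"#g'
```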
I don't know whether you interpreted the problem correctly. We actually removed one column (fax) from the SQL query, so it is correct that there is one column less. We must adjust the code there to expect 18 columns.
Yeah, I too saw that at first as the most plausible source of the error. I admit I had trouble finding where this number is specified, but I believe the first line (the description of the csv columns) is analyzed. There, "fax" is now correctly missing and, counting the commas, this makes exactly 18 defined columns. However, my "opening hours" example shows that the data in fact has 19 columns because the quotes are missing. I also made a quick test by manually correcting a line and indexing it, and that went through without errors.
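The 18-vs-19 column mismatch can be reproduced on made-up sample data: a naive comma split (no quote handling) yields an extra column as soon as a value contains a comma. A quick check with awk, expecting 3 columns here:

```shell
# Print line number and column count for every line that does not have the
# expected 3 columns; -F',' splits naively on commas, so the quoted value
# containing a comma in line 3 inflates the count. Sample data is made up.
printf '%s\n' 'inr,nam,oeffnungszeiten' '1,Lib A,Mo-Fr 8-24' '2,Lib B,"Mo-Fr: 8-24, Sa: 9-24"' \
  | awk -F',' 'NF != 3 { print NR, NF }'
```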
Sent Therese this regex: sed 's#^#"#g;s#\t#","#g;s#$#"#g'
Update finished. Looks good to me! Assigning @acka47 for review.
+1. Did you already implement an automatic update procedure for the DBS data?
As discussed offline, @dr0i already implemented automatic updates. We will have to check whether this works next week.
Seems to work, at least:
Closing.
To be discussed with DBS colleagues. Questions (amongst others):