silknow / converter

SILKNOW converter that harmonizes all museum metadata records into the common SILKNOW ontology model (based on CIDOC-CRM)
Apache License 2.0

New mappings need to be integrated #56

Closed tschleider closed 3 years ago

tschleider commented 4 years ago

After @mpuren and Pierre reworked some or all of the existing mappings, I still need to implement some of the changes.

rtroncy commented 4 years ago

This seems a very general issue. Is there a way to see the exhaustive list of changes that have been made? We can of course look at the history of each Google document, but this is not very convenient! Ideally, I would suggest listing each museum source with a checkbox in this GitHub issue and describing the changes that need to be dealt with, so we can check the boxes upon commits.

tschleider commented 4 years ago

@mpuren said she marked every change in the paths in red.

rtroncy commented 3 years ago

The phrasing of this issue is still too general in the sense that we don't see immediately what needs to be changed and for which museum.

@tschleider Go through each museum mapping file, and report in this issue when you have validated all changes colored in red in the google docs. Do close the comments in google docs and change the colors as well.

tschleider commented 3 years ago

I went through everything once again and there are certain general topics that need to be addressed (next to some smaller ones):

Full status: https://docs.google.com/spreadsheets/d/1xcugWoTENsk-g7QC4fz4AeLn-VbJNgvZbwxXDTIMmr8/edit?usp=sharing

tschleider commented 3 years ago

Except for that we need to do some crawler updates, which I just communicated to @ehrhart :

rtroncy commented 3 years ago

A new crawl was done a while ago. What is the status now?

ehrhart commented 3 years ago

IMATEX has been recrawled a while ago.

VAM still has to be updated (https://github.com/silknow/crawler/issues/36) and re-crawled. I'll keep you posted.

ehrhart commented 3 years ago

After verification the crawler for VAM has already been updated and the dump from July contains the "Subject depicted". The depicted subject is split into multiple fields on VAM's API, based on the type of subject. See https://github.com/silknow/crawler/issues/36 for the list of fields.
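Since the depicted subject arrives as several fields rather than one, a converter has to gather them into a single list and still accept older records that have none of them. A minimal sketch of that idea follows. The field names in `SUBJECT_FIELDS` are hypothetical placeholders, not the real VAM API names (those are listed in the crawler issue linked above), and `collect_subjects` is an illustrative helper, not code from the converter.

```python
from typing import Dict, List, Sequence

# Hypothetical field names, for illustration only. The real list of
# VAM fields is given in silknow/crawler issue 36.
SUBJECT_FIELDS: Sequence[str] = ("subject_people", "subject_places", "subject_events")


def collect_subjects(record: Dict[str, object],
                     fields: Sequence[str] = SUBJECT_FIELDS) -> List[str]:
    """Gather depicted subjects from several per-type fields into one list.

    Records from older dumps simply lack these fields and yield [].
    """
    subjects: List[str] = []
    for field in fields:
        value = record.get(field) or []
        # Accept either a single string or a list of strings.
        if isinstance(value, str):
            value = [value]
        for item in value:
            text = str(item).strip()
            # Skip blanks and keep the first occurrence of each subject.
            if text and text not in subjects:
                subjects.append(text)
    return subjects
```

Returning an empty list for old-format records is what lets one code path serve both dump formats.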

tschleider commented 3 years ago

Thanks @ehrhart ! I'll follow up by finishing the converter and will close the issue afterwards

tschleider commented 3 years ago

Ok I updated the converter and it can handle the new format now (which concludes this issue itself).

However, @ehrhart , it seems that the newer dump is much smaller than the older dumps on ownCloud. I therefore mixed them and kept the converter compatible with both. Is there a reason that the newer dump is so much smaller than before?

EDIT: Just in light of @ehrhart 's new input, here is the number of records in the VAM dump from December 2020 (the most recent one up until now): 7747 records.

ehrhart commented 3 years ago

@tschleider I updated the VAM crawler with an additional fix, but I just wanted to note that the number of results is still lower than what we used to have previously. However, I don't know if it was due to incorrect crawling in the past, or that VAM changed something in their dataset/API, or both.

I am crawling using 3 methods as described by the logbook, for a total of 943 results (some methods return similar results which is why we only have 943 results instead of 1155):

I will share the new dumps soon if that sounds okay.
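The gap between the two figures (1155 - 943 = 212) is what a union over record identifiers produces when several crawl methods return the same objects. A minimal sketch of that bookkeeping is below. The function name, the method labels and the toy identifiers are all assumptions for illustration; the real crawler may count results differently.

```python
from typing import Dict, Iterable, Set


def summarize_crawl(results_by_method: Dict[str, Iterable[str]]) -> Dict[str, int]:
    """Compare the summed per-method result counts with the unique total."""
    unique: Set[str] = set()
    raw_total = 0
    for ids in results_by_method.values():
        id_list = list(ids)
        raw_total += len(id_list)   # every hit counts, duplicates included
        unique.update(id_list)      # the set keeps each record ID once
    return {
        "raw_total": raw_total,
        "unique": len(unique),
        "overlap": raw_total - len(unique),
    }


# Toy data with made-up identifiers and method labels.
example = {
    "method_a": ["O1", "O2", "O3"],
    "method_b": ["O2", "O3", "O4"],
    "method_c": ["O4", "O5"],
}
print(summarize_crawl(example))
```

With the figures from this comment, `raw_total` would be 1155 and `unique` would be 943.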

ehrhart commented 3 years ago

VAM Records: vam_records_20211110_5.tar.gz VAM Files: vam_files_20211110_5.tar.gz

tschleider commented 3 years ago

Thanks for the new dump and the explanation, @ehrhart . It makes sense how you download the objects, but this is still a huge reduction of data in the KG (if I simply replaced the old dump of ca. 7700 files with the new one of fewer than 1000 files, it would be a reduction of almost 7000 files).

What do you think we should do @rtroncy ?

rtroncy commented 3 years ago

I would keep the oldest VAM dump. This issue is now old and I lost track. Why did we want to re-crawl it in the first place? @tschleider Can you summarize the current issue?

tschleider commented 3 years ago

@rtroncy Their website got updated and they added a field for "Subject depicted", which would of course be highly relevant as we don't have a lot of (English) data for subject depiction. But now the new harvest (@ehrhart explains how he does it above) yields far fewer objects.

I would also rather keep the old dump for now, but I could "mix" the files (so if an object has a newer download including "subject depicted", it will be used instead of its old version; otherwise the old dump is kept).

What do you think?
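A minimal sketch of the "mix" described above, assuming both dumps have already been loaded into dictionaries keyed by object ID. The function name `merge_dumps` and the field name `subjects_depicted` are hypothetical. Two choices here go beyond what the thread states: objects that appear only in the new dump are added, and a new record without the subject field does not overwrite its old version.

```python
from typing import Dict

Record = Dict[str, object]


def merge_dumps(old: Dict[str, Record],
                new: Dict[str, Record],
                subject_field: str = "subjects_depicted") -> Dict[str, Record]:
    """Start from the old dump and prefer new records that carry subjects."""
    merged = dict(old)  # shallow copy, so the old dump is left untouched
    for record_id, record in new.items():
        is_new_object = record_id not in merged
        has_subjects = bool(record.get(subject_field))
        if is_new_object or has_subjects:
            merged[record_id] = record
    return merged


# Toy usage with made-up identifiers.
old_dump = {"O1": {"title": "a"}, "O2": {"title": "b"}}
new_dump = {"O2": {"title": "b", "subjects_depicted": ["flower"]}}
print(sorted(merge_dumps(old_dump, new_dump)))
```

Keeping the old dump as the base means that no object already in the KG is lost because it is absent from the smaller new harvest.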

rtroncy commented 3 years ago

A smart merge of the 2 dumps would be ideal, so a BIG +1, but I don't want this to take too much time. You're the best placed to judge.

tschleider commented 3 years ago

I combined the most recent dump with the last one that was big and re-converted VAM (it's online). I also uploaded this combined dump to OwnCloud: https://silknow.uv.es/owncloud/index.php/apps/files/?dir=/Data&fileid=626898

Everything discussed in this issue has been integrated.