LauraErhard closed this issue 7 years ago
Thanks for reporting the issue. The bold emphasis of the field should not be a problem. Splitting into family name and given name is indeed a good idea. Linking to authority files would also be a great feature. For now we only store raw text values without any links. Implementing the linking is probably out of scope for the workshop milestone, since it would also require searching all authors already stored in the LOCDB system, in the same way we currently do with resources.
Regarding proper capitalization, it would probably be cleaner if it happened before the extracted data is passed to the front-end: either directly in the OCR component or in the back-end. What do you think, @anlausch?
Just a note that this is probably more complicated. For example, consider the author DENYS DE LA PATELLIÈRE, or Spanish authors, who can have several first names and several last names. On the other hand, LIGO Scientific Collaboration appears as the author of several papers.
Thus, I suggest aiming for a good automatic heuristic for lower/uppercase. The remaining errors can be corrected manually. But I wouldn't do anything about splitting into first and last name. Ideally we can link to the correct publications (via CrossRef, OLC-Contents, our own DB, OpenCitations, WikiCite), where the metadata is hopefully entered correctly.
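To make the idea of a lower/uppercase heuristic more concrete, here is a minimal Python sketch. It is only an illustration, not LOCDB code: the function name `normalize_author_case` and the `PARTICLES` list are hypothetical, and the rule itself (only rewrite strings that are entirely uppercase, keep a few name particles lowercase) is an assumption about what "good enough" could look like.

```python
# Hypothetical list of name particles to keep lowercase; an assumption for
# illustration, not taken from the LOCDB code base.
PARTICLES = {"de", "la", "del", "van", "von", "der", "den", "di", "da", "le", "du"}


def normalize_author_case(name: str) -> str:
    """Title-case an all-uppercase author string; return other input unchanged."""
    letters = [ch for ch in name if ch.isalpha()]
    # Only rewrite strings that are entirely uppercase. Mixed-case input such
    # as "LIGO Scientific Collaboration" is assumed to be intentional.
    if not letters or not all(ch.isupper() for ch in letters):
        return name

    result = []
    for index, word in enumerate(name.split()):
        lower = word.lower()
        if index > 0 and lower in PARTICLES:
            # Keep particles lowercase unless they start the name.
            result.append(lower)
        else:
            # Capitalize each hyphen-separated part ("JEAN-PAUL" -> "Jean-Paul").
            result.append("-".join(part.capitalize() for part in word.split("-")))
    return " ".join(result)


print(normalize_author_case("DENYS DE LA PATELLIÈRE"))
print(normalize_author_case("LIGO Scientific Collaboration"))
```

With this sketch, the all-caps example from this thread becomes `Denys de la Patellière`, and the collaboration name is left as it is. Whether the particle casing matches the preferred spelling is exactly the kind of remaining error that would be corrected manually; an acronym given on its own in all caps, or a name like `O'BRIEN`, would also come out wrong.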
That's a good point, @zuphilip. Plus, searching in the internal/external data sources is typically not case-sensitive. Thus, when there is a matching resource, it should not be a problem. When we create a new resource from OCR data, it is probably okay to let the librarian adjust the spelling/capitalization. If heuristics are still desired, the OCR component is probably the right spot.
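As a rough illustration of why capitalization matters little for matching, the following Python sketch compares author strings case-insensitively. The helper names `match_key` and `find_matches` are hypothetical, and the in-memory list stands in for whatever internal or external data source is actually queried; real search back-ends handle this in their own way.

```python
import unicodedata


def match_key(name: str) -> str:
    """Build a comparison key that ignores case, accent encoding and extra spaces."""
    # casefold() is a more aggressive lower() intended for caseless matching;
    # NFC makes composed and decomposed accents compare equal.
    folded = unicodedata.normalize("NFC", name.casefold())
    return " ".join(folded.split())


def find_matches(ocr_author: str, stored_authors: list[str]) -> list[str]:
    """Return the stored author strings that equal the OCR string, ignoring case."""
    key = match_key(ocr_author)
    return [author for author in stored_authors if match_key(author) == key]


# Stand-in for records already present in a data source (illustrative only).
stored = ["Denys de La Patellière", "LIGO Scientific Collaboration"]
print(find_matches("DENYS DE LA PATELLIÈRE", stored))
```

Here the all-caps OCR string still finds the stored record, so the stored spelling can be reused instead of the OCR one.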
Author disambiguation is another big topic which might be out of project scope.