internetarchive / openlibrary-librarians

Coordination between the OpenLibrary.org Librarian community

Works without assigned language #68

Open bicolino34 opened 2 years ago

bicolino34 commented 2 years ago

There are 89 thousand books without a language: https://openlibrary.org/search?q=language:und
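For anyone who wants to re-check that number without paging through results, here is a minimal sketch that asks the search endpoint for only the total count. It assumes the JSON variant of the search page lives at `search.json` and reports the total in a `numFound` field; both should be confirmed against the Open Library API documentation. The helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed JSON counterpart of https://openlibrary.org/search
SEARCH_ENDPOINT = "https://openlibrary.org/search.json"


def build_search_url(query: str, limit: int = 0) -> str:
    """Build a search URL; limit=0 asks for the count without any documents."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query, 'limit': limit})}"


def parse_count(payload: dict) -> int:
    """Pull the total hit count out of a response (field name is an assumption)."""
    return int(payload["numFound"])


def count_matches(query: str) -> int:
    """Fetch the count over the network; only runs when called explicitly."""
    with urlopen(build_search_url(query), timeout=30) as response:
        return parse_count(json.load(response))


if __name__ == "__main__":
    # Same query as the link above.
    print(count_matches("language:und"))
```

The count will drift over time as records are edited, so the result is not expected to match the figure quoted above exactly.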

BrittanyBunk commented 2 years ago

@bicolino34 I understand it's difficult to see it like this, but unfortunately there's not much that can be done about a lot of these books. First, other steps, like finding duplicates, should come before adding a language. Otherwise you can end up adding languages to records that shouldn't exist on the site.

Another issue is that, without actually seeing the book, it's too difficult to know which language it's written in.

On top of that, if it's written in a language other than English, it's hard to work out which one. If the text is only a picture, I can't translate it.

To me, this is not a priority for 'not in library' books, at least until other issues are fixed.

However, we could do some work on the 'in library' ones, though I've posted about a problem with some of them below.

BrittanyBunk commented 2 years ago

@bicolino34 there is another issue that I'd like help with more, and that is #52, if you'd like to help me there. I really, really, truly need the help. The person who said they'd help didn't do any of it, and I need that info to find my lost textbooks from when I was in school. Plus, these books aren't even on Open Library. Adding them will be a huge plus for people looking for hard-to-find books that they probably grew up with too. Then we can add the languages to them there, as they're all in English. What do you say?

BrittanyBunk commented 2 years ago

@seabelis I just checked the list and there are a lot of old books in English. What's the difference between Old English and English, in terms of dates and what they look like? Are we able to add a description to OL to help people distinguish between them?

BrittanyBunk commented 2 years ago

@bicolino34 Somehow I feel this is a bot issue, not a librarian issue. Maybe we can move this to the Open Library bots GitHub repo? The reason is that the bot shouldn't put 'undetermined' as a language at all. That's not a real language.

tfmorris commented 1 year ago

Technically these are coded with the "language" of Undetermined. Their number is dwarfed by those with no language declaration. Current counts are:

I got these numbers for an analysis I did comparing language identification software predictions for titles against the values encoded in their metadata. This could potentially be used to suggest languages (although there are a bunch of nuances/issues that I'll cover when I publish the data).
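To make the "suggest languages" idea concrete, here is a rough sketch of what such a step could look like. It is not the analysis described above: the `Record` fields, the confidence score, and the 0.9 threshold are all hypothetical, and the sample rows are invented purely for illustration. It only proposes a language for records that have no language declaration or are coded `und`, and only when the (assumed) identifier is confident.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

UNDETERMINED = "und"  # code used by the search query earlier in this thread


@dataclass
class Record:
    key: str                      # hypothetical record identifier
    title: str
    metadata_lang: Optional[str]  # None means no language declaration at all
    predicted_lang: str           # hypothetical output of a language identifier
    confidence: float             # hypothetical score between 0 and 1


def needs_language(record: Record) -> bool:
    """True when the metadata has no language or only 'Undetermined'."""
    return record.metadata_lang is None or record.metadata_lang == UNDETERMINED


def suggest_languages(records: Iterable[Record], min_confidence: float = 0.9) -> dict:
    """Map record key to suggested language for confident predictions only."""
    return {
        r.key: r.predicted_lang
        for r in records
        if needs_language(r) and r.confidence >= min_confidence
    }


if __name__ == "__main__":
    # Invented sample rows, not real Open Library data.
    sample = [
        Record("rec-1", "Les Misérables", None, "fre", 0.97),
        Record("rec-2", "Poems", UNDETERMINED, "eng", 0.55),
        Record("rec-3", "Der Prozess", "ger", "ger", 0.99),
    ]
    print(suggest_languages(sample))
```

Suggestions like these would still need a human or a second check before being written back, since short or generic titles give a language identifier very little to work with.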