internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Import DOAB (Directory of Open Access Books) into openlibrary #1401

Closed mekarpeles closed 3 years ago

mekarpeles commented 6 years ago

@bnewbold suggested we add DOAB https://www.doabooks.org/ open access books to Open Library. Seems like a good job for https://github.com/internetarchive/openlibrary-client!

Anyone interested in trying to add these?

bnewbold commented 6 years ago

OAPEN is a related/joint project that has some additional books, as well as metadata in MARC/ONYX format if that's helpful for import: http://www.oapen.org/content/metadata

SohanTirpude commented 6 years ago

@mekarpeles I would like to take this. Can you guide me on this?

hornc commented 5 years ago

Hi @ChangezKhan, are you still interested in helping with this task? I am in the process of improving OL's import process in general and would like to enable more volunteers to begin with imports.

xayhewalo commented 4 years ago

It sounds like we want to add more data validation, fix search, and fix the cleanup bot before we start importing more data. I'm assigning @hornc per Slack discussions for the time being.

hornc commented 4 years ago

OAPEN bulk MARC records are located at https://archive.org/download/marc_oapen, ready to be imported using the /api/import/ia endpoint.

hornc commented 4 years ago

These records have links to online materials in field 856. For example, from https://openlibrary.org/show-records/marc_oapen/oapen.marc.mrc:3954:1260

    856 40 $uhttp://www.oapen.org/download?type=document&docid=1006695$zAccess full text online

Here is an example imported edition: https://openlibrary.org/books/OL28324519M.json

The links were imported from 856$u, but no titles from 856$z. The links do not show in the current books UI.

The importer should import the link title when available.

hornc commented 4 years ago

10 test items imported:

FILENAME: oapen.marc.mrc
marc_oapen/oapen.marc.mrc:0:5: 200 -- {'success': True, 'next_record_length': 1977, 'work': {'status': 'matched', 'key': '/works/OL20909504W'}, 'edition': {'status': 'matched', 'key': '/books/OL28324519M'}, 'next_record_offset': 1977}
marc_oapen/oapen.marc.mrc:1977:1977: 200 -- {'success': True, 'next_record_length': 1260, 'next_record_offset': 3954, 'authors': [{'status': 'matched', 'name': 'Arnd Reitemeier', 'key': '/authors/OL1511292A'}], 'work': {'status': 'created', 'key': '/works/OL20909526W'}, 'edition': {'status': 'created', 'key': '/books/OL28324541M'}}
marc_oapen/oapen.marc.mrc:3954:1260: 200 -- {'success': True, 'next_record_length': 1246, 'next_record_offset': 5214, 'authors': [{'status': 'created', 'name': 'Alexandra Ch. J. von Miller', 'key': '/authors/OL8001063A'}], 'work': {'status': 'created', 'key': '/works/OL20909527W'}, 'edition': {'status': 'created', 'key': '/books/OL28324542M'}}
marc_oapen/oapen.marc.mrc:5214:1246: 200 -- {'success': True, 'next_record_length': 1763, 'work': {'status': 'matched', 'key': '/works/OL20909527W'}, 'edition': {'status': 'modified', 'key': '/books/OL28324542M'}, 'next_record_offset': 6460}
marc_oapen/oapen.marc.mrc:6460:1763: 200 -- {'success': True, 'next_record_length': 2229, 'work': {'status': 'created', 'key': '/works/OL20909528W'}, 'edition': {'status': 'created', 'key': '/books/OL28324543M'}, 'next_record_offset': 8223}
marc_oapen/oapen.marc.mrc:8223:2229: 200 -- {'success': True, 'next_record_length': 1813, 'next_record_offset': 10452, 'authors': [{'status': 'created', 'name': 'Morten Beckmann', 'key': '/authors/OL8001064A'}], 'work': {'status': 'created', 'key': '/works/OL20909529W'}, 'edition': {'status': 'created', 'key': '/books/OL28324545M'}}
marc_oapen/oapen.marc.mrc:10452:1813: 200 -- {'success': True, 'next_record_length': 1988, 'next_record_offset': 12265, 'authors': [{'status': 'matched', 'name': 'Peter Altmann', 'key': '/authors/OL3192866A'}], 'work': {'status': 'created', 'key': '/works/OL20909530W'}, 'edition': {'status': 'created', 'key': '/books/OL28324546M'}}
marc_oapen/oapen.marc.mrc:12265:1988: 200 -- {'success': True, 'next_record_length': 1882, 'work': {'status': 'created', 'key': '/works/OL20909531W'}, 'edition': {'status': 'created', 'key': '/books/OL28324547M'}, 'next_record_offset': 14253}
marc_oapen/oapen.marc.mrc:14253:1882: 200 -- {'success': True, 'next_record_length': 1651, 'next_record_offset': 16135, 'authors': [{'status': 'created', 'name': 'Emanuel Ruoss', 'key': '/authors/OL8001065A'}], 'work': {'status': 'created', 'key': '/works/OL20909533W'}, 'edition': {'status': 'created', 'key': '/books/OL28324548M'}}
marc_oapen/oapen.marc.mrc:16135:1651: 200 -- {'success': True, 'next_record_length': 1985, 'work': {'status': 'created', 'key': '/works/OL20909534W'}, 'edition': {'status': 'created', 'key': '/books/OL28324549M'}, 'next_record_offset': 17786}

One issue resolved by #3573
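
The log above suggests how the records are walked: each response reports `next_record_offset` and `next_record_length`, and the following line uses exactly those values in its `file:offset:length` locator. A minimal sketch of reading one such log line and deriving the next locator is below. The function names are hypothetical and this is not the actual import script; it only assumes the log format shown above.

```python
import ast
import re

# Matches lines such as:
#   marc_oapen/oapen.marc.mrc:1977:1977: 200 -- {...}
LOG_LINE = re.compile(
    r"^(?P<path>\S+):(?P<offset>\d+):(?P<length>\d+): "
    r"(?P<status>\d{3}) -- (?P<body>\{.*\})$"
)


def parse_log_line(line):
    """Split one import log line into its locator, HTTP status and response."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised log line: %r" % line)
    return {
        "path": match["path"],
        "offset": int(match["offset"]),
        "length": int(match["length"]),
        "http_status": int(match["status"]),
        # The response is printed as a Python dict repr, not JSON.
        "response": ast.literal_eval(match["body"]),
    }


def next_locator(entry):
    """Build the locator of the record that follows this one."""
    response = entry["response"]
    return "%s:%d:%d" % (
        entry["path"],
        response["next_record_offset"],
        response["next_record_length"],
    )


# Second line of the log above, copied verbatim.
SAMPLE_LINE = (
    "marc_oapen/oapen.marc.mrc:1977:1977: 200 -- {'success': True, "
    "'next_record_length': 1260, 'next_record_offset': 3954, 'authors': "
    "[{'status': 'matched', 'name': 'Arnd Reitemeier', 'key': "
    "'/authors/OL1511292A'}], 'work': {'status': 'created', 'key': "
    "'/works/OL20909526W'}, 'edition': {'status': 'created', 'key': "
    "'/books/OL28324541M'}}"
)

entry = parse_log_line(SAMPLE_LINE)
print(next_locator(entry))
print(entry["response"]["edition"]["status"])
```

The same parsed entries could be tallied by `edition` status (`created`, `matched`, `modified`) to summarise a full import log.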

hornc commented 4 years ago

Another issue:

- [ ] If an existing match is found on import, the available links are not added, e.g. https://openlibrary.org/books/OL28057410M/Disability_in_Industrial_Britain. In order to direct users to open access works, the links should be added to existing items on import.
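
One way the requested behaviour could look is sketched below: links from the incoming record are appended to the matched edition unless the same URL is already present. This is a hypothetical helper, not Open Library's import code, and it assumes each link is a dict with `url` and `title` keys.

```python
def merge_links(existing, incoming):
    """Return existing links plus any incoming links with a new URL.

    Existing entries win on conflict, so a title already stored on the
    edition is never overwritten by the imported record.
    """
    merged = list(existing)
    seen = {link["url"] for link in existing}
    for link in incoming:
        if link["url"] not in seen:
            merged.append(link)
            seen.add(link["url"])
    return merged


# Illustrative data only; the titles here are made up for the example.
on_edition = [{"url": "http://example.org/a", "title": "Publisher page"}]
from_marc = [
    {"url": "http://example.org/a", "title": "Access full text online"},
    {"url": "http://example.org/b", "title": "Access full text online"},
]
result = merge_links(on_edition, from_marc)
print(result)
```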

hornc commented 4 years ago

Also, the raw MARC binary data appears to have broken character sets and offsets in the source data we have. I have obtained MARC XML, converted it correctly to UTF-8, and re-uploaded it to the item as oapen.marc.REPAIRED.mrc

hornc commented 4 years ago

The MARC XML has fewer records, and the link format is not correct either: license URLs are in place of the link descriptions, e.g.:

      <datafield tag="856" ind1="4" ind2="0">
         <subfield code="u">http://oapen.org/download?type=document&amp;docid=620931</subfield>
         <subfield code="z">https://creativecommons.org/licenses/by-nc-nd/4.0/</subfield>
      </datafield>

Both data files unfortunately will need further work to repair them and get them into a state for importing.

XML records: 3932
Binary MARC records: 10355

hornc commented 3 years ago

I believe I have imported all the available DOAB records into OL from https://archive.org/download/marc_oapen . Unfortunately the original source of these records is no longer active, so it isn't possible to get an accurate current version.

tfmorris commented 3 years ago

> The links were imported from 856$u, but no titles from 856$z. The links do not show in the current books UI.

I don't understand the decision not to show the link. How is the user going to discover that this is a freely downloadable book without it?

> Unfortunately the original source of these records is no longer active, so it isn't possible to get an accurate current version.

The FAQ says everything is available in MARCXML format using OAI. e.g. all Math books here: https://www.doabooks.org/oai?verb=ListRecords&set=Mathematics_and_Statistics&metadataPrefix=marcxml

hornc commented 3 years ago

@tfmorris I was getting the bulk data from the previous link above via http://www.oapen.org/, which looks outdated.

I'll try to get the full set using OAI (requires many requests AFAICT). Looks like I can get it, but it needs processing to extract the MARCXML from the OAI response XML.
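
As a rough illustration of the processing step mentioned above, the sketch below pulls the MARCXML records and the resumption token out of a single OAI-PMH `ListRecords` response. The two namespaces are the standard OAI-PMH 2.0 and MARC21 slim ones; the sample response and token are invented for the example, and whether the DOAB endpoint nests records exactly this way is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "marc": "http://www.loc.gov/MARC21/slim",
}


def parse_oai_page(xml_text):
    """Return (list of MARCXML record strings, resumption token or None)."""
    root = ET.fromstring(xml_text)
    records = [
        ET.tostring(record, encoding="unicode")
        for record in root.iterfind(".//oai:metadata/marc:record", NS)
    ]
    token_el = root.find(".//oai:resumptionToken", NS)
    token = (token_el.text or "").strip() if token_el is not None else ""
    # An absent or empty token means this was the last page.
    return records, token or None


def next_request_url(base_url, token):
    """In OAI-PMH, a follow-up request carries only verb and resumptionToken."""
    query = urlencode({"verb": "ListRecords", "resumptionToken": token})
    return base_url + "?" + query


# Made-up response for illustration; not real output from the endpoint.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>example:1</identifier></header>
      <metadata>
        <marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
          <marc:datafield tag="856" ind1="4" ind2="0">
            <marc:subfield code="u">http://books.openedition.org/ceup/1571</marc:subfield>
          </marc:datafield>
        </marc:record>
      </metadata>
    </record>
    <resumptionToken>hypothetical-token</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_oai_page(SAMPLE_RESPONSE)
print(len(records), token)
print(next_request_url("https://www.doabooks.org/oai", token))
```

A full harvest would repeat the request with each returned token until none comes back, which is where the "many requests" come from.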

> The links were imported from 856$u, but no titles from 856$z. The links do not show in the current books UI.

This was a bug, fixed in some further imports by #3573.

The records from the oai endpoint have links like:

        <datafield tag="856" ind1="4" ind2="0">
          <subfield code="u">https://www.doabooks.org/doab?func=fulltext&amp;rid=17062</subfield>
          <subfield code="z">Description of rights in Directory of Open Access Books (DOAB): OpenEdition licence for Books</subfield>
        </datafield>
        <datafield tag="856" ind1="4" ind2="0">
          <subfield code="u">http://books.openedition.org/ceup/1571</subfield>
        </datafield>

The second link has no title or description, which the OL import process currently expects.

~In the absence of any description OL could use the URL as the (required) description. This would require another code change.~ Looking at #3573 again, it will use the description "External source" if none is specified in the MARC.
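
To make the 856 handling concrete, here is a small sketch that reads $u and $z from the two datafields quoted above and falls back to "External source" when $z is missing. It is only an illustration of the behaviour described for #3573, not the code from that change. The sample omits the MARCXML namespace, as the excerpt above does, so real records would need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

DEFAULT_TITLE = "External source"  # fallback title described in this thread

# The two 856 fields quoted above, wrapped in a bare <record> element.
SAMPLE_RECORD = """<record>
  <datafield tag="856" ind1="4" ind2="0">
    <subfield code="u">https://www.doabooks.org/doab?func=fulltext&amp;rid=17062</subfield>
    <subfield code="z">Description of rights in Directory of Open Access Books (DOAB): OpenEdition licence for Books</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="0">
    <subfield code="u">http://books.openedition.org/ceup/1571</subfield>
  </datafield>
</record>"""


def extract_links(record_xml):
    """Collect {'url', 'title'} dicts from every 856 field that has a $u."""
    root = ET.fromstring(record_xml)
    links = []
    for field in root.iter("datafield"):
        if field.get("tag") != "856":
            continue
        url = field.findtext("subfield[@code='u']")
        if not url:
            continue  # an 856 without a URL gives nothing to link to
        title = field.findtext("subfield[@code='z']") or DEFAULT_TITLE
        links.append({"url": url.strip(), "title": title.strip()})
    return links


links = extract_links(SAMPLE_RECORD)
for link in links:
    print(link["title"], "->", link["url"])
```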

tfmorris commented 3 years ago

For the links, I'm judging by what I see at https://openlibrary.org/books/OL28324519M. Is there a link rendered someplace on that page that I'm missing?

The replacement for http://www.oapen.org/content/metadata appears to be the links on https://www.oapen.org/resources/15635975-metadata

hornc commented 3 years ago

Thanks for those links @tfmorris, the MARC XML from https://www.oapen.org/resources/15635975-metadata has more records than I imported previously. Their links are also better described; here's an example: https://openlibrary.org/books/OL31366776M

Another example: https://openlibrary.org/books/OL31366979M/Sex_and_gender_in_biomedicine

hornc commented 3 years ago

I have extracted a UTF-8 MARC dump from https://www.oapen.org/resources/15635975-metadata (uploaded to https://archive.org/download/marc_oapen as convert_oapen_20201117.mrc). The import log is uploaded to the same item as DOAB_NOV.log:

https://archive.org/download/marc_oapen/DOAB_NOV.log