Closed mekarpeles closed 3 years ago
OAPEN is a related/joint project that has some additional books, as well as metadata in MARC/ONYX format if that's helpful for import: http://www.oapen.org/content/metadata
@mekarpeles I would like to take this. Can you guide me on this?
Hi @ChangezKhan, are you still interested in helping with this task? I am in the process of improving OL's import process in general and would like to enable more volunteers to begin with imports.
It sounds like we want to add more data validation, fix search, and fix the cleanup bot before we start importing more data. I'm assigning @hornc per Slack discussions for the time being.
OAPEN bulk MARC records are located at https://archive.org/download/marc_oapen, ready to be imported using the /api/import/ia endpoint.
These records have links to online materials in field 856
e.g.
856 40 $uhttp://www.oapen.org/download?type=document&docid=1006695$zAccess full text online
from https://openlibrary.org/show-records/marc_oapen/oapen.marc.mrc:3954:1260
Here is an example imported edition: https://openlibrary.org/books/OL28324519M.json
The links were imported from 856$u, but the link titles in 856$z were not.
The links do not show in the current books UI.
The importer should import the link title when available.
10 test items imported:
FILENAME: oapen.marc.mrc
marc_oapen/oapen.marc.mrc:0:5: 200 -- {'success': True, 'next_record_length': 1977, 'work': {'status': 'matched', 'key': '/works/OL20909504W'}, 'edition': {'status': 'matched', 'key': '/books/OL28324519M'}, 'next_record_offset': 1977}
marc_oapen/oapen.marc.mrc:1977:1977: 200 -- {'success': True, 'next_record_length': 1260, 'next_record_offset': 3954, 'authors': [{'status': 'matched', 'name': 'Arnd Reitemeier', 'key': '/authors/OL1511292A'}], 'work': {'status': 'created', 'key': '/works/OL20909526W'}, 'edition': {'status': 'created', 'key': '/books/OL28324541M'}}
marc_oapen/oapen.marc.mrc:3954:1260: 200 -- {'success': True, 'next_record_length': 1246, 'next_record_offset': 5214, 'authors': [{'status': 'created', 'name': 'Alexandra Ch. J. von Miller', 'key': '/authors/OL8001063A'}], 'work': {'status': 'created', 'key': '/works/OL20909527W'}, 'edition': {'status': 'created', 'key': '/books/OL28324542M'}}
marc_oapen/oapen.marc.mrc:5214:1246: 200 -- {'success': True, 'next_record_length': 1763, 'work': {'status': 'matched', 'key': '/works/OL20909527W'}, 'edition': {'status': 'modified', 'key': '/books/OL28324542M'}, 'next_record_offset': 6460}
marc_oapen/oapen.marc.mrc:6460:1763: 200 -- {'success': True, 'next_record_length': 2229, 'work': {'status': 'created', 'key': '/works/OL20909528W'}, 'edition': {'status': 'created', 'key': '/books/OL28324543M'}, 'next_record_offset': 8223}
marc_oapen/oapen.marc.mrc:8223:2229: 200 -- {'success': True, 'next_record_length': 1813, 'next_record_offset': 10452, 'authors': [{'status': 'created', 'name': 'Morten Beckmann', 'key': '/authors/OL8001064A'}], 'work': {'status': 'created', 'key': '/works/OL20909529W'}, 'edition': {'status': 'created', 'key': '/books/OL28324545M'}}
marc_oapen/oapen.marc.mrc:10452:1813: 200 -- {'success': True, 'next_record_length': 1988, 'next_record_offset': 12265, 'authors': [{'status': 'matched', 'name': 'Peter Altmann', 'key': '/authors/OL3192866A'}], 'work': {'status': 'created', 'key': '/works/OL20909530W'}, 'edition': {'status': 'created', 'key': '/books/OL28324546M'}}
marc_oapen/oapen.marc.mrc:12265:1988: 200 -- {'success': True, 'next_record_length': 1882, 'work': {'status': 'created', 'key': '/works/OL20909531W'}, 'edition': {'status': 'created', 'key': '/books/OL28324547M'}, 'next_record_offset': 14253}
marc_oapen/oapen.marc.mrc:14253:1882: 200 -- {'success': True, 'next_record_length': 1651, 'next_record_offset': 16135, 'authors': [{'status': 'created', 'name': 'Emanuel Ruoss', 'key': '/authors/OL8001065A'}], 'work': {'status': 'created', 'key': '/works/OL20909533W'}, 'edition': {'status': 'created', 'key': '/books/OL28324548M'}}
marc_oapen/oapen.marc.mrc:16135:1651: 200 -- {'success': True, 'next_record_length': 1985, 'work': {'status': 'created', 'key': '/works/OL20909534W'}, 'edition': {'status': 'created', 'key': '/books/OL28324549M'}, 'next_record_offset': 17786}
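Judging by the log above, each response reports the offset and length of the next record, so the locator for the following import can be derived from the previous response. A minimal sketch, assuming log lines have exactly the shape shown; `parse_log_line` and `next_locator` are hypothetical helper names, not part of Open Library, and the sample line is abbreviated from the log:

```python
import ast


def parse_log_line(line):
    """Split a log line into (locator, HTTP status, response dict)."""
    locator, rest = line.split(": ", 1)
    status, payload = rest.split(" -- ", 1)
    # The payload is a Python dict repr (single quotes, True), not JSON.
    return locator, int(status), ast.literal_eval(payload)


def next_locator(locator, response):
    """Build the locator of the following record from a response."""
    path = locator.rsplit(":", 2)[0]  # drop the trailing :offset:length
    offset = response["next_record_offset"]
    length = response["next_record_length"]
    return f"{path}:{offset}:{length}"


# Abbreviated copy of the second log line above.
line = (
    "marc_oapen/oapen.marc.mrc:1977:1977: 200 -- "
    "{'success': True, 'next_record_length': 1260, 'next_record_offset': 3954}"
)
locator, status, response = parse_log_line(line)
print(next_locator(locator, response))
# marc_oapen/oapen.marc.mrc:3954:1260
```

The printed locator matches the third line of the log, which is how the chain of ten test imports can be followed.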
One issue resolved by #3573
Another issue: [ ] If an existing match is found on import, the available links are not added, e.g. https://openlibrary.org/books/OL28057410M/Disability_in_Industrial_Britain. To direct users to open access works, the links should be added to existing items on import.
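One way the requested behaviour could look is sketched below: links from the incoming record are appended to a matched edition unless their URL is already present. This is only an illustration, not the importer's actual code; `merge_links` is a hypothetical helper, the example title is made up, and the assumption is that each link is a dict with `url` and `title` keys.

```python
def merge_links(existing, incoming):
    """Return existing links plus any incoming links whose URL is new."""
    merged = list(existing)
    seen = {link["url"] for link in existing}
    for link in incoming:
        if link["url"] not in seen:
            merged.append(link)
            seen.add(link["url"])
    return merged


# Hypothetical data: the matched edition has no links yet.
edition_links = []
record_links = [
    {"url": "http://www.oapen.org/download?type=document&docid=1006695",
     "title": "Access full text online"},
]
edition_links = merge_links(edition_links, record_links)
# Re-importing the same record does not duplicate the link.
edition_links = merge_links(edition_links, record_links)
print(len(edition_links))
```

Deduplicating on the URL alone keeps a re-import from adding the same link twice, at the cost of ignoring a changed title for an existing URL.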
Also, the raw MARC binary data appears to have broken character sets and offsets in the source data we have. I have obtained MARC XML and converted it correctly to utf-8 and re-uploaded to the item as oapen.marc.REPAIRED.mrc
The MARC XML has fewer records, and the link format is not correct either: license URLs are in place of the link descriptions, e.g.:
<datafield tag="856" ind1="4" ind2="0">
<subfield code="u">http://oapen.org/download?type=document&docid=620931</subfield>
<subfield code="z">https://creativecommons.org/licenses/by-nc-nd/4.0/</subfield>
Both data files unfortunately will need further work to repair them and get them into a state suitable for importing.
XML records: 3932
Binary MARC records: 10355
I believe I have imported all the available DOAB records into OL from https://archive.org/download/marc_oapen . Unfortunately the original source of these records is no longer active, so it isn't possible to get an accurate current version.
> The links were imported from 856$u, but no titles 856$z
> The links do not show in the current books UI.
I don't understand the decision not to show the link. How is the user going to discover that this is a freely downloadable book without it?
> Unfortunately the original source of these records is no longer active, so it isn't possible to get an accurate current version.
The FAQ says everything is available in MARCXML format using OAI. e.g. all Math books here: https://www.doabooks.org/oai?verb=ListRecords&set=Mathematics_and_Statistics&metadataPrefix=marcxml
@tfmorris I was getting the bulk data from the previous link above via http://www.oapen.org/ , which looks outdated
I'll try to get the full set using OAI (requires many requests AFAICT). Looks like I can get it, but it needs processing to extract the MARCXML from the OAI response XML.
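The extraction step could be sketched roughly as follows, using only the standard library. It assumes the endpoint returns a standard OAI-PMH `ListRecords` response with MARCXML records inside each `metadata` element and a `resumptionToken` for paging; `extract_marc_records` is a hypothetical helper name and the sample response is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 and MARC21 slim namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "marc": "http://www.loc.gov/MARC21/slim",
}


def extract_marc_records(oai_xml):
    """Return (list of MARCXML record elements, resumption token or None)."""
    root = ET.fromstring(oai_xml)
    records = root.findall(".//oai:metadata//marc:record", NS)
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, token.strip() or None


# Made-up, minimal response for illustration only.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <record xmlns="http://www.loc.gov/MARC21/slim">
          <datafield tag="856" ind1="4" ind2="0">
            <subfield code="u">http://books.openedition.org/ceup/1571</subfield>
          </datafield>
        </record>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = extract_marc_records(sample)
print(len(records), token)
```

A non-empty token would be sent back with the next `ListRecords` request, which is why fetching the full set takes many requests.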
> The links were imported from 856$u, but no titles 856$z
> The links do not show in the current books UI.
This was a bug, fixed in some further imports by #3573.
The records from the oai endpoint have links like:
<datafield tag="856" ind1="4" ind2="0">
<subfield code="u">https://www.doabooks.org/doab?func=fulltext&rid=17062</subfield>
<subfield code="z">Description of rights in Directory of Open Access Books (DOAB): OpenEdition licence for Books</subfield>
</datafield>
<datafield tag="856" ind1="4" ind2="0">
<subfield code="u">http://books.openedition.org/ceup/1571</subfield>
</datafield>
The second link has no description or other info, which the OL import process currently expects.
~In the absence of any description OL could use the URL as the (required) description. This would require another code change.~ Looking at #3573 again, it will use the description "External source" if there is none specified in the MARC.
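The fallback described here can be illustrated with a small sketch. This is not the code from #3573, just an approximation of the described behaviour: `links_from_record` is a hypothetical helper that reads 856$u and 856$z from a MARCXML record and substitutes "External source" when $z is absent. The sample record reuses the two fields quoted above, with `&` escaped so the XML parses.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}
DEFAULT_TITLE = "External source"  # fallback title mentioned in the thread


def links_from_record(record):
    """Collect {'url', 'title'} dicts from the 856 fields of a MARCXML record."""
    links = []
    for field in record.findall("marc:datafield[@tag='856']", MARC_NS):
        url = field.findtext("marc:subfield[@code='u']", namespaces=MARC_NS)
        if not url:
            continue  # an 856 without $u carries no link to import
        title = field.findtext("marc:subfield[@code='z']", namespaces=MARC_NS)
        links.append({"url": url.strip(),
                      "title": (title or DEFAULT_TITLE).strip()})
    return links


sample_record = ET.fromstring("""<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="856" ind1="4" ind2="0">
    <subfield code="u">https://www.doabooks.org/doab?func=fulltext&amp;rid=17062</subfield>
    <subfield code="z">Description of rights in Directory of Open Access Books (DOAB): OpenEdition licence for Books</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="0">
    <subfield code="u">http://books.openedition.org/ceup/1571</subfield>
  </datafield>
</record>""")

for link in links_from_record(sample_record):
    print(link["title"], "->", link["url"])
```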
For the links, I'm judging by what I see at https://openlibrary.org/books/OL28324519M. Is there a link rendered someplace on that page that I'm missing?
The replacement for http://www.oapen.org/content/metadata appears to be the links on https://www.oapen.org/resources/15635975-metadata
Thanks for those links @tfmorris, the MARC XML from https://www.oapen.org/resources/15635975-metadata has more records than I imported previously. Their links are also better described. Here's an example: https://openlibrary.org/books/OL31366776M
Another example: https://openlibrary.org/books/OL31366979M/Sex_and_gender_in_biomedicine
I have extracted a UTF-8 MARC dump from https://www.oapen.org/resources/15635975-metadata (uploaded to https://archive.org/download/marc_oapen as convert_oapen_20201117.mrc).
Import log uploaded to the same item as DOAB_NOV.log
@bnewbold suggested we add DOAB https://www.doabooks.org/ open access books to Open Library. Seems like a good job for https://github.com/internetarchive/openlibrary-client!
Anyone interested in trying to add these?