internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Import https://open.umn.edu/opentextbooks Open Text books #8551

Open jmagosta opened 9 months ago

jmagosta commented 9 months ago

I found a source of freely-available textbooks that are not indexed in OpenLibrary. This issue is to consider including their corpus in Open Library. Actually a short web search brings up several sites that announce freely available texts.

Tasks

There are numerous sources; I'm not sure which one I was checking out:

See https://open.umn.edu/opentextbooks, https://oercommons.org/hubs/open-textbooks, https://collegeopentextbooks.org/, https://openstax.org/.

RefersTo: TrustedBookProvider

Billa05 commented 9 months ago

I would like to help with this issue.

Billa05 commented 9 months ago

Apparently this website (https://collegeopentextbooks.org/) redirects to https://oercommons.org/, and I believe OpenStax (https://openstax.org/) is already among the trusted book providers.

So we are left with https://open.umn.edu/ and https://oercommons.org/; both of them are pretty good. @cdrini, what are your thoughts on this, and could you please guide me through the process to solve this issue?

cdrini commented 9 months ago

Let's consider this issue for open.umn.edu.

It seems like they have an API: https://open.umn.edu/opentextbooks/OTL-API.pdf

Here's the list of their textbooks as JSON: https://open.umn.edu/opentextbooks/textbooks.json

Hmm, it seems like this is an aggregator of books from other sites; our other Trusted Book Providers have been their own book publishers. This one might be a little more complicated, and might require some new infrastructure to support (we might want to do this using direct providers, which isn't fully implemented yet!). I don't think we can use them as a Book Provider since they're not directly providing the book. But we can import them for now!

So the task for this one is to use the API and convert all the books to our ImportRecord format (as defined by https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json ).
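The conversion step described above could look roughly like the following. This is only a sketch: the output keys mirror the records posted later in this thread, but the input keys (`id`, `language`, `subjects[].name`, `publishers[].name`, `authors`, `publish_date`) are hypothetical stand-ins for whatever the Open Textbook Library API actually returns, and the function name `to_import_record` is made up for illustration.

```python
from typing import Any

# Prefix used in the records posted later in this thread; the thread
# subsequently renames it to "open_textbook_library".
SOURCE_PREFIX = "opentextbooks"


def to_import_record(book: dict[str, Any]) -> dict[str, Any]:
    """Map one API book (hypothetical input keys) to an import record."""
    book_id = book["id"]
    record: dict[str, Any] = {
        "title": book["title"],
        "languages": [book["language"]],
        "description": book.get("description", ""),
        "subjects": [subject["name"] for subject in book.get("subjects", [])],
        "publishers": [pub["name"] for pub in book.get("publishers", [])],
        "publish_date": book["publish_date"],
        # Value shapes mirror the records in this thread; check them
        # against import.schema.json before relying on them.
        "identifiers": {SOURCE_PREFIX: book_id},
        "source_records": [f"{SOURCE_PREFIX}:{book_id}"],
        "authors": [{"name": name} for name in book.get("authors", [])],
    }
    # Only emit ISBN fields when the source actually has a value.
    for key in ("isbn_10", "isbn_13"):
        value = book.get(key)
        if value:
            record[key] = value
    return record
```

Leaving out empty ISBN fields is a choice made for this sketch, not something the thread settles.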

Note:

Billa05 commented 9 months ago

Thank you for the instructions. I will try to implement it; if I get stuck somewhere, I will ping you on Slack.

cdrini commented 9 months ago

Nice! I'd recommend pinging on the main channel, and @-mention me :) That way if someone else can answer your question before me you'll get a response faster!

Billa05 commented 8 months ago

Hey @cdrini, I was working on #8529; as it's done now, I will try my best to help with this issue. So far I can convert this many parameters, but I am not able to find a way to import book covers. Could you please help me with that? And is there any required parameter I am missing?

[Screenshot 2023-12-03 222137]

cdrini commented 8 months ago

Hey! You should be able to use eg `cover: 'https://...'` I believe. That should upload the cover!

Oh or do you mean the cover is not in the opentextbooks data API?

Billa05 commented 8 months ago

> Hey! You should be able to use eg `cover: 'https://...'` I believe. That should upload the cover!
>
> Oh or do you mean the cover is not in the opentextbooks data API?

It is not in the API. Could you check once, please?

cdrini commented 8 months ago

Alas it looks like the covers aren't available. That's ok.

Other notes:

After those changes, can you generate the import records for ~10 textbooks and display them here? That'll make it easier to verify if everything is in the right field and working correctly!
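Before eyeballing the generated records, a quick mechanical check for missing fields can help. The sketch below is hypothetical: the list of required fields is an assumption based on a reading of the import schema linked earlier in the thread, so confirm it against `import.schema.json` before trusting it.

```python
# Assumed required fields; verify against import.schema.json.
REQUIRED_FIELDS = ("title", "source_records", "authors", "publishers", "publish_date")


def missing_required_fields(record: dict) -> list[str]:
    """Return the assumed-required field names that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]
```

An empty list means every assumed-required field has a non-empty value; it says nothing about whether the values have the right type.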

cdrini commented 8 months ago

@Billa05 Can you please pretty print these so that they're easier to verify? And place them in triple back ticks (see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#quoting-code ). Using a start back tick fence like this: ```json will syntax highlight the json as well making it easier to validate

Billa05 commented 8 months ago

@cdrini, some of them are highlighted in red. Does that mean they are incorrect?

cdrini commented 8 months ago

Oh I just realized it's python not json! So use ```py

It would also be helpful if you could format them so that they have newlines; I usually search for something like "python autoformat online" and then use something there.
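As an alternative to an online formatter, the standard library's `pprint` module can produce the multi-line layout asked for here. A minimal sketch, with `format_record` being a made-up helper name:

```python
from pprint import pformat


def format_record(record: dict) -> str:
    """Return a multi-line, Python-literal rendering of one record."""
    # sort_dicts=False keeps the keys in insertion order (Python 3.8+).
    return pformat(record, width=100, sort_dicts=False)
```

Calling `print(format_record(record))` gives text that can be pasted into a ```` ```py ```` fence. Long strings such as descriptions are wrapped across several lines as adjacent string literals, which is still valid Python.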

Billa05 commented 8 months ago

Hey @cdrini, does this work?

{
    "title": "Methods for Stress Management",
    "isbn_10": None,
    "isbn_13": None,
    "languages": ["eng"],
    "description": "Stress is a feeling you get when faced with a challenge. In small doses, stress can be good for you because it makes you more alert and gives you a burst of energy. For instance, if you start to cross the street and see a car about to run you over, that jolt you feel helps you to jump out of the way before you get hit. But feeling stressed for a long time can take a toll on your mental and physical health. Even though it may seem hard to find ways to de-stress with all the things you have to do, it’s important to find those ways. Your health depends on it.",
    "subjects": ["Humanities"],
    "publishers": ["Pennsylvania State University"],
    "publish_date": "2023-12-04",
    "identifiers": {"opentextbooks": 1540},
    "source_records": ["opentextbooks:1540"],
    "authors": [{"name": "Allen Urich"}],
    "contributors": [],
    "lc_classifications": [],
}

{
    "title": "Counting Rocks! An Introduction to Combinatorics",
    "isbn_10": None,
    "isbn_13": None,
    "languages": ["eng"],
    "description": "This textbook, Counting Rocks!, is the written component of an interactive introduction to combinatorics at the undergradaute level. Throughout the text, we link to videos where we describe the material and provide examples; see the Youtube playlist on the Colorado State University (CSU) Mathematics YouTube channel. The major topics in this text are counting problems (Chapters 1-4), proof techniques (Chapter 5), recurrence relations and generating functions (Chapters 6-7), and graph theory (Chapters 8-12). The material and the problems we include are standard for an undergraduate combinatorics course. In this text, one of our goals was to describe the mathematical structures underlying problems in combinatorics. For example, we separate the description of sequences, permutations, sets and multisets in Chapter 3. In addition to the videos, we would like to highlight some other features of this book. Most chapters contain an investigation section, where students are led through a series of deeper problems on a topic. In several sections, we show students how to use the free online computing software SAGE in order to solve problems; this is especially useful for the problems on recurrence relations. We have included many helpful figures throughout the text, and we end each chapter (and many of the sections) with a list of exercises of varying difficulty.",
    "subjects": ["Mathematics"],
    "publishers": ["Henry Adams", "Rachel Pries", "Maria Gillespie"],
    "publish_date": "2023-11-29",
    "identifiers": {"opentextbooks": 1535},
    "source_records": ["opentextbooks:1535"],
    "authors": [
        {"name": "Henry Adams"},
        {"name": "Kelly Emmrich"},
        {"name": "Maria Gillespie"},
        {"name": "Shannon Golden"},
        {"name": "Dr. Rachel Pries"},
    ],
    "contributors": [],
    "lc_classifications": ["QA1"],
}

{
    "title": "Piacere!: Elementary Italian at The University of Iowa",
    "isbn_10": None,
    "isbn_13": None,
    "languages": ["ita"],
    "description": "Piacere! is an elementary Italian open-access textbook authorized by the Italian faculty of The University of Iowa. A comprehensive and flexible e-textbook, this Open Education Resource aims to support students’ acquisition of the grammar and vocabulary that ensure meaningful communication as well as to enhance students’ familiarity with Italian culture. Piacere! is divided into twenty-one units, which revolve around a specific theme, in order to facilitate a comprehensive learning and use of the language. Units include the following components: grammar, vocabulary, conversations, and readings. While in its organization the textbook is meant to offer the students with a flexible and approachable individual study experience, each component may be easily translated into classroom activities.",
    "subjects": ["Humanities", "Languages"],
    "publishers": ["University of Iowa"],
    "publish_date": "2023-11-29",
    "identifiers": {"opentextbooks": 1539},
    "source_records": ["opentextbooks:1539"],
    "authors": [
        {"name": "Lucia Gemmani"},
        {"name": "Irene Lottini"},
        {"name": "Claudia Sartini-Rideout"},
    ],
    "contributors": ["Andrea Trucchia"],
    "lc_classifications": ["P51"],
}

{
    "title": "Case Studies for Health, Research and Practice in Australia and New Zealand",
    "isbn_10": None,
    "isbn_13": None,
    "languages": ["eng"],
    "description": "The OER includes case studies of 5 families from a variety of backgrounds in metropolitan and regional Queensland (QLD), New South Wales (NSW), and Victoria (VIC), Australia. Case studies have been popular in nursing to help students bring their learning to life and enhance their critical thinking. However, often case studies appear in one unit or one particular content area to aid students learning for a particular condition or point in time. Taking a transformational, place-based approach, the OER case studies for health are set within metropolitan and regional areas, so learning is contextual and relatable. Case studies increase in complexity so that students can be introduced to and ‘get to know’ the families from their first year of study. As students progress through their studies, they meet the families again in different, often more complex scenarios. The families experience a variety of political and socio-economic circumstances, which helps students to learn about various healthcare contexts, build knowledge and understanding about the families’ circumstances from a holistic, person-centred, interprofessional perspective, and engage at a deeper level.",
    "subjects": ["Medicine"],
    "publishers": ["Council of Australian University Librarians"],
    "publish_date": "2023-11-01",
    "identifiers": {"opentextbooks": 1538},
    "source_records": ["opentextbooks:1538"],
    "authors": [
        {"name": "Nicola Whiteing"},
        {"name": "Lucy Shinners"},
        {"name": "Nicole Graham"},
        {"name": "Dima Nasrawi"},
        {"name": "Donna Wilson"},
        {"name": "Anna Foster"},
        {"name": "Elicia Kunst"},
        {"name": "Jennene Greenhill"},
    ],
    "contributors": [],
    "lc_classifications": ["RA440"],
}

{
    "title": "Un Análisis Científico del Ruido Ambiental y Laboral en Sectores Urbanos",
    "isbn_10": None,
    "isbn_13": "9789942651075",
    "languages": ["spa"],
    "description": "¿Cuál es el nivel de ruido existente Av. Av. 3 de Julio y sus intersecciones entre la calle Ambato y la Y del Indio Colorado de la ciudad de Santo Domingo?, el presente trabajo de investigación tiene como objetivo principal conocer el nivel de ruido existente en este sector de la ciudad, con el fin de determinar el alcance de su afectación y sobre todo sus fuentes, se realizó la evaluación correspondiente de acuerdo con la normativa TULSMA. y Decreto Ejecutivo N° 2393, aplicando la investigación de campo se midió la presencia de ruido ambiental y ocupacional, se requirió del uso de dosímetros calibrados y certificados para obtener información precisa. En esta investigación se estudió estas importantes intersecciones, siendo que, en cuanto a normativa ambiental sobrepasa el nivel tolerable con un diferencial de 10 decibeles en promedio, mientras que en referencia del ruido laboral esta intersección está por debajo de los niveles permisibles con al menos 9 decibeles en promedio, las posibles enfermedades subyacentes más comunes encontradas fueron el estrés, los dolores de cabeza que afectaban principalmente a la población comerciante permanente, y la actividad que mayor emisión de ruido produjo fue el tránsito vehicular, del mismo modo se identificó que el día que se produjo mayor contaminación acústica fue el sábado entre las 11:30 am hasta la 13:00 pm.",
    "subjects": ["Mathematics", "Natural Sciences", "Earth Sciences"],
    "publishers": ["Editorial Grupo AEA"],
    "publish_date": "2023-11-01",
    "identifiers": {"opentextbooks": 1537},
    "source_records": ["opentextbooks:1537"],
    "authors": [
        {"name": "Washington Javier Astudillo-Martínez"},
        {"name": "Aida Gabriela Andrade-Bravo"},
        {"name": "Jonathan-Douglas García-Valdez"},
        {"name": "Yuli Fernanda Almenaba-Guerrero"},
    ],
    "contributors": [],
    "lc_classifications": ["QA1", "QH301", "QE1"],
}

{
    "title": "Evaluación de la Satisfacción Laboral y Rendimiento Productivo de los Piscicultores Comunitarios",
    "isbn_10": None,
    "isbn_13": "9789942651068",
    "languages": ["spa"],
    "description": "El objetivo fue Evaluación del Satisfacción Laboral y Rendimiento Productivo de los Piscicultores Comunitarios. El trabajo de investigación se realizó en las unidades productoras de truchas de la comunidad de Pacococha, distrito y provincia de Castrovirreyna. Las variables de estudio fueron la satisfacción laboral y la productividad. El tipo de investigación es básica. El nivel de la investigación es Correlacional. El método es descriptivo, cualitativo y cuantitativo, describiéndose las variables involucradas y analizando su incidencia e interrelación en función a la relación causa – efecto. El diseño de investigación fue descriptivo – correlacional. En la investigación se tuvo como población a 20 piscicultores entre trabajadores y jefes, y debido a que el número de unidades que la integraron resulto accesible en su totalidad, esta fue igual a la muestra, es decir los 20 piscicultores. Para el sustento de la parte teórica se consultó diferentes fuentes bibliográficas y para el trabajo de campo se aplicó cuestionarios a la muestra identificada, formulándose dos instrumentos, que fueron validados oportunamente por juicio de expertos, a fin de efectivizar su aplicación correspondiente, los instrumentos utilizados fueron los cuestionarios de encuesta de satisfacción laboral y productividad, donde cada pregunta fue realizada de acorde a las variables considerando sus dimensiones e indicadores, con los cuales se obtuvo la información pertinente de los trabajadores de las unidades productoras. Los resultados nos muestran un coeficiente de correlación de Pearson r = 0.672, con un nivel de significancia menor a 0,05 (p-valor = 0,001). Por lo tanto, al ser el p-valor significativo concluimos que existe correlación entre las variables de estudio, se acepta la hipótesis alterna con un nivel de confianza del 95%. \nComo conclusión principal se ha determinado a través de la investigación que la productividad del factor humano se relaciona de forma positiva y moderada con la satisfacción laboral del personal de producción de las piscigranjas, con un grado de relación del 45.2%.",
    "subjects": ["Social Sciences", "Economics"],
    "publishers": ["Editorial Grupo AEA"],
    "publish_date": "2023-11-01",
    "identifiers": {"opentextbooks": 1536},
    "source_records": ["opentextbooks:1536"],
    "authors": [
        {"name": "Alberto Hugo Deza-Matías"},
        {"name": "Manuel Castrejón-Valdez"},
        {"name": "Edwin Rojas-Felipe"},
        {"name": "Noemi Gladys Mencia-Sánchez"},
        {"name": "Jorge Washington Rodríguez-Deza"},
        {"name": "Russbelt Yaulilahua-Huacho"},
    ],
    "contributors": [],
    "lc_classifications": ["H1", "HB171.5"],
}

{
    "title": "Reading Social Science Methods",
    "isbn_10": None,
    "isbn_13": "9781946011190",
    "languages": ["eng"],
    "description": "Science has great potential to benefit society, but this potential comes with risks as well. Directed at introductory level social science and humanities majors, this textbook teaches the rules and limits of social science methods. Reisner starts from the assumption that it is not necessary to be able to do research to read and judge the soundness of research publications. The chapters guide students through an explicit set of rules for reading research articles developed from three common research methods: content analysis, survey research, and experimental method.",
    "subjects": ["Social Sciences", "Humanities"],
    "publishers": ["University of Illinois Library - Urbana"],
    "publish_date": "2023-11-01",
    "identifiers": {"opentextbooks": 1534},
    "source_records": ["opentextbooks:1534"],
    "authors": [{"name": "Ann Reisner"}],
    "contributors": [],
    "lc_classifications": ["H1"],
}

{
    "title": "An Open Guide to Data Structures and Algorithms",
    "isbn_10": None,
    "isbn_13": None,
    "languages": ["eng"],
    "description": "This textbook serves as a gentle introduction for undergraduates to theoretical concepts in data structures and algorithms in computer science while providing coverage of practical implementation (coding) issues. The field of computer science (CS) supports a multitude of essential technologies in science, engineering, and communication as a social medium. The varied and interconnected nature of computer technology permeates countless career paths making CS a popular and growing major program. Mastery of the science behind computer science relies on an understanding of the theory of algorithms and data structures. These concepts underlie the fundamental tradeoffs that dictate performance in terms of speed, memory usage, and programming complexity that separate novice programmers from professional practitioners.",
    "subjects": ["Computer Science"],
    "publishers": ["PALNI"],
    "publish_date": "2021-06-28",
    "identifiers": {"opentextbooks": 1017},
    "source_records": ["opentextbooks:1017"],
    "authors": [{"name": "Paul W. Bible"}, {"name": "Lucas Moser"}],
    "contributors": ["Mia M. Scarlato"],
    "lc_classifications": ["QA76"],
}

{
    "title": "Introduction to Soil Science",
    "isbn_10": None,
    "isbn_13": None,
    "languages": ["eng"],
    "description": "This textbook introduces readers to the basics of soil science, including the physical, chemical, and biological properties of soils; soil formation, classification, and global distribution; soil health, soils and humanity, and sustainable land management.",
    "subjects": ["Natural Sciences", "Earth Sciences"],
    "publishers": ["Iowa State University Digital Press"],
    "publish_date": "2022-07-19",
    "identifiers": {"opentextbooks": 1206},
    "source_records": ["opentextbooks:1206"],
    "authors": [{"name": "Dr. Amber Anderson"}],
    "contributors": [],
    "lc_classifications": ["QH301", "QE1"],
}

{
    "title": "Introduction to Vacuum Technology",
    "isbn_10": None,
    "isbn_13": "9781942341963",
    "languages": ["eng"],
    "description": "Vacuum systems are critical to many industries. They are vital to establishing required process pressures, establishing a clean process environment, and removing reaction by-products from the process chamber. This text, a revision and expansion of David Hata’s Introduction to Vacuum Technology published in 2008, addresses basic topics in vacuum technology for individuals tasked with maintaining vacuum systems and instructors teaching technician-level courses. The topics are carefully curated to the needs of technicians in a production environment and the types of vacuum systems used, and the accompanying laboratory manual and instructor’s guide support the delivery of lecture-laboratory courses. This book approaches vacuum systems from a pressure regime viewpoint, covering basic vacuum science, followed by the rough vacuum regime, including gas load, pumping mechanisms, pressure measurement, vacuum system construction, and basic troubleshooting concepts. The study of high vacuum systems follows and the same topics are revisited, and finally the topics of leak detection and residual gas analysis are discussed.",
    "subjects": ["Engineering & Technology", "Electrical Engineering"],
    "publishers": ["Milne Open Textbooks"],
    "publish_date": "2023-10-25",
    "identifiers": {"opentextbooks": 1533},
    "source_records": ["opentextbooks:1533"],
    "authors": [
        {"name": "David M. Hata"},
        {"name": "Elena V. Brewer"},
        {"name": "Nancy J. Louwagie"},
    ],
    "contributors": [],
    "lc_classifications": ["TA145", "TK1"],
}

cdrini commented 8 months ago

Perfect thank you, @Billa05 ! That's much easier to validate :) Taking a look now...

cdrini commented 8 months ago

After that, we're looking good! If you have your local environment up and running, go to localhost:8080, log in, open your browser's developer console, and run the following:

await fetch('/api/import', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify(one_of_your_records)
}).then(r => r.json())

And see if the import completes successfully or if there are any errors!
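For anyone who prefers to script the same call rather than use the browser, here is a rough Python equivalent of the `fetch` snippet above. It only builds the request. The browser version relies on the logged-in session, so a script needs to supply credentials itself; the cookie name `session` and the helper name `build_import_request` are assumptions for illustration, not confirmed details of Open Library.

```python
import json
import urllib.request
from typing import Optional


def build_import_request(
    record: dict,
    base_url: str = "http://localhost:8080",
    session_cookie: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/import for one record."""
    headers = {"Content-Type": "application/json"}
    if session_cookie:
        # Assumed cookie name; copy the real one from your browser session.
        headers["Cookie"] = f"session={session_cookie}"
    return urllib.request.Request(
        url=f"{base_url}/api/import",
        data=json.dumps(record).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Sending it is then `urllib.request.urlopen(build_import_request(record, session_cookie=...))`, followed by parsing the response body as JSON.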

Billa05 commented 8 months ago

Hey @cdrini, I just got freed from my uni exams. I made the changes and I think it's working now. Should I try with more records to confirm?

I am a bit confused about how Open Library will get the link to the book since there is no link parameter in the metadata. Can you please explain how OL interacts with the API? Or is it that we are only storing information about the book?

Also, is there a way to test if the Python script is working before I raise a PR for review?

[Screenshots: 2023-12-16 223453, 2023-12-16 223627]

cdrini commented 8 months ago

That looks great! Can you remove Dr. from any author names?

We will be able to get the link using the ID: eg. https://open.umn.edu/opentextbooks/textbooks/1533

Yeah! Try uploading to your local the first 5, and if everything looks good to you, we'll start running them on prod!
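Removing the "Dr." prefix requested above can be done with a small regular expression. A minimal sketch, where `strip_title` is a hypothetical helper name and only a leading "Dr." (with the period) is handled:

```python
import re

# Matches a leading "Dr." followed by whitespace; names such as
# "Drew Smith" are left alone because there is no period.
_TITLE_PREFIX = re.compile(r"^\s*Dr\.\s+")


def strip_title(name: str) -> str:
    """Remove a leading 'Dr.' from an author name."""
    return _TITLE_PREFIX.sub("", name).strip()
```

It would be applied to each author entry, for example `{"name": strip_title(author["name"])}`.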

cdrini commented 8 months ago

Oh and let's change the name opentextbooks anywhere we have it (identifier, source_record) to open_textbook_library. That's the way they present themselves on their website, so that's likely the better name.
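The rename could be applied to already-generated records along these lines. This is a sketch only: `rename_source` and `textbook_url` are made-up helper names, and the URL pattern is taken from the example link given earlier in the thread.

```python
OLD_NAME = "opentextbooks"
NEW_NAME = "open_textbook_library"


def rename_source(record: dict) -> dict:
    """Return a copy of record with the old source name replaced."""
    updated = dict(record)
    identifiers = dict(record.get("identifiers", {}))
    if OLD_NAME in identifiers:
        identifiers[NEW_NAME] = identifiers.pop(OLD_NAME)
    updated["identifiers"] = identifiers
    updated["source_records"] = [
        f"{NEW_NAME}:{sr.split(':', 1)[1]}" if sr.startswith(f"{OLD_NAME}:") else sr
        for sr in record.get("source_records", [])
    ]
    return updated


def textbook_url(book_id) -> str:
    """Build the public textbook link from its ID."""
    return f"https://open.umn.edu/opentextbooks/textbooks/{book_id}"
```

The input record is not modified, which makes it easy to compare the before and after versions.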

cdrini commented 8 months ago

I added an identifier for open_textbook_library to https://openlibrary.org/config/edition , so this will link correctly once they start being imported :)

Billa05 commented 8 months ago

Oh, I get it now: it will use the ID, and OL already has the link. Hmm... nice. I made the changes and tried importing more data locally; it is working fine.

However, the API does not have the last updated date. How should I proceed further? Should I leave out that check completely, like the pressbook import? Also, how can I test the Python script before raising a PR? @cdrini

Billa05 commented 8 months ago

Hey @cdrini, I have submitted the pull request. Would you kindly review it?

Billa05 commented 8 months ago

I think the API has been updated. Could you please check once @cdrini? Now, for every book, I have the URL and the last updated time.

cdrini commented 7 months ago

@Billa05 Can you generate a file with all the books we'll need to import?

Something like `PYTHONPATH=. python scripts/import_open_textbook_library.py --limit -1 --dry-run | gzip > open_textbook_library.jsonl.gz`. We'll then use that to import it in bulk! Test it with `--limit 10` first to make sure the gzipping is working correctly. Then upload the file here :)
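One way to check that the gzipping worked is to read the file back and confirm every line parses. The sketch below assumes the dry run prints exactly one JSON object per line (as the `.jsonl` extension suggests); `count_jsonl_gz_records` is a hypothetical helper name.

```python
import gzip
import json


def count_jsonl_gz_records(path: str) -> int:
    """Count JSON-object lines in a gzipped JSONL file, failing on bad lines."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises ValueError on invalid JSON
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number} is not a JSON object")
            count += 1
    return count
```

With `--limit 10`, the count returned for the resulting file should be 10 if the assumption about the output format holds.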

Billa05 commented 7 months ago

Hey @cdrini, take a look at this file and let me know if it's correct. I'm not too familiar with Docker, so I had to put in some extra effort to get it right.

For me, after starting the container, this command worked: `docker exec -it -e PYTHONPATH=. e00f388d8b25 python scripts/import_open_textbook_library.py openlibrary.yml --limit -1 --dry-run | gzip > open_textbook_library.jsonl.gz`

FILE: open_textbook_library.jsonl.gz

github-actions[bot] commented 7 months ago

Assignees removed automatically after 14 days.