jonathanvanschenck / biblescrapeway

Scrape bible verses from the web
MIT License

Does it only scrape from biblegateway? Can you modify it to search other or all sites? #3

Open RayReddingtonx opened 1 year ago

RayReddingtonx commented 1 year ago

I tried to scrape the GNT NA28 and then realized the tool only scrapes the biblegateway site. Could it also accept a list of other websites to scrape from, including https://www.academic-bible.com/ and bible.com, not just biblegateway? I thought it searched the whole web, as the search results said. I don't know how to look into the code of this script. If you could extend it to cover various Bible sites, that would be very helpful. Thanks.

Here is my working code for scraping the entire NT to XML, in case anyone needs it to create modules for theword.net, esword, or mybible.zone.

import biblescrapeway
import xml.etree.ElementTree as ET
import xml.dom.minidom as minidom

# Define the list of NT books with the number of chapters
nt_books = [
    {"bnumber": "40", "bname": "Matthew", "chapters": 28},
    {"bnumber": "41", "bname": "Mark", "chapters": 16},
    {"bnumber": "42", "bname": "Luke", "chapters": 24},
    {"bnumber": "43", "bname": "John", "chapters": 21},
    {"bnumber": "44", "bname": "Acts", "chapters": 28},
    {"bnumber": "45", "bname": "Romans", "chapters": 16},
    {"bnumber": "46", "bname": "1 Corinthians", "chapters": 16},
    {"bnumber": "47", "bname": "2 Corinthians", "chapters": 13},
    {"bnumber": "48", "bname": "Galatians", "chapters": 6},
    {"bnumber": "49", "bname": "Ephesians", "chapters": 6},
    {"bnumber": "50", "bname": "Philippians", "chapters": 4},
    {"bnumber": "51", "bname": "Colossians", "chapters": 4},
    {"bnumber": "52", "bname": "1 Thessalonians", "chapters": 5},
    {"bnumber": "53", "bname": "2 Thessalonians", "chapters": 3},
    {"bnumber": "54", "bname": "1 Timothy", "chapters": 6},
    {"bnumber": "55", "bname": "2 Timothy", "chapters": 4},
    {"bnumber": "56", "bname": "Titus", "chapters": 3},
    {"bnumber": "57", "bname": "Philemon", "chapters": 1},
    {"bnumber": "58", "bname": "Hebrews", "chapters": 13},
    {"bnumber": "59", "bname": "James", "chapters": 5},
    {"bnumber": "60", "bname": "1 Peter", "chapters": 5},
    {"bnumber": "61", "bname": "2 Peter", "chapters": 3},
    {"bnumber": "62", "bname": "1 John", "chapters": 5},
    {"bnumber": "63", "bname": "2 John", "chapters": 1},
    {"bnumber": "64", "bname": "3 John", "chapters": 1},
    {"bnumber": "65", "bname": "Jude", "chapters": 1},
    {"bnumber": "66", "bname": "Revelation", "chapters": 22}
]

# Scrape the verses from the website for all NT books
verses = []
for book in nt_books:
    for chapter in range(1, book["chapters"] + 1):
        chapter_verses = biblescrapeway.query(f"{book['bname']} {chapter}", version="NMB")
        verses += chapter_verses

# Create a new XML tree for the NT books of the NMB version
xml_root = ET.Element("XMLBIBLE")
xml_root.set("xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance")
xml_root.set("biblename", "NMB")
for book in nt_books:
    book_node = ET.SubElement(xml_root, "BIBLEBOOK")
    book_node.set("bname", book["bname"])
    book_node.set("bnumber", book["bnumber"])

    # Initialize variables to keep track of chapter and verse numbers
    current_chapter = 0
    current_verse = 0
    chapter_node = None

    # Iterate over each verse object
    for verse in verses:
        # Check if the verse is in the current book and chapter
        if verse.book == book["bname"] and verse.chapter != current_chapter:
            current_chapter = verse.chapter
            current_verse = 0
            chapter_node = ET.SubElement(book_node, "CHAPTER")
            chapter_node.set("cnumber", str(current_chapter))

        # Increment verse number and add to XML tree
        if verse.book == book["bname"]:
            current_verse += 1
            verse_node = ET.SubElement(chapter_node, "VERS")
            verse_node.set("vnumber", str(current_verse))
            verse_node.text = verse.text.strip()

# Write the XML tree to a file with the desired indentation and XML declaration.
# toprettyxml() is called without an encoding so that it returns a str whose
# declaration is exactly '<?xml version="1.0" ?>', which the replace() below can match.
xml_string = minidom.parseString(ET.tostring(xml_root)).toprettyxml(indent="\t")
with open("NMB.xml", "w", encoding="utf-8") as f:
    f.write(xml_string.replace('<?xml version="1.0" ?>', '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>', 1))
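
For anyone adapting the script, the XML-building step can probably be done in a single pass over the verses rather than one pass per book. The following is only a sketch under assumptions: `Verse` is a hypothetical stand-in that mirrors just the `book`, `chapter`, and `text` attributes the script reads, and `build_bible_xml` and `to_pretty_xml` are illustrative helper names, not part of biblescrapeway.

```python
import xml.etree.ElementTree as ET
import xml.dom.minidom as minidom
from collections import namedtuple

# Hypothetical stand-in for the objects returned by biblescrapeway.query();
# it only mirrors the attributes the script above reads.
Verse = namedtuple("Verse", ["book", "chapter", "text"])


def build_bible_xml(verses, books, biblename):
    """Build an XMLBIBLE tree from verses, visiting each verse only once."""
    root = ET.Element("XMLBIBLE")
    root.set("xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance")
    root.set("biblename", biblename)

    # One BIBLEBOOK node per book, created up front so the book order is kept.
    book_nodes = {}
    for book in books:
        node = ET.SubElement(root, "BIBLEBOOK")
        node.set("bname", book["bname"])
        node.set("bnumber", book["bnumber"])
        book_nodes[book["bname"]] = node

    chapter_nodes = {}  # (book name, chapter) -> CHAPTER node
    verse_counts = {}   # (book name, chapter) -> verses added so far
    for verse in verses:
        if verse.book not in book_nodes:
            continue  # skip verses from books that were not requested
        key = (verse.book, verse.chapter)
        if key not in chapter_nodes:
            chapter = ET.SubElement(book_nodes[verse.book], "CHAPTER")
            chapter.set("cnumber", str(verse.chapter))
            chapter_nodes[key] = chapter
            verse_counts[key] = 0
        verse_counts[key] += 1
        vers = ET.SubElement(chapter_nodes[key], "VERS")
        vers.set("vnumber", str(verse_counts[key]))
        vers.text = verse.text.strip()
    return root


def to_pretty_xml(root):
    """Serialize with tab indentation and a UTF-8 standalone declaration."""
    pretty = minidom.parseString(ET.tostring(root)).toprettyxml(indent="\t")
    return pretty.replace(
        '<?xml version="1.0" ?>',
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>',
        1,
    )
```

With the real library, something like `to_pretty_xml(build_bible_xml(verses, nt_books, "NMB"))` would give the string to write to the file, assuming the returned verse objects expose those three attributes as the script expects.
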
jonathanvanschenck commented 1 year ago

Thanks for the request! I have some future plans to include more sites for scraping, but it'll probably be a while before that comes through. I'm also pretty open to PRs if you want to implement it yourself!

RayReddingtonx commented 1 year ago

Hi, scraping the NCB (New Catholic Bible) is not working, whether I create one book or several. I get some cache errors:

    chapter_verses = biblescrapeway.query(f"{book['bname']} {chapter}", version="NCB")
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 , line 69, in query

    _cache.cache( _range, _verses )
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  line 176, in cache
    query_key = self.generate_query_key( range_or_range_string, verse_list[0].version )
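
The traceback above is cut off before the exception itself, so the actual cause is not shown. One possibility, offered only as a guess: the last frame reads `verse_list[0]`, which would raise `IndexError` if no verses came back for the requested version. A minimal sketch for narrowing it down is to record failing chapters instead of aborting the whole run. Here `query_fn` stands for `biblescrapeway.query`, and `collect_verses` is a hypothetical helper, not part of the library.

```python
def collect_verses(query_fn, books, version):
    """Query every chapter, keeping track of the ones that fail or come back empty."""
    verses, failed = [], []
    for book in books:
        for chapter in range(1, book["chapters"] + 1):
            reference = f"{book['bname']} {chapter}"
            try:
                chapter_verses = query_fn(reference, version=version)
            except Exception as exc:  # broad on purpose: the real exception type is unknown
                failed.append((reference, repr(exc)))
                continue
            if not chapter_verses:
                failed.append((reference, "no verses returned"))
                continue
            verses.extend(chapter_verses)
    return verses, failed
```

Calling `collect_verses(biblescrapeway.query, nt_books, "NCB")` and printing `failed` should show which references trigger the error, which is the kind of context that helps when reporting it.
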
jonathanvanschenck commented 1 year ago

Can you provide the full context of what script / CLI command you are running?

RayReddingtonx commented 1 year ago

Same code. (As a separate issue, kindly add support for this site, which has the critical editions: https://www.academic-bible.com/en/online-bibles/novum-testamentum-graece-na-28/read-the-bible-text/bibel/text/ )

import biblescrapeway
import xml.etree.ElementTree as ET
import xml.dom.minidom as minidom

# Define the list of NT books with the number of chapters
nt_books = [
    {"bnumber": "40", "bname": "Matthew", "chapters": 28},
]

# Scrape the verses from the website for all NT books
verses = []
for book in nt_books:
    for chapter in range(1, book["chapters"] + 1):
        chapter_verses = biblescrapeway.query(f"{book['bname']} {chapter}", version="NCB")
        verses += chapter_verses

# Create a new XML tree for the NT books of the NCB version
xml_root = ET.Element("XMLBIBLE")
xml_root.set("xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance")
xml_root.set("biblename", "NCB")
for book in nt_books:
    book_node = ET.SubElement(xml_root, "BIBLEBOOK")
    book_node.set("bname", book["bname"])
    book_node.set("bnumber", book["bnumber"])

    # Initialize variables to keep track of chapter and verse numbers
    current_chapter = 0
    current_verse = 0
    chapter_node = None

    # Iterate over each verse object
    for verse in verses:
        # Check if the verse is in the current book and chapter
        if verse.book == book["bname"] and verse.chapter != current_chapter:
            current_chapter = verse.chapter
            current_verse = 0
            chapter_node = ET.SubElement(book_node, "CHAPTER")
            chapter_node.set("cnumber", str(current_chapter))

        # Increment verse number and add to XML tree
        if verse.book == book["bname"]:
            current_verse += 1
            verse_node = ET.SubElement(chapter_node, "VERS")
            verse_node.set("vnumber", str(current_verse))
            verse_node.text = verse.text.strip()

# Write the XML tree to a file with the desired indentation and XML declaration.
# toprettyxml() is called without an encoding so that it returns a str whose
# declaration is exactly '<?xml version="1.0" ?>', which the replace() below can match.
xml_string = minidom.parseString(ET.tostring(xml_root)).toprettyxml(indent="\t")
with open("test.xml", "w", encoding="utf-8") as f:
    f.write(xml_string.replace('<?xml version="1.0" ?>', '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>', 1))