scinfu / SwiftSoup

SwiftSoup: Pure Swift HTML Parser, with best of DOM, CSS, and jquery (Supports Linux, iOS, Mac, tvOS, watchOS)
https://scinfu.github.io/SwiftSoup/
MIT License
4.53k stars 345 forks

Parsing div html classes with SwiftSoup? #87

Closed cluelessoodles closed 6 years ago

cluelessoodles commented 6 years ago

I'm trying to use Alamofire and SwiftSoup to display some body text from a website.

The HTML that I need is in a div with a class, and for some reason SwiftSoup won't read it.

The outer div is <div class="translation-row">, with another <div class="t-english colorblue"> inside. When I try to parse it with SwiftSoup as shown below, I get no text. Is there a special way to parse ids with SwiftSoup? I am able to parse div classes.

My viewcontroller code is:

import UIKit
import Alamofire
import SwiftSoup

class ViewController: UIViewController {

    override func viewDidLoad() {
        super.viewDidLoad()

        let pageURL = "http://www.sikhnet.com/hukam"

        Alamofire.request(pageURL, method: .post, parameters: nil, encoding: URLEncoding.default).validate(contentType: ["application/x-www-form-urlencoded"]).response { (response) in

            if let data = response.data, let utf8Text = String(data: data, encoding: .utf8) {
                do {
                    let html: String = utf8Text
                    let doc: Document = try SwiftSoup.parse(html)

                    // Print the English line inside each translation row.
                    for lineRow in try doc.select("div.translation-row") {
                        print("------------------")
                        for englishLine in try lineRow.select("div.t-english.colorblue") {
                            print(try englishLine.text())
                        }
                    }
                } catch let error {
                    print(error.localizedDescription)
                }
            }
        }
    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }
}

Another issue I'm having that I'm not sure how to solve: part of the text is in a non-Latin script. Do I need to import a web font for that text, or will SwiftSoup parse it in the characters shown? I haven't successfully parsed that div, so I don't know if it will show up at all, and I wanted to ask the correct way to parse HTML text that is in non-Latin characters.
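On the non-Latin question, SwiftSoup works on ordinary Swift String values, so text that is encoded as Unicode comes through unchanged; a font only affects how it is displayed. One rough way to tell real Unicode Gurmukhi from Latin letters that a legacy font merely draws as Gurmukhi is to inspect the scalars. The sketch below assumes the text of interest is Gurmukhi (Unicode block U+0A00 to U+0A7F); the helper name and both sample strings are made up for illustration.

```swift
import Foundation

/// Returns true if any scalar of `text` lies in the Unicode Gurmukhi block (U+0A00...U+0A7F).
/// `containsGurmukhi` is a hypothetical helper, not part of SwiftSoup.
func containsGurmukhi(_ text: String) -> Bool {
    return text.unicodeScalars.contains { (0x0A00...0x0A7F).contains($0.value) }
}

// Three Gurmukhi code points written as Unicode escapes.
let unicodeSample = "\u{0A38}\u{0A24}\u{0A3F}"
// Illustrative only: plain Latin letters of the kind a legacy font would draw as Gurmukhi glyphs.
let legacySample = "siq nwmu"

print(containsGurmukhi(unicodeSample))  // true
print(containsGurmukhi(legacySample))   // false
```

If the scraped text fails this check, it would need a mapping from the legacy font encoding to Unicode before it can be shown with a system font.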

scinfu commented 6 years ago

Your URL has a space. Trying the URL "http://www.sikhnet.com/hukam", the source HTML does not have any element matching div.translation-row.
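For comparison, here is a minimal sketch of the same selectors run against markup that does contain those classes. The HTML string and the `englishLines` helper are invented for illustration; only the class names come from the question.

```swift
import SwiftSoup

// Hypothetical markup that mirrors the class names from the question.
let sampleHTML = """
<div class="translation-row">
  <div class="t-english colorblue">First line</div>
</div>
<div class="translation-row">
  <div class="t-english colorblue">Second line</div>
</div>
"""

/// Collects the text of every div.t-english.colorblue nested inside a div.translation-row.
func englishLines(in html: String) throws -> [String] {
    let doc = try SwiftSoup.parse(html)
    // Descendant selector: the inner div may sit at any depth below the row.
    let matches = try doc.select("div.translation-row div.t-english.colorblue")
    return try matches.array().map { try $0.text() }
}

do {
    print(try englishLines(in: sampleHTML))
} catch {
    print(error.localizedDescription)
}
```

With static markup like this the call should return both lines, which suggests the selectors in the question are fine and the elements are simply absent from the downloaded source.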

cluelessoodles commented 6 years ago

Thanks. The space in the URL only appeared when I copy-pasted my code here. It is all one line in my test app.

I first tried just .translation-row, then I added div in an attempt to clarify what kind of element it was.

I'm still doing it wrong, that much is obvious. Any idea what I might be doing wrong?

cluelessoodles commented 6 years ago

Not sure how often you check in on these, but I still need help with this. I'm able to get the h1 of the page, but not this specific div.

BMinas commented 6 years ago

You might get more help on https://stackoverflow.com/ for this type of problem. I don't see any issue with SwiftSoup. The Gurmukhi text does not appear to be Unicode, so yes, you will need to deal with that. I modified your code as follows:

if let data = response.data, let html = String(data: data, encoding: .utf8) {
    do {
        let doc: Document = try SwiftSoup.parse(html)

        // List the class and id of every div in the parsed document.
        let elements = try doc.getAllElements()
        for element in elements {
            switch element.tagName() {
            case "div":
                let className = try element.className()
                print("div: \(className)  id: \(element.id())")
            default:
                break
            }
        }
    } catch let error {
        print(error.localizedDescription)
    }
}

This will give you a list of all the divs with their classes and ids. I believe that the text you are trying to scrape is added dynamically.
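A quick way to test the "added dynamically" theory is to compare how many parsed elements carry the class with how many script blocks mention it. This is only a sketch: the sample page and the `diagnose` helper are hypothetical, and it assumes the class name appears literally in the script source.

```swift
import SwiftSoup

// Hypothetical page: the row markup exists only as a string inside a script.
let scriptedHTML = """
<html><body>
<h1>Title</h1>
<script>
  var row = '<div class="translation-row"><div class="t-english colorblue">Hello</div></div>';
  document.write(row);
</script>
</body></html>
"""

/// Counts parsed elements with `className` and script blocks whose source mentions it.
func diagnose(className: String, in html: String) throws -> (elementCount: Int, scriptMentions: Int) {
    let doc = try SwiftSoup.parse(html)
    // Elements present in the static document.
    let elementCount = try doc.getElementsByClass(className).size()
    // Script contents are exposed through data(), not text().
    let scriptMentions = try doc.select("script").array()
        .filter { $0.data().contains(className) }
        .count
    return (elementCount, scriptMentions)
}

do {
    let result = try diagnose(className: "translation-row", in: scriptedHTML)
    print("elements: \(result.elementCount), scripts mentioning it: \(result.scriptMentions)")
} catch {
    print(error.localizedDescription)
}
```

Zero elements with at least one script mention would indicate that the markup is generated by JavaScript, which an HTML parser on its own does not execute.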

scinfu commented 6 years ago

@cluelessoodles did you resolve this? @BMinas Thank you for the support.

cluelessoodles commented 6 years ago

@scinfu It turned out that the HTML I was trying to access was wrapped in a script on the web page, so I ended up using a WKScriptMessageHandler delegate to get the text.
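The thread does not include the final code, so the following is only a guess at what such an approach could look like, not the poster's actual solution. It loads the page in an off-screen WKWebView, lets the page's scripts run, and posts the text back through a message handler. The class `PageTextReader`, the handler name `pageText`, and the selector are assumptions, and `didFinish` may fire before late-running scripts have inserted their content.

```swift
import Foundation
import WebKit

/// Hypothetical helper: renders a page and reports the text of the matching divs.
final class PageTextReader: NSObject, WKScriptMessageHandler, WKNavigationDelegate {
    private var webView: WKWebView!
    var onText: (([String]) -> Void)?

    override init() {
        super.init()
        let config = WKWebViewConfiguration()
        // The content controller keeps a strong reference to its handler; call
        // removeScriptMessageHandler(forName:) when done to avoid a retain cycle.
        config.userContentController.add(self, name: "pageText")
        webView = WKWebView(frame: .zero, configuration: config)
        webView.navigationDelegate = self
    }

    func load(_ url: URL) {
        webView.load(URLRequest(url: url))
    }

    // Runs after the main frame finishes loading; the selector is an assumption from the question.
    func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
        let js = """
        var lines = Array.from(document.querySelectorAll('div.translation-row div.t-english'))
            .map(function (el) { return el.textContent.trim(); });
        window.webkit.messageHandlers.pageText.postMessage(lines);
        """
        webView.evaluateJavaScript(js, completionHandler: nil)
    }

    // Receives the array posted from the page's JavaScript context.
    func userContentController(_ userContentController: WKUserContentController,
                               didReceive message: WKScriptMessage) {
        if let lines = message.body as? [String] {
            onText?(lines)
        }
    }
}
```

A caller would keep a strong reference to the reader, set `onText`, and call `load(_:)` on the main thread. The same rendered HTML could also be passed to SwiftSoup if more parsing is needed.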

scinfu commented 6 years ago

Closed due to inactivity; re-open if necessary.