scinfu / SwiftSoup

SwiftSoup: Pure Swift HTML Parser, with best of DOM, CSS, and jquery (Supports Linux, iOS, Mac, tvOS, watchOS)
https://scinfu.github.io/SwiftSoup/
MIT License

Memory allocation issue with 2.7.3 #278

Closed dhritzkiv closed 2 months ago

dhritzkiv commented 2 months ago

Trying to parse certain URLs, such as this one, leads to a memory-related crash in 2.7.3. Downgrading to 2.7.2 avoids this.

Here's a reduced stack trace:

0  libswiftCore.dylib             0x45e274 swift::swift_slowAllocTyped(unsigned long, unsigned long, unsigned long long) (.cold.1) + 16
1  libswiftCore.dylib             0x3a7228 swift_slowDealloc + 106
2  libswiftCore.dylib             0x3a7684 _swift_allocObject_ + 1112
3  REDACTED                       0x8c17a8 specialized _ContiguousArrayBuffer.init(_uninitializedCount:minimumCapacity:) + 4378449832
4  REDACTED                       0x86df0c specialized _ArrayBuffer._consumeAndCreateNew(bufferIsUnique:minimumCapacity:growForAppend:) + 4378107660 (<compiler-generated>:4378107660)
5  REDACTED                       0x8c0828 StringBuilder.append(_:) + 4378445864
6  REDACTED                       0x8cc044 Tokeniser.emit(_:) + 96 (Tokeniser.swift:96)
7  REDACTED                       0x8ceedc TokeniserState.read(_:_:) + 517 (TokeniserState.swift:517)
8  REDACTED                       0x89d27c HtmlTreeBuilder.parseFragment(_:_:_:_:_:) + 46 (Tokeniser.swift:46)
9  REDACTED                       0x8b7a6c specialized static Parser.parseBodyFragment(_:_:) + 120 (Parser.swift:120)
10 REDACTED                       0x8b71e8 static Parser.parseBodyFragment(_:_:) + 4378407400 (<compiler-generated>:4378407400)

aehlke commented 2 months ago

Sorry, that was my regression. Can you give me a test case? I can't reproduce it with this:

    func testURLCrashRegression() throws {
        let html = """
            <!DOCTYPE html>
            <body>
                <a href="https://secure.imagemaker360.com/Viewer/95.asp?id=181293idxIDX&Referer=&referefull="></a>
            </body>
        """
        _ = try SwiftSoup.parse(html)
    }

This ran just fine.

aehlke commented 2 months ago

@dhritzkiv unless you can provide a repro test case, please try again with my related fixes here: https://github.com/scinfu/SwiftSoup/pull/276

dhritzkiv commented 2 months ago

@aehlke I should have clarified: trying to parse the contents of the page at the above URL led to the crash, not parsing markup that contains the URL.
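
To make the distinction concrete, here is a minimal sketch of what "parsing the contents of the URL" could look like. `parseContents(of:)` is a hypothetical helper, not part of SwiftSoup. The sketch assumes the page decodes as UTF-8, and it calls `SwiftSoup.parseBodyFragment` only because that is the entry point visible in the stack trace; the reporter's actual call site is not shown in the thread.

```swift
import Foundation
import SwiftSoup

// Hypothetical helper (not a SwiftSoup API): load the document body from a URL
// and parse it, rather than parsing markup that merely mentions the URL.
func parseContents(of url: URL) throws -> Document {
    // Assumption: the response is UTF-8. This throws if decoding fails.
    // A synchronous load keeps the sketch short; an app would use URLSession.
    let html = try String(contentsOf: url, encoding: .utf8)
    // Same top-level entry point that appears in the reported stack trace.
    return try SwiftSoup.parseBodyFragment(html)
}
```

Calling `try parseContents(of: url)` with the URL from the test case above would exercise the parser on the real page, which is the input that reportedly triggers the crash in 2.7.3.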

aehlke commented 2 months ago

@dhritzkiv I fixed an obvious mistake of mine that caused the regression. Please give the fix a try; I'm certain it will resolve your issue.

haIIux commented 2 months ago

I believe adding to this issue is better than opening a new one. I also ran into a recursion error when calling let htmlDocument = try SwiftSoup.parse(html). I am scraping over 200 websites on a 512 MB server, which ran without issue until I upgraded the package, first to 2.7.0 and later to 2.7.3.

Downgrading to 2.6.1 fixed this recursion issue. I'm not sure what I can provide to help, but I do have a saved Instruments log if that is useful.

aehlke commented 2 months ago

My PR fixes it, but it still needs another automated test for verification, which I haven't had time to add yet.

haIIux commented 2 months ago

Ah I understand! Sorry for the bother!

phuongcsa commented 2 months ago

I experienced the same issue with version 2.7.3: calling SwiftSoup.parse(html) makes memory usage grow without bound until the app crashes. I downgraded to version 2.7.2 to resolve the problem.

aehlke commented 2 months ago

@phuongcsa @haIIux @dhritzkiv I released the fix in 2.7.4 (https://github.com/scinfu/SwiftSoup/releases/tag/2.7.4), which also includes further UTF8View-based optimization. It's a huge improvement, and I hope it works well for everyone now. I added test coverage for this issue too.
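
For Swift Package Manager users, picking up the fixed release could look roughly like the manifest below. This is a sketch only: the package and target name `Scraper` is hypothetical, and the commented-out line shows the workaround reported earlier in the thread (pinning to 2.7.2) as an alternative.

```swift
// swift-tools-version:5.6
import PackageDescription

let package = Package(
    name: "Scraper", // hypothetical package name
    dependencies: [
        // Require the release containing the fix (2.7.4 up to, not including, 3.0.0).
        .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.7.4"),
        // Alternative reported in this thread: pin the last release before the regression.
        // .package(url: "https://github.com/scinfu/SwiftSoup.git", exact: "2.7.2"),
    ],
    targets: [
        .executableTarget(
            name: "Scraper", // hypothetical target name
            dependencies: [.product(name: "SwiftSoup", package: "SwiftSoup")]
        )
    ]
)
```

After editing the manifest, `swift package update` re-resolves the dependency to the requested version.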

phuongcsa commented 2 months ago

@aehlke could you also update the CocoaPods pod to the latest version? It is currently at 2.7.3.

aehlke commented 2 months ago

@phuongcsa Sorry, can you submit a PR? I don't know or use CocoaPods.