mischov / meeseeks

An Elixir library for parsing and extracting data from HTML and XML with CSS or XPath selectors.
MIT License

Moving large XML results to a different process causes Elixir to use all memory and crash #116

Open joshuataylor opened 3 years ago

joshuataylor commented 3 years ago

Elixir: 1.12.1
Erlang: 24.0.2

When using Flow or IEx.pry, Elixir will eat all my memory and crash.

Stream.map and Enum.map work as normal.

Repo example:

```
git clone https://github.com/joshuataylor/meeseeks_xml_example
cd meeseeks_xml_example
gzip -d Wikipedia-20210626124344.xml.gz # the demo XML file, a random dump of Australian politicians :-)
```

Now edit lib/meeseeks_xml.ex so it points at the path of the XML file.

Then run:

```
iex -S mix
```

:skull: The next step will cause Elixir to eat all your memory and might cause your computer to crash, so be careful! :skull:

Type the following into iex:

```elixir
MeeseeksXml.example1()
```
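For context, here is a hypothetical sketch of the kind of pipeline that triggers the crash. The real `example1/0` lives in the linked repo; the selectors and file path here are assumptions. The key point is that each `Meeseeks.Result` carries a reference to the parsed document, and handing the results to Flow copies them into worker processes:

```elixir
defmodule MeeseeksXml do
  import Meeseeks.XPath

  # Hypothetical sketch -- not the actual repo code.
  def example1 do
    doc = Meeseeks.parse(File.read!("Wikipedia-20210626124344.xml"), :xml)

    doc
    |> Meeseeks.all(xpath("//page"))
    |> Flow.from_enumerable()                      # results cross process boundaries here
    |> Flow.map(&Meeseeks.one(&1, xpath("./title")))
    |> Enum.to_list()
  end
end
```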
mischov commented 3 years ago

Thank you for the report.

I believe I've confirmed what you said:

```
iex(15)> IEx.pry
eheap_alloc: Cannot allocate 47021480096 bytes of memory (of type "heap_frag").
```

Do you believe that this is a flaw in Meeseeks, a flaw in pry, or an incompatibility between the two?

To confirm, you aren't having any problem parsing that file with Meeseeks, right? You can parse it ok and select all the pages, it's just when you try to pry that there's an issue? I see you mention a problem with Flow, too, can you describe what that problem is?

To share what I saw:

  1. I started iex, called `:erlang.memory()`, and saw a total usage of 34485976 bytes
  2. I read the 22.8MB unzipped XML file with `f = File.read!(...)`
  3. I called `:erlang.memory()` and saw a total usage of 57654072
  4. I parsed the file into a Meeseeks.Document with `d = Meeseeks.parse(f, :xml)`
  5. I called `:erlang.memory()` and saw a total usage of 107529424
  6. I searched for all the matches with `r = Meeseeks.all(d, xpath("//page"))` (found 2235 matches)
  7. I called `:erlang.memory()` and saw a total usage of 117366464
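The steps above can be replayed in iex roughly like this (the file path is an assumption; `:erlang.memory(:total)` returns just the total-usage figure quoted in each step):

```elixir
import Meeseeks.XPath

:erlang.memory(:total)                  # baseline
f = File.read!("Wikipedia-20210626124344.xml")
:erlang.memory(:total)                  # + the raw ~22.8MB string
d = Meeseeks.parse(f, :xml)
:erlang.memory(:total)                  # + the parsed document (~2.5x the string)
r = Meeseeks.all(d, xpath("//page"))    # 2235 matches
:erlang.memory(:total)                  # + the matches (~10MB more)
```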

So up to this point, parsing the XML document produces a Meeseeks.Document that appears to use about 2.5x the space of the raw string, plus another 10MB or so for the matches. This is a lot of memory, but not unexpected (see #31), since the document has to capture the whole XML string plus a bunch of maps, parent information, etc.

Then I call IEx.pry, the CPU spins up to max, and eventually the BEAM shuts down since I don't have 48GB of memory on hand for it. I am not sure what pry is doing, but it seems wild to take 117MB of total usage and turn it into 48GB.

mischov commented 3 years ago

Further exploration suggests that if I call IEx.pry after I parse the document (step 4 above) it works fine, and takes the total memory usage from 107190608 to 156387424 (it's adding almost 50% more memory usage).

Meeseeks.Results (returned by Meeseeks.all) each contain the Meeseeks.Document the match was made against, but since they all point to the same immutable value (the original parsed document), I don't think any additional memory is used. Perhaps pry is doing something that causes the Meeseeks.Document in each match (all 2235 of them) to take up its own memory? Or something else per match that makes them significantly larger? I'm not sure.
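One way to test this sharing hypothesis is with the (internal, undocumented) `:erts_debug` module: `:erts_debug.size/1` counts shared subterms once, while `:erts_debug.flat_size/1` counts them as they would be after being copied to another process. A sketch, using `r` from step 6 above:

```elixir
# If all 2235 results share one document, shared size stays modest
# while the flattened size is roughly 2235x the document size --
# the size the term balloons to when copied across a process boundary.
shared = :erts_debug.size(r)
flat = :erts_debug.flat_size(r)
IO.puts("shared: #{shared} words, flattened: #{flat} words")
```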

Maybe @josevalim has some idea of what pry is doing here?

josevalim commented 3 years ago

Pry is most likely starting a separate process and that is causing excessive copying? Does the issue happen if you spawn a task and then access the meeseeks document inside? Flow would have the same issue with copying too.

mischov commented 3 years ago

Yep, that's it. Running a `Task.async(fn -> Enum.count(r) end)` after step 7 above causes the same "cannot allocate" crash. It looks like when the results are copied to the new process, the document in each result takes up its own memory (or something similar).

josevalim commented 3 years ago

Yes, you are correct. Here is a simple test I wrote:

```elixir
IO.inspect(:erlang.memory())
data = List.duplicate('a', 1_000_000)
IO.inspect(:erlang.memory())
list = List.duplicate(data, 100)

Task.await(Task.async(fn ->
  List.flatten(list)
  IO.inspect(:erlang.memory())
  :ok
end))
```

You could ask the Erlang team why each reference is copied. Perhaps it is expensive to track references like that, but it is certainly the root cause.

You can fix this by changing Meeseeks' internal structure. It seems you are referencing the same document over and over again, and some indirection would address this. For example, you could refer to a number and have a map from number => document.
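A minimal sketch of that indirection idea (the names here are assumptions, not the Meeseeks API): documents live in one owner process keyed by an integer, and results only need to carry the small integer key across process boundaries.

```elixir
defmodule DocStore do
  # Sketch: an Agent holding a map of id => document. Results would
  # store only the id, which is cheap to copy between processes.
  use Agent

  def start_link(_opts) do
    Agent.start_link(fn -> %{next: 0, docs: %{}} end, name: __MODULE__)
  end

  # Store a document and return its id.
  def put(doc) do
    Agent.get_and_update(__MODULE__, fn %{next: id, docs: docs} ->
      {id, %{next: id + 1, docs: Map.put(docs, id, doc)}}
    end)
  end

  # Look a document back up by id.
  def fetch(id), do: Agent.get(__MODULE__, &Map.fetch!(&1.docs, id))
end
```

Note the trade-off: fetching from the Agent still copies the document to the caller, so this swaps many per-result copies for one copy per lookup; extraction would ideally stay in the process that owns the document.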

Alternatively, you can work with binaries: large binaries are referenced instead of copied. If you replace `data = List.duplicate('a', 1_000_000)` in my example with `data = String.duplicate("a", 1_000_000)`, it is super fast.
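For comparison, the binary variant of the test above looks like this (a sketch): binaries larger than 64 bytes are reference-counted and stored off-heap, so only a small handle is copied between processes.

```elixir
data = String.duplicate("a", 1_000_000)  # one refc binary, stored off-heap
list = List.duplicate(data, 100)         # 100 references to the same binary

Task.await(Task.async(fn ->
  # Only the list spine and 100 small binary handles are copied into
  # this process, not 100 copies of the 1MB of character data.
  IO.inspect(:erlang.memory(:total))
  :ok
end))
```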

mischov commented 3 years ago

Thank you! I was just in the process of writing a similar List.duplicate test to try and understand what was going on.

That's actually a bit surprising: I incorrectly assumed that if a value were sent to a single process it would just be copied once and then referenced, maintaining a memory usage pattern similar to the original process's.

I definitely want to figure out how to address the problem, but I don't know how I can do that via a binary, since parsing the HTML/XML in the first place is the most expensive part, and I'm not sure how I could represent the document as a binary without an expensive dehydrate/rehydrate step.

This will definitely require some thought.

mischov commented 3 years ago

I have always been resistant to returning the document from Rustler as a NIF resource instead of as Elixir data, since I appreciated the freedom of being able to see and manipulate the data directly from Elixir. But moving to a NIF resource might be one potential solution, because I believe resources are treated similarly to binaries and not copied.

joshuataylor commented 3 years ago

Wow! Thanks so much for the feedback and rapid investigation!

Given this is an Erlang issue, should it be documented as something that might happen if you pass results to another process, until a fix can be figured out? I'm not sure what the best solution would be; in the meantime, what I'm doing is moving the XML parsing into its own step, then passing that output to Flow (as you should be doing :-)).
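The workaround described above can be sketched as extracting plain binaries from the matches before anything crosses a process boundary (the path, selector, and per-page work are assumptions):

```elixir
import Meeseeks.XPath

doc = Meeseeks.parse(File.read!("Wikipedia-20210626124344.xml"), :xml)

doc
|> Meeseeks.all(xpath("//page"))
|> Enum.map(&Meeseeks.text/1)   # extract binaries while still in this process
|> Flow.from_enumerable()       # only cheap refc binaries are copied to workers
|> Flow.map(&String.upcase/1)   # whatever per-page work is needed
|> Enum.to_list()
```

Because the `Meeseeks.Result` structs (and the document they share) never leave the parsing process, the pathological copying never happens.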

mischov commented 3 years ago

I definitely need to document the issue and provide warnings in the appropriate places.

It's a pretty bad problem, but I'm not entirely surprised it took so long to surface: you need to be parsing a pretty big document, running a selection that returns a lot of matches, and then moving those matches to another process rather than extracting from them first and moving the extraction results instead.

The BEAM's behavior here surprises me, but I'm not sure I'd call it an Erlang issue; I don't know why they designed it that way, but there is probably a reason. Really, it's an issue with my design that a certain arrangement of basic actions can balloon memory so dramatically, even if that arrangement is fairly uncommon, and so it's on me to fix the design. Somehow.