Open jankowtf opened 9 years ago
As a Windows user, this bug is making it impractical for me to web scrape using R and has cost me hours. I wish I could identify the problem myself and push a commit. It looks like many people are running into this, for instance http://stackoverflow.com/questions/31999766/r-memory-issues-while-webscraping-with-rvest. Just wanted to cosign what @rappster said above: a fix would be tremendous and would enable the use of many dependent packages.
Because of this bug I have migrated from R to Python for web crawling. I now do only small scraping jobs in R; all of my large-scale crawlers are developed in Python using requests and lxml.
As far as I am aware, this is an issue only with the Windows build from CRAN (and other sources), due to preprocessor flags not being set in the Makevars.win. The package has been doing garbage collection, both generally and for complex situations, for over a decade. I put a new binary build for R-3.5.0 on the repository www.omegahat.net/R this morning. Also, the GitHub repository now uses Jeroen Ooms's RWinLib/libxml2 binary libraries.
Hi Duncan,
it's been a while, so I thought I'd check back on whether you have found out anything about the cause of the memory leak when using XML on Windows. I'm sure that you have a thousand more interesting things to do, but I would very much appreciate it if you could fix this bug. It keeps coming back to me and slows down all of my web scraping efforts. And given that more and more cool packages are emerging that depend on your package (e.g. RSelenium or rvest), this issue propagates to all of them as well.
Thank you so much, Janko
Here is a slightly updated version of my investigations:
Preliminaries
Functions
Memory status before anything has ever been requested
Generate additional offline example content
Profiling