Open alanpaulkwan opened 7 years ago
Another thought was to parallelize the use of tabulizer so that each parallel instance manages destroying the objects in its own heap. However, when using mclapply, foreach, or the parallel package, tabulizer doesn't behave properly. Any thoughts would be interesting. Let me know if I need to provide an example.
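For what it's worth, here is a minimal sketch of that parallel idea, assuming tabulizer::extract_tables and a hypothetical folder of PDFs. It uses a PSOCK cluster so each worker is a separate R process with its own JVM, whereas forking (mclapply) shares the parent's JVM, which rJava tends to handle poorly:
library(parallel)
# Hypothetical input: a folder of PDFs to process.
pdf_files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
# PSOCK workers are fresh R sessions, so each one loads rJava and gets its
# own JVM heap, which disappears when the cluster is stopped.
cl <- makeCluster(4)
clusterEvalQ(cl, {
  options(java.parameters = "-Xmx2g")  # per-worker heap limit, set before rJava loads
  library(tabulizer)
})
tables <- parLapply(cl, pdf_files, function(f) extract_tables(f))
stopCluster(cl)  # the worker processes, and their JVM heaps, go away here
Whether this actually releases memory sooner depends on how often workers are recycled, so treat it as a starting point rather than a fix.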
It seems possible something is not getting cleared. Have you been able to identify at all where the issue is coming from specifically?
I have the same problem. I'm using a for loop to extract one page at a time from a large PDF, saving each page as an .rds file, and eventually I run out of hard disk space. The .rds files themselves are small, less than 3 kB each, but the space used is a lot larger: for 726 extracted pages I went from 37.3 GB of free space on my hard disk to 29.6 GB. I recover all of the space after restarting R, though.
Just commenting to say I'm experiencing the same behavior.
I have a 200-page PDF filled mostly with tables that I want to convert to R data frames. But even after increasing the heap space to 16 GB, I run into a memory issue after only 7-8 pages.
Once the tables are imported into R, I don't need any of the Java objects, and I imagine most use cases are similar. So adding a way to purge those automatically/immediately after extraction would be a desirable feature.
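In the meantime, here is a rough per-page workaround sketch, assuming tabulizer is used as above; the file name and page count are placeholders, and the explicit cleanup calls are only an assumption about what might help, not a confirmed fix:
# The heap size must be set before rJava starts the JVM,
# i.e. before library(tabulizer) is loaded.
options(java.parameters = "-Xmx16g")
library(tabulizer)
results <- vector("list", 200)
for (p in seq_along(results)) {
  tbls <- extract_tables("tables.pdf", pages = p)
  results[[p]] <- lapply(tbls, as.data.frame)  # keep only plain R objects
  rm(tbls)
  gc()  # run R's GC so rJava can release its references to the Java objects
}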
Same issue. I have a 466-page PDF (and each page is a table).
The Camelot Python package does the job for me, but it takes a long time to run (10-20 minutes). Here's how I did it using reticulate:
library(dplyr)
library(reticulate)
# Create a conda environment with pip and run `pip install camelot-py[cv]` to install the proper version
# of camelot.
reticulate::use_condaenv("envname")
camelot = import("camelot")
# Returns a `TableList` object. Each element is a Table object with a .df attribute that returns a data
# frame.
#
# "lattice" flavor didn't work in this case, but "stream" did.
cam = camelot$read_pdf("tables.pdf", pages = "all", flavor = "stream")
# TableList can't be converted to list or vector, and *apply() and purrr functions don't work on it. So
# use a for loop to pluck each data frame element and add it to a list.
len = cam$n
lst = vector("list", length = len)
# lst ends up being 7.7 MB in size
for (i in 1:len) {
  # Remember: Indexes in Python start at 0
  lst[[i]] = cam[[i - 1]]$df
}
# Combine data frames in list into one big data frame
dat = lst %>%
  bind_rows() %>%
  # Camelot returns non-syntactic column names. Fix them.
  as_tibble(.name_repair = "universal")
Here's the equivalent Python code (minus the column name repair), run in the aforementioned conda environment:
import camelot
import os
import pandas as pd
cam = camelot.read_pdf("tables.pdf", pages="all", flavor="stream")
# No column name repair
df = pd.concat(cam[i].df for i in range(cam.n))
Facing the exact same issue, for what seems to be the same reason: too many extract_table calls (if I do only one call, I get garbage, so I have to scan my tables column by column, which in turn causes the error).
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.OutOfMemoryError: GC overhead limit exceeded
The easy fix is to restart R, but that's not very practical, to be honest. Hopefully there can be a proper fix.
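One workaround sketch for the many-small-calls pattern, assuming the callr package is acceptable: run each call in a throwaway R process so its JVM heap is discarded when the child exits. The file name, page, and area below are placeholders:
library(callr)
# Each call gets a fresh R session and therefore a fresh JVM; its memory is
# returned to the OS when the child process exits.
extract_in_subprocess <- function(file, page, area) {
  callr::r(function(file, page, area) {
    library(tabulizer)
    extract_tables(file, pages = page, area = list(area), guess = FALSE)
  }, args = list(file = file, page = page, area = area))
}
tbl <- extract_in_subprocess("tables.pdf", page = 1, area = c(100, 50, 700, 300))
Spawning a process per call adds overhead, so this only pays off when the calls are numerous enough to exhaust the heap otherwise.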
FYI, the XLConnect package has a command to free up Java memory: XLConnect::xlcFreeMemory(). Maybe something similar could be implemented, please?
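For context, XLConnect's helper essentially drops stale references and then asks the JVM to run its garbage collector. A rough user-level equivalent on top of rJava might look like this sketch (an assumed workaround, not part of the tabulizer API; System.gc() is only a request that the JVM may ignore):
library(rJava)
free_java_memory <- function() {
  gc()                                   # run R's GC so rJava releases Java references
  .jcall("java/lang/System", "V", "gc")  # then ask the JVM to collect
  invisible(NULL)
}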
Dear Tabulizer team,
When extracting hundreds of PDFs, is there a good way to clear memory? The memory use keeps growing and I assume this is due to unreleased objects floating around in the heap.