ropensci / tabulapdf

Bindings for Tabula PDF Table Extractor Library
https://docs.ropensci.org/tabulapdf/
Apache License 2.0

memory issues #39

Open alanpaulkwan opened 7 years ago

alanpaulkwan commented 7 years ago

Dear Tabulizer team,

When extracting hundreds of PDFs, is there a good way to clear memory? The memory use keeps growing and I assume this is due to unreleased objects floating around in the heap.
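For context, the usage pattern is roughly the following (the folder and file names are placeholders, not my actual script):

library(tabulizer)

# Placeholder folder of several hundred PDFs
pdf_files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)

# Each individual result is small, but total memory use keeps climbing
all_tables <- lapply(pdf_files, extract_tables)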

alanpaulkwan commented 7 years ago

Another thought was to parallelize tabulizer so that each parallel instance manages its own heap and destroys its objects when it finishes. However, with mclapply, foreach, or parallel, tabulizer doesn't behave properly. Any thoughts would be welcome. Let me know if I should provide an example; a sketch of what I have in mind is below.
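This is a sketch only, assuming the callr package (the helper name is my own): run each extraction in a throwaway R session, so the JVM and its heap are discarded by the OS when the child process exits.

library(callr)

# Each call starts a clean R session with its own JVM; memory is
# reclaimed when the child process exits.
extract_in_child <- function(f) {
  callr::r(function(path) tabulizer::extract_tables(path),
           args = list(path = f))
}

tables <- lapply(pdf_files, extract_in_child)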

leeper commented 7 years ago

It seems possible that something is not getting cleared. Have you been able to identify where, specifically, the issue is coming from?

csmontt commented 7 years ago

I have the same problem. I'm using a for loop to extract one page at a time from a large PDF, saving each page as an .rds file, and eventually I run out of hard disk space. The .rds files themselves are small, less than 3 KB each, but the space used is much larger: after 726 extracted pages, my free hard disk space went from 37.3 GB to 29.6 GB. I recover all of it after restarting R, though.
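The loop is roughly the following (the file name is a placeholder):

library(tabulizer)

n_pages <- get_n_pages("big_document.pdf")

for (i in seq_len(n_pages)) {
  page <- extract_tables("big_document.pdf", pages = i)
  saveRDS(page, file = paste0("page_", i, ".rds"))
  rm(page)
  gc()  # R-side cleanup only; the Java heap is managed separately
}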

JakeRuss commented 6 years ago

Just commenting to say I'm experiencing the same behavior.

I have a 200-page PDF, filled mostly with tables, that I want to convert to R data frames. But even after increasing the heap space to 16 GB, I run into a memory issue after only 7-8 pages.

Once the tables are imported into R, I don't need any of the Java objects, and I imagine most use cases are similar. So adding a way to purge those automatically/immediately after extraction would be a desirable feature.
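For reference, this is roughly how I bumped the heap (the file name is a placeholder; the key point is that java.parameters has to be set before rJava is loaded):

# Must be set before rJava loads, otherwise it has no effect
options(java.parameters = "-Xmx16g")
library(tabulizer)

# The memory issue appears after only 7-8 pages
tables <- lapply(1:200, function(p) extract_tables("report.pdf", pages = p))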

SteadyGiant commented 4 years ago

Same issue. I have a 466-page PDF (and each page is a table).

The Camelot Python package does the job for me, but it takes a long time to run (10-20 minutes). Here's how I did it using reticulate:

library(dplyr)
library(reticulate)

# Create a conda environment with pip and run `pip install camelot-py[cv]` to install the proper version
# of camelot.
reticulate::use_condaenv("envname")
camelot = import("camelot")

# Returns a `TableList` object. Each element is a Table object with a .df attribute that returns a data
# frame.
#
# "lattice" flavor didn't work in this case, but "stream" did.
cam = camelot$read_pdf("tables.pdf", pages = "all", flavor = "stream")
# TableList can't be converted to list or vector, and *apply() and purrr functions don't work on it. So
# use a for loop to pluck each data frame element and add it to a list.
len = cam$n
lst = vector("list", length = len)
# lst ends up being 7.7 MB in size
for (i in 1:len) {
  # Remember: Indexes in Python start at 0
  lst[[i]] = cam[[i-1]]$df
}
# Combine data frames in list into one big data frame
dat = lst %>%
  bind_rows() %>%
  # Camelot returns non-syntactic column names. Fix them.
  as_tibble(.name_repair = "universal")

Here's the equivalent Python code (minus the column name repair), run in the aforementioned conda environment:

import camelot
import pandas as pd

cam = camelot.read_pdf("tables.pdf", pages="all", flavor="stream")
# Concatenate each table's data frame; no column name repair here
df = pd.concat(tbl.df for tbl in cam)

tchevri commented 8 months ago

Facing the exact same issue, and for what seems to be the same reason: too many extract_tables calls. (If I do only one call I get garbage, so I have to scan my tables column by column, which in turn causes the error.)

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.OutOfMemoryError: GC overhead limit exceeded

The easy fix is to restart R, but that's not very practical tbh... Hopefully this can be fixed. FYI, the XLConnect package has a command to free up Java memory:

XLConnect::xlcFreeMemory()

Could something similar be implemented here, please? A sketch of what I have in mind is below.
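A rough sketch of what such a helper might look like, assuming it simply asks the JVM to run garbage collection via rJava (this is a guess at an implementation, not how XLConnect actually does it):

# Hypothetical helper; it only requests a collection, it cannot force one
free_java_memory <- function() {
  gc()                               # R-side garbage collection
  rJava::J("java.lang.System")$gc()  # ask the JVM to collect as well
  invisible(NULL)
}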